feat: add Spark-compatible math functions (bround, greatest, least, hex, unhex) by XuQianJin-Stars · Pull Request #7122 · Eventual-Inc/Daft

XuQianJin-Stars · 2026-06-13T03:43:00Z

Changes Made

This PR adds five Spark-compatible math functions to Daft, callable from
both the Python DataFrame API and Daft SQL. All semantics match Spark
exactly so that workloads migrating from Spark / PySpark can run
unchanged.

Functions added

Function	Signature	Behavior
`bround`	`bround(expr, d=0)`	HALF_EVEN (banker's) rounding. Supports negative `d`. NULL passthrough.
`greatest`	`greatest(*exprs)`	N-ary row-wise max, skipping NULLs. Returns NULL iff all inputs are NULL.
`least`	`least(*exprs)`	N-ary row-wise min, skipping NULLs. Returns NULL iff all inputs are NULL.
`hex`	`hex(expr)`	Integer / String / Binary → uppercase hex string. Negatives use 64-bit two's-complement.
`unhex`	`unhex(expr)`	Hex string → binary. Odd-length input is left-padded with `'0'`. Invalid input → NULL.

Implementation details

Rust core under src/daft-functions/src/:
- numeric/bround.rs — uses f64::round_ties_even to bit-exactly
  match Spark's HALF_EVEN mode; integer inputs short-circuit.
- numeric/hex.rs — Hex and Unhex ScalarUDFs covering Int*/UInt*/
  Utf8/Binary, with two's-complement encoding for negative integers.
- greatest.rs — single ScalarUDF reused for both greatest and
  least via a is_least flag; computes the supertype of all inputs
  and skips NULLs row-wise.
Registration: auto-registered as ScalarUDFs in NumericFunctions
and the global function registry, so SQL picks them up for free
(SELECT bround(x, 2), SELECT greatest(a, b, c), etc.).
Python: wrappers in daft/functions/numeric.py, exported from
daft.functions so users can do
from daft.functions import bround, greatest, least, hex, unhex.
Banker's rounding: relies on Rust 1.77+ round_ties_even
(already in the project's MSRV).

Behavior examples

from daft.functions import bround, greatest, least, hex, unhex
from daft import lit, col

df.select(bround(lit(2.5)))         # -> 2.0   (HALF_EVEN, not 3.0)
df.select(bround(lit(3.5)))         # -> 4.0
df.select(bround(lit(1.235), 2))    # -> 1.24

df.select(greatest(col("a"), col("b"), col("c")))  # NULL-skipping
df.select(least(col("a"), col("b")))               # type-promoted

df.select(hex(lit(-1)))             # -> "FFFFFFFFFFFFFFFF"
df.select(hex(lit("abc")))          # -> "616263"
df.select(unhex(lit("F")))          # -> b"\x0f"  (odd-length padded)
df.select(unhex(lit("zz")))         # -> NULL     (invalid)

Verification

✅ cargo check (workspace) — no errors
✅ cargo fmt --check (workspace) — clean
✅ cargo clippy -p daft-functions — no errors, no new warnings
✅ Python wrappers import-tested via daft.functions

Related Issues

Closes #7121

greptile-apps · 2026-06-13T03:47:57Z

Greptile Summary

Adds five Spark-compatible scalar functions — bround, greatest, least, hex, and unhex — as Rust ScalarUDF implementations registered in both the numeric function module and the global SQL registry, with Python wrappers exported from daft.functions.

bround uses f64::round_ties_even for banker's rounding and correctly short-circuits integers by promoting them to float first; negative precision is supported but uses a non-representable f64 reciprocal multiplier that could introduce avoidable rounding error on some inputs.
greatest/least share a single compare_inputs helper for NULL-skipping row-wise min/max, but NaN values in float columns are not handled specially — Spark treats NaN as greater than any other float, while IEEE 754 comparisons (used here) make NaN invisible, producing different results.
hex/unhex implement byte-level encoding cleanly with correct two's-complement handling for negative integers and proper odd-length left-padding for unhex.

Confidence Score: 3/5

The hex/unhex and bround implementations are functionally solid for most inputs, but greatest/least silently produces wrong results for float columns containing NaN, contradicting the Spark-compatible guarantee this PR advertises.

The greatest/least implementation uses raw IEEE 754 comparisons which return false when either operand is NaN. Spark explicitly documents NaN as the maximum value in greatest/least — any float column with NaN values will produce different answers here than in Spark. This is a quiet data-correctness divergence in a feature explicitly positioned as a Spark compatibility layer.

src/daft-functions/src/greatest.rs needs the most attention — specifically the compare_inputs function and its treatment of NaN in float comparisons.

Important Files Changed

Filename	Overview
src/daft-functions/src/greatest.rs	New file implementing Greatest and Least ScalarUDFs; logic is sound for non-NaN inputs but NaN handling diverges from Spark semantics, and two separate structs could be one parametrised type.
src/daft-functions/src/numeric/bround.rs	New bround ScalarUDF using f64::round_ties_even; negative-precision path uses a non-representable f64 reciprocal multiplier which may introduce avoidable rounding errors for some inputs.
src/daft-functions/src/numeric/hex.rs	New Hex/Unhex ScalarUDFs; integer two's-complement encoding, string/binary byte encoding, and odd-length unhex padding all look correct and well-tested.
daft/functions/numeric.py	Python wrappers for all five new functions; the exported hex name shadows Python's built-in, which should be documented.
daft/functions/init.py	Correct alphabetical placement of all five new exports in both the import block and all.
src/daft-functions/src/numeric/mod.rs	BRound, Hex, and Unhex correctly registered with NumericFunctions; mod declarations added.
src/daft-functions/src/lib.rs	Adds greatest module declaration; Greatest and Least are registered directly, consistent with the coalesce pattern.
src/lib.rs	Greatest and Least added to the global function registry alongside existing Coalesce registration; straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PY["Python: daft.functions\nbround / greatest / least / hex / unhex"]
    REG["Global FunctionRegistry\n(src/lib.rs)"]
    NUM["NumericFunctions module\n(numeric/mod.rs)"]
    SQL["SQL engine\n(auto-resolves by name)"]

    PY -->|"_call_builtin_scalar_fn(name, ...)"| REG
    REG --> SQL

    BRound["BRound ScalarUDF\nnumeric/bround.rs\nf64::round_ties_even"]
    HEX["Hex / Unhex ScalarUDF\nnumeric/hex.rs\nbytes_to_hex_upper / decode_hex_padded"]
    GL["Greatest / Least ScalarUDF\ngreatest.rs\ncompare_inputs(keep_greater)"]

    NUM --> BRound
    NUM --> HEX
    REG --> GL

    BRound -->|"series_bround()"| BOUT["Float output"]
    HEX -->|"hex_impl / unhex_impl"| HOUT["Utf8 / Binary output"]
    GL -->|"try_get_collection_supertype + if_else"| GLOUT["Supertype output\nNULL if all-NULL row"]

Comments Outside Diff (2)

src/daft-functions/src/greatest.rs, line 295-313 (link)

NaN handling differs from Spark semantics

In Spark, NaN is treated as larger than any other numeric value for greatest/least on floating-point columns (documented in org.apache.spark.sql.catalyst.expressions.Greatest). The current implementation uses standard IEEE 754 comparisons (gt/lt), where NaN > x and NaN < x both return false. As a result, greatest(1.0, NaN) returns 1.0 here but NaN in Spark, silently diverging from the advertised Spark-compatible semantics whenever a column contains NaN values.
daft/functions/numeric.py, line 76-93 (link)

hex shadows Python's built-in

hex is a Python built-in function. Exporting it from daft.functions means any user who writes from daft.functions import hex or from daft.functions import * will silently lose access to Python's built-in hex. Consider adding a note to the docstring warning about this shadowing, or using the qualified name daft.functions.hex in examples.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (1): Last reviewed commit: "feat: add Spark-compatible math function..." | Re-trigger Greptile}

greptile-apps · 2026-06-13T03:48:01Z

+#[derive(Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
+pub struct Greatest;
+
+#[typetag::serde]
+impl ScalarUDF for Greatest {
+    fn name(&self) -> &'static str {
+        "greatest"
+    }
+
+    /// Returns the largest of the input values per row, ignoring NULLs.
+    /// Returns NULL only when all inputs in a row are NULL. Mirrors Spark's
+    /// `Greatest` semantics (`org.apache.spark.sql.catalyst.expressions.Greatest`).
+    fn call(
+        &self,
+        inputs: FunctionArgs<Series>,
+        _ctx: &daft_dsl::functions::scalar::EvalContext,
+    ) -> DaftResult<Series> {
+        compare_inputs(inputs, /* keep_greater = */ true)
+    }
+
+    fn get_return_field(
+        &self,
+        inputs: FunctionArgs<ExprRef>,
+        schema: &Schema,
+    ) -> DaftResult<Field> {
+        common_return_field(self.name(), inputs, schema)
+    }
+
+    fn docstring(&self) -> &'static str {
+        "Returns the largest value among the inputs, skipping NULL values. \
+         Returns NULL only if all inputs are NULL. Requires at least one argument."
+    }
+}
+
+#[derive(Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
+pub struct Least;
+
+#[typetag::serde]
+impl ScalarUDF for Least {
+    fn name(&self) -> &'static str {
+        "least"
+    }
+
+    /// Returns the smallest of the input values per row, ignoring NULLs.
+    /// Returns NULL only when all inputs in a row are NULL. Mirrors Spark's
+    /// `Least` semantics (`org.apache.spark.sql.catalyst.expressions.Least`).
+    fn call(
+        &self,
+        inputs: FunctionArgs<Series>,
+        _ctx: &daft_dsl::functions::scalar::EvalContext,
+    ) -> DaftResult<Series> {
+        compare_inputs(inputs, /* keep_greater = */ false)
+    }
+
+    fn get_return_field(
+        &self,
+        inputs: FunctionArgs<ExprRef>,
+        schema: &Schema,
+    ) -> DaftResult<Field> {
+        common_return_field(self.name(), inputs, schema)
+    }
+
+    fn docstring(&self) -> &'static str {
+        "Returns the smallest value among the inputs, skipping NULL values. \
+         Returns NULL only if all inputs are NULL. Requires at least one argument."
+    }
+}


Two separate structs where one parametrised struct would suffice

Greatest and Least are structurally identical — the only difference is which comparison direction is passed to compare_inputs. Per the codebase's convention, similar functionality with different variants should be expressed as a single parametrised type (e.g. struct GreatestLeast { is_greatest: bool }) rather than two separate structs. The internal helper compare_inputs(inputs, keep_greater) already has the right shape; the public structs just add boilerplate.

Rule Used: Prefer single parametrized functions over multiple... (source)

Learned From
Eventual-Inc/Daft#5207

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-13T03:48:02Z

+fn f64_bround(arr: &Float64Array, precision: i32) -> DaftResult<Float64Array> {
+    if precision == 0 {
+        arr.apply(|v| v.round_ties_even())
+    } else {
+        let multiplier: f64 = 10.0f64.pow(precision);
+        arr.apply(|v| (v * multiplier).round_ties_even() / multiplier)
+    }


Negative precision uses a non-representable f64 multiplier

For negative precision (e.g. -2), multiplier = 10.0f64.pow(-2) = 0.01, which is not exactly representable in f64. The round-trip (v * 0.01).round_ties_even() / 0.01 can accumulate two rounding errors (once on multiply, once on divide). Using integer division/multiplication — (v / 10.0f64.powi(-precision)).round_ties_even() * 10.0f64.powi(-precision) — avoids multiplying by a non-representable reciprocal and is the more numerically stable approach for negative precisions.

greptile-apps · 2026-06-13T03:48:04Z

+    fn call_bround(input: Series, decimals: Option<i32>) -> Series {
+        let mut args = vec![input.clone()];
+        if let Some(d) = decimals {
+            args.push(
+                Int64Array::from_iter(
+                    Field::new("d", DataType::Int64),
+                    vec![Some(d as i64)].into_iter(),
+                )
+                .into_series(),
+            );
+        }
+        let ctx = EvalContext {
+            row_count: input.len(),
+        };
+        BRound {}
+            .call(FunctionArgs::new_unnamed(args), &ctx)
+            .unwrap()
+    }
+


Test helper passes Int64 for a field declared as Option<i32>

The BRoundArgs struct declares decimals: Option<i32>, yet the test helper pushes an Int64Array as the second argument. It would be worth verifying that calling bround via the SQL engine with an integer literal also works, since SQL may infer the literal as Int64. Adding a test case that passes the decimal as Int32 directly would make the contract explicit.

srilman

@XuQianJin-Stars I believe we already have greatest and least, they are just named columns_min and columns_max. Maybe not the best names, and it would make sense to have greatest and least as aliases, but we dont need I custom implementation I believe

…ex, unhex) - bround(expr, d): HALF_EVEN (banker's) rounding, supports negative d - greatest(*exprs) / least(*exprs): N-ary row-wise min/max, skipping NULLs - hex(expr): integer/string/binary -> uppercase hex (negatives use 64-bit two's complement) - unhex(expr): hex string -> binary, odd-length left-padded with '0', invalid input returns NULL Auto-registered as ScalarUDF in NumericFunctions / global registry, exposed via daft.functions, and unit-tested (17 tests).

- numeric.py: use raw docstring for unhex() to satisfy ruff D301 (escape sequence \x0f in a regular docstring) - bround.rs: avoid multiplying by non-exact powers of 10 for negative precisions. precision >= 0 keeps multiply-then-divide; precision < 0 now divides-then-multiplies by 10^|p|, eliminating the second rounding step that biased ties (e.g. 250 with d=-2). Add regression test plus an Int32 decimals literal test for BRoundArgs<Option<i32>>. - greatest.rs: deduplicate Greatest/Least via a private GreatestLeastKind trait carrying KEEP_GREATER/NAME/DOCSTRING. Both ScalarUDF impls now share compare_inputs through impl_call::<Self>; public unit structs and typetag::serde tags are preserved for backwards compatibility.

XuQianJin-Stars · 2026-06-20T03:17:58Z

@XuQianJin-Stars I believe we already have greatest and least, they are just named columns_min and columns_max. Maybe not the best names, and it would make sense to have greatest and least as aliases, but we dont need I custom implementation I believe

Thanks @srilman! After checking the NULL semantics I confirmed list_min / list_max and the new greatest / least behave identically (skip NULLs row-wise; return NULL only when all inputs are NULL — verified in tests/recordbatch/list/test_list_numeric_aggs.py).
However, list_min / list_max are restricted to numeric/boolean dtypes (min.rs / max.rs enforce is_numeric() || is_boolean()), while greatest / least work on any type with a comparison kernel (Utf8, Date, Timestamp, etc.) and avoid materializing an intermediate list.
So instead of removing the new implementation, I've rewritten columns_min / columns_max as thin wrappers around least / greatest in daft/functions/columnar.py. This:
keeps the public columns_min / columns_max API unchanged (back-compat),
removes the duplication you flagged,
and as a side benefit, lifts the numeric/boolean restriction on columns_min / columns_max for free.
Let me know if you'd prefer to flip the relationship instead (i.e. drop greatest / least and only expose columns_*) — happy to adjust.

Address review feedback: replace the list-based implementation (to_list().list_min()/list_max()) with direct delegation to the existing greatest/least scalar functions. Benefits: - Row-wise NULL skipping: result is NULL only when all inputs in a row are NULL, matching Spark Greatest/Least semantics. - Works on any comparable dtype (numeric, boolean, string, temporal), not just types supported by list aggregation. - Avoids the overhead of constructing an intermediate list column per row. Public surface is preserved: aliases (columns_min/columns_max) and empty-arg error messages remain unchanged. All 80 existing tests/dataframe/test_horizontal.py cases pass.

github-actions Bot added the feat label Jun 13, 2026

greptile-apps Bot reviewed Jun 13, 2026

View reviewed changes

srilman requested changes Jun 18, 2026

View reviewed changes

XuQianJin-Stars added 2 commits June 20, 2026 11:09

XuQianJin-Stars force-pushed the feat/spark-math-functions branch from 7d4ffa0 to d719636 Compare June 20, 2026 03:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Spark-compatible math functions (bround, greatest, least, hex, unhex)#7122

feat: add Spark-compatible math functions (bround, greatest, least, hex, unhex)#7122
XuQianJin-Stars wants to merge 3 commits into
Eventual-Inc:mainfrom
XuQianJin-Stars:feat/spark-math-functions

XuQianJin-Stars commented Jun 13, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot Jun 13, 2026

Uh oh!

greptile-apps Bot Jun 13, 2026

Uh oh!

greptile-apps Bot Jun 13, 2026

Uh oh!

srilman left a comment

Uh oh!

XuQianJin-Stars commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

XuQianJin-Stars commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes Made

Functions added

Implementation details

Behavior examples

Verification

Related Issues

Uh oh!

greptile-apps Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

srilman left a comment

Choose a reason for hiding this comment

Uh oh!

XuQianJin-Stars commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

XuQianJin-Stars commented Jun 13, 2026 •

edited

Loading

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading