Skip to content

feat: add Spark-compatible math functions (bround, greatest, least, hex, unhex)#7122

Open
XuQianJin-Stars wants to merge 3 commits into
Eventual-Inc:mainfrom
XuQianJin-Stars:feat/spark-math-functions
Open

feat: add Spark-compatible math functions (bround, greatest, least, hex, unhex)#7122
XuQianJin-Stars wants to merge 3 commits into
Eventual-Inc:mainfrom
XuQianJin-Stars:feat/spark-math-functions

Conversation

@XuQianJin-Stars

@XuQianJin-Stars XuQianJin-Stars commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Changes Made

This PR adds five Spark-compatible math functions to Daft, callable from
both the Python DataFrame API and Daft SQL. All semantics match Spark
exactly so that workloads migrating from Spark / PySpark can run
unchanged.

Functions added

Function Signature Behavior
bround bround(expr, d=0) HALF_EVEN (banker's) rounding. Supports negative d. NULL passthrough.
greatest greatest(*exprs) N-ary row-wise max, skipping NULLs. Returns NULL iff all inputs are NULL.
least least(*exprs) N-ary row-wise min, skipping NULLs. Returns NULL iff all inputs are NULL.
hex hex(expr) Integer / String / Binary → uppercase hex string. Negatives use 64-bit two's-complement.
unhex unhex(expr) Hex string → binary. Odd-length input is left-padded with '0'. Invalid input → NULL.

Implementation details

  • Rust core under src/daft-functions/src/:
    • numeric/bround.rs — uses f64::round_ties_even to bit-exactly
      match Spark's HALF_EVEN mode; integer inputs short-circuit.
    • numeric/hex.rsHex and Unhex ScalarUDFs covering Int*/UInt*/
      Utf8/Binary, with two's-complement encoding for negative integers.
    • greatest.rs — single ScalarUDF reused for both greatest and
      least via a is_least flag; computes the supertype of all inputs
      and skips NULLs row-wise.
  • Registration: auto-registered as ScalarUDFs in NumericFunctions
    and the global function registry, so SQL picks them up for free
    (SELECT bround(x, 2), SELECT greatest(a, b, c), etc.).
  • Python: wrappers in daft/functions/numeric.py, exported from
    daft.functions so users can do
    from daft.functions import bround, greatest, least, hex, unhex.
  • Banker's rounding: relies on Rust 1.77+ round_ties_even
    (already in the project's MSRV).

Behavior examples

from daft.functions import bround, greatest, least, hex, unhex
from daft import lit, col

df.select(bround(lit(2.5)))         # -> 2.0   (HALF_EVEN, not 3.0)
df.select(bround(lit(3.5)))         # -> 4.0
df.select(bround(lit(1.235), 2))    # -> 1.24

df.select(greatest(col("a"), col("b"), col("c")))  # NULL-skipping
df.select(least(col("a"), col("b")))               # type-promoted

df.select(hex(lit(-1)))             # -> "FFFFFFFFFFFFFFFF"
df.select(hex(lit("abc")))          # -> "616263"
df.select(unhex(lit("F")))          # -> b"\x0f"  (odd-length padded)
df.select(unhex(lit("zz")))         # -> NULL     (invalid)

Verification

  • cargo check (workspace) — no errors
  • cargo fmt --check (workspace) — clean
  • cargo clippy -p daft-functions — no errors, no new warnings
  • ✅ Python wrappers import-tested via daft.functions

Related Issues

Closes #7121

@github-actions github-actions Bot added the feat label Jun 13, 2026
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Adds five Spark-compatible scalar functions — bround, greatest, least, hex, and unhex — as Rust ScalarUDF implementations registered in both the numeric function module and the global SQL registry, with Python wrappers exported from daft.functions.

  • bround uses f64::round_ties_even for banker's rounding and correctly short-circuits integers by promoting them to float first; negative precision is supported but uses a non-representable f64 reciprocal multiplier that could introduce avoidable rounding error on some inputs.
  • greatest/least share a single compare_inputs helper for NULL-skipping row-wise min/max, but NaN values in float columns are not handled specially — Spark treats NaN as greater than any other float, while IEEE 754 comparisons (used here) make NaN invisible, producing different results.
  • hex/unhex implement byte-level encoding cleanly with correct two's-complement handling for negative integers and proper odd-length left-padding for unhex.

Confidence Score: 3/5

The hex/unhex and bround implementations are functionally solid for most inputs, but greatest/least silently produces wrong results for float columns containing NaN, contradicting the Spark-compatible guarantee this PR advertises.

The greatest/least implementation uses raw IEEE 754 comparisons which return false when either operand is NaN. Spark explicitly documents NaN as the maximum value in greatest/least — any float column with NaN values will produce different answers here than in Spark. This is a quiet data-correctness divergence in a feature explicitly positioned as a Spark compatibility layer.

src/daft-functions/src/greatest.rs needs the most attention — specifically the compare_inputs function and its treatment of NaN in float comparisons.

Important Files Changed

Filename Overview
src/daft-functions/src/greatest.rs New file implementing Greatest and Least ScalarUDFs; logic is sound for non-NaN inputs but NaN handling diverges from Spark semantics, and two separate structs could be one parametrised type.
src/daft-functions/src/numeric/bround.rs New bround ScalarUDF using f64::round_ties_even; negative-precision path uses a non-representable f64 reciprocal multiplier which may introduce avoidable rounding errors for some inputs.
src/daft-functions/src/numeric/hex.rs New Hex/Unhex ScalarUDFs; integer two's-complement encoding, string/binary byte encoding, and odd-length unhex padding all look correct and well-tested.
daft/functions/numeric.py Python wrappers for all five new functions; the exported hex name shadows Python's built-in, which should be documented.
daft/functions/init.py Correct alphabetical placement of all five new exports in both the import block and all.
src/daft-functions/src/numeric/mod.rs BRound, Hex, and Unhex correctly registered with NumericFunctions; mod declarations added.
src/daft-functions/src/lib.rs Adds greatest module declaration; Greatest and Least are registered directly, consistent with the coalesce pattern.
src/lib.rs Greatest and Least added to the global function registry alongside existing Coalesce registration; straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PY["Python: daft.functions\nbround / greatest / least / hex / unhex"]
    REG["Global FunctionRegistry\n(src/lib.rs)"]
    NUM["NumericFunctions module\n(numeric/mod.rs)"]
    SQL["SQL engine\n(auto-resolves by name)"]

    PY -->|"_call_builtin_scalar_fn(name, ...)"| REG
    REG --> SQL

    BRound["BRound ScalarUDF\nnumeric/bround.rs\nf64::round_ties_even"]
    HEX["Hex / Unhex ScalarUDF\nnumeric/hex.rs\nbytes_to_hex_upper / decode_hex_padded"]
    GL["Greatest / Least ScalarUDF\ngreatest.rs\ncompare_inputs(keep_greater)"]

    NUM --> BRound
    NUM --> HEX
    REG --> GL

    BRound -->|"series_bround()"| BOUT["Float output"]
    HEX -->|"hex_impl / unhex_impl"| HOUT["Utf8 / Binary output"]
    GL -->|"try_get_collection_supertype + if_else"| GLOUT["Supertype output\nNULL if all-NULL row"]
Loading

Comments Outside Diff (2)

  1. src/daft-functions/src/greatest.rs, line 295-313 (link)

    P1 NaN handling differs from Spark semantics

    In Spark, NaN is treated as larger than any other numeric value for greatest/least on floating-point columns (documented in org.apache.spark.sql.catalyst.expressions.Greatest). The current implementation uses standard IEEE 754 comparisons (gt/lt), where NaN > x and NaN < x both return false. As a result, greatest(1.0, NaN) returns 1.0 here but NaN in Spark, silently diverging from the advertised Spark-compatible semantics whenever a column contains NaN values.

  2. daft/functions/numeric.py, line 76-93 (link)

    P2 hex shadows Python's built-in

    hex is a Python built-in function. Exporting it from daft.functions means any user who writes from daft.functions import hex or from daft.functions import * will silently lose access to Python's built-in hex. Consider adding a note to the docstring warning about this shadowing, or using the qualified name daft.functions.hex in examples.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat: add Spark-compatible math function..." | Re-trigger Greptile

Comment on lines +14 to +80
#[derive(Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct Greatest;

#[typetag::serde]
impl ScalarUDF for Greatest {
fn name(&self) -> &'static str {
"greatest"
}

/// Returns the largest of the input values per row, ignoring NULLs.
/// Returns NULL only when all inputs in a row are NULL. Mirrors Spark's
/// `Greatest` semantics (`org.apache.spark.sql.catalyst.expressions.Greatest`).
fn call(
&self,
inputs: FunctionArgs<Series>,
_ctx: &daft_dsl::functions::scalar::EvalContext,
) -> DaftResult<Series> {
compare_inputs(inputs, /* keep_greater = */ true)
}

fn get_return_field(
&self,
inputs: FunctionArgs<ExprRef>,
schema: &Schema,
) -> DaftResult<Field> {
common_return_field(self.name(), inputs, schema)
}

fn docstring(&self) -> &'static str {
"Returns the largest value among the inputs, skipping NULL values. \
Returns NULL only if all inputs are NULL. Requires at least one argument."
}
}

#[derive(Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct Least;

#[typetag::serde]
impl ScalarUDF for Least {
fn name(&self) -> &'static str {
"least"
}

/// Returns the smallest of the input values per row, ignoring NULLs.
/// Returns NULL only when all inputs in a row are NULL. Mirrors Spark's
/// `Least` semantics (`org.apache.spark.sql.catalyst.expressions.Least`).
fn call(
&self,
inputs: FunctionArgs<Series>,
_ctx: &daft_dsl::functions::scalar::EvalContext,
) -> DaftResult<Series> {
compare_inputs(inputs, /* keep_greater = */ false)
}

fn get_return_field(
&self,
inputs: FunctionArgs<ExprRef>,
schema: &Schema,
) -> DaftResult<Field> {
common_return_field(self.name(), inputs, schema)
}

fn docstring(&self) -> &'static str {
"Returns the smallest value among the inputs, skipping NULL values. \
Returns NULL only if all inputs are NULL. Requires at least one argument."
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Two separate structs where one parametrised struct would suffice

Greatest and Least are structurally identical — the only difference is which comparison direction is passed to compare_inputs. Per the codebase's convention, similar functionality with different variants should be expressed as a single parametrised type (e.g. struct GreatestLeast { is_greatest: bool }) rather than two separate structs. The internal helper compare_inputs(inputs, keep_greater) already has the right shape; the public structs just add boilerplate.

Rule Used: Prefer single parametrized functions over multiple... (source)

Learned From
Eventual-Inc/Daft#5207

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +109 to +115
fn f64_bround(arr: &Float64Array, precision: i32) -> DaftResult<Float64Array> {
if precision == 0 {
arr.apply(|v| v.round_ties_even())
} else {
let multiplier: f64 = 10.0f64.pow(precision);
arr.apply(|v| (v * multiplier).round_ties_even() / multiplier)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Negative precision uses a non-representable f64 multiplier

For negative precision (e.g. -2), multiplier = 10.0f64.pow(-2) = 0.01, which is not exactly representable in f64. The round-trip (v * 0.01).round_ties_even() / 0.01 can accumulate two rounding errors (once on multiply, once on divide). Using integer division/multiplication — (v / 10.0f64.powi(-precision)).round_ties_even() * 10.0f64.powi(-precision) — avoids multiplying by a non-representable reciprocal and is the more numerically stable approach for negative precisions.

Comment on lines +136 to +154
fn call_bround(input: Series, decimals: Option<i32>) -> Series {
let mut args = vec![input.clone()];
if let Some(d) = decimals {
args.push(
Int64Array::from_iter(
Field::new("d", DataType::Int64),
vec![Some(d as i64)].into_iter(),
)
.into_series(),
);
}
let ctx = EvalContext {
row_count: input.len(),
};
BRound {}
.call(FunctionArgs::new_unnamed(args), &ctx)
.unwrap()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test helper passes Int64 for a field declared as Option<i32>

The BRoundArgs struct declares decimals: Option<i32>, yet the test helper pushes an Int64Array as the second argument. It would be worth verifying that calling bround via the SQL engine with an integer literal also works, since SQL may infer the literal as Int64. Adding a test case that passes the decimal as Int32 directly would make the contract explicit.

@srilman srilman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XuQianJin-Stars I believe we already have greatest and least, they are just named columns_min and columns_max. Maybe not the best names, and it would make sense to have greatest and least as aliases, but we dont need I custom implementation I believe

…ex, unhex)

- bround(expr, d): HALF_EVEN (banker's) rounding, supports negative d - greatest(*exprs) / least(*exprs): N-ary row-wise min/max, skipping NULLs - hex(expr): integer/string/binary -> uppercase hex (negatives use 64-bit two's complement) - unhex(expr): hex string -> binary, odd-length left-padded with '0', invalid input returns NULL Auto-registered as ScalarUDF in NumericFunctions / global registry, exposed via daft.functions, and unit-tested (17 tests).
- numeric.py: use raw docstring for unhex() to satisfy ruff D301
  (escape sequence \x0f in a regular docstring)
- bround.rs: avoid multiplying by non-exact powers of 10 for negative
  precisions. precision >= 0 keeps multiply-then-divide; precision < 0
  now divides-then-multiplies by 10^|p|, eliminating the second rounding
  step that biased ties (e.g. 250 with d=-2). Add regression test plus
  an Int32 decimals literal test for BRoundArgs<Option<i32>>.
- greatest.rs: deduplicate Greatest/Least via a private GreatestLeastKind
  trait carrying KEEP_GREATER/NAME/DOCSTRING. Both ScalarUDF impls now
  share compare_inputs through impl_call::<Self>; public unit structs
  and typetag::serde tags are preserved for backwards compatibility.
@XuQianJin-Stars

Copy link
Copy Markdown
Contributor Author

@XuQianJin-Stars I believe we already have greatest and least, they are just named columns_min and columns_max. Maybe not the best names, and it would make sense to have greatest and least as aliases, but we dont need I custom implementation I believe

Thanks @srilman! After checking the NULL semantics I confirmed list_min / list_max and the new greatest / least behave identically (skip NULLs row-wise; return NULL only when all inputs are NULL — verified in tests/recordbatch/list/test_list_numeric_aggs.py).
However, list_min / list_max are restricted to numeric/boolean dtypes (min.rs / max.rs enforce is_numeric() || is_boolean()), while greatest / least work on any type with a comparison kernel (Utf8, Date, Timestamp, etc.) and avoid materializing an intermediate list.
So instead of removing the new implementation, I've rewritten columns_min / columns_max as thin wrappers around least / greatest in daft/functions/columnar.py. This:
keeps the public columns_min / columns_max API unchanged (back-compat),
removes the duplication you flagged,
and as a side benefit, lifts the numeric/boolean restriction on columns_min / columns_max for free.
Let me know if you'd prefer to flip the relationship instead (i.e. drop greatest / least and only expose columns_*) — happy to adjust.

Address review feedback: replace the list-based implementation
(to_list().list_min()/list_max()) with direct delegation to the
existing greatest/least scalar functions.

Benefits:
- Row-wise NULL skipping: result is NULL only when all inputs in a
  row are NULL, matching Spark Greatest/Least semantics.
- Works on any comparable dtype (numeric, boolean, string, temporal),
  not just types supported by list aggregation.
- Avoids the overhead of constructing an intermediate list column
  per row.

Public surface is preserved: aliases (columns_min/columns_max) and
empty-arg error messages remain unchanged. All 80 existing
tests/dataframe/test_horizontal.py cases pass.
@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/spark-math-functions branch from 7d4ffa0 to d719636 Compare June 20, 2026 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add Spark-compatible math functions: bround, greatest, least, hex, unhex

2 participants