Is your feature request related to a problem?
Daft already provides a rich set of list_* functions, but several commonly used array operations from Apache Spark / Databricks SQL are still missing. This makes it harder for users to:
- Migrate existing Spark / PySpark workloads to Daft.
- Write portable SQL that works across both engines.
- Express common set-style list manipulations (set difference / intersection / union / dedup, and finding an element's position) without resorting to verbose UDFs or workarounds.
Specifically, the following Spark functions currently have no direct equivalent in Daft:
| Spark function | Behavior |
| --- | --- |
| array_except | Returns elements in array1 that are not in array2 (set difference, deduped) |
| array_intersect | Returns the set intersection of two arrays (deduped) |
| array_union | Returns the set union of two arrays (deduped) |
| array_compact | Removes NULL elements from an array |
| array_position | Returns the 1-based position of the first occurrence of an element, 0 if not found |
These appear in the SPARK_FUNCTION_COMPARISON.md matrix as unimplemented.
No related issue ID — this is a fresh proposal, but it follows the ongoing effort tracked in SPARK_FUNCTION_COMPARISON.md to close the Spark-compatibility gap.
Describe the solution you'd like
Implement five new list/array functions in daft-functions-list and expose them through the Python expression API, the top-level daft.functions module, and the SQL layer (with Spark-compatible aliases).
Behavior (Spark-compatible)
array_except(a, b) / list_except — set difference, deduplicated, NULLs dropped, preserves first-seen order from a.
array_intersect(a, b) / list_intersect — set intersection, deduplicated, NULLs dropped, preserves first-seen order from a.
array_union(a, b) / list_union — set union, deduplicated, NULLs dropped, order = elements of a followed by new elements of b.
array_compact(a) / list_compact — returns the input list with all NULL elements removed.
array_position(a, item) / list_position — returns the 1-based index of the first occurrence of item in a, 0 if not found, NULL if either input is NULL.
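The behavior listed above can be written down as a small pure-Python reference model. This is only a sketch of the semantics as described in this issue, not the proposed Rust kernels: the function names mirror the proposed list_* names, Python lists stand in for list columns, None stands in for NULL, and returning None from the set operations when an input list is NULL is an assumption (the issue only states NULL propagation for list_position).

```python
from typing import Any, List, Optional


def _dedup_non_null(items: List[Any]) -> List[Any]:
    """Drop None values and duplicates, keeping first-seen order."""
    out: List[Any] = []
    for x in items:
        if x is not None and x not in out:
            out.append(x)
    return out


def list_except(a: Optional[List[Any]], b: Optional[List[Any]]) -> Optional[List[Any]]:
    # Assumption: a NULL input list yields NULL.
    if a is None or b is None:
        return None
    return [x for x in _dedup_non_null(a) if x not in b]


def list_intersect(a: Optional[List[Any]], b: Optional[List[Any]]) -> Optional[List[Any]]:
    if a is None or b is None:
        return None
    return [x for x in _dedup_non_null(a) if x in b]


def list_union(a: Optional[List[Any]], b: Optional[List[Any]]) -> Optional[List[Any]]:
    if a is None or b is None:
        return None
    # Elements of a first, then elements of b not already seen.
    return _dedup_non_null(a + b)


def list_compact(a: Optional[List[Any]]) -> Optional[List[Any]]:
    if a is None:
        return None
    return [x for x in a if x is not None]


def list_position(a: Optional[List[Any]], item: Any) -> Optional[int]:
    # NULL if either input is NULL, 0 if not found, otherwise 1-based index.
    if a is None or item is None:
        return None
    for i, x in enumerate(a, start=1):
        if x == item:
            return i
    return 0
```

A model like this could double as an oracle for the test cases listed under "Surfaces to update" (NULL handling, dedup, empty arrays, not-found).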
Surfaces to update
- ✅ Rust kernels in src/daft-functions-list/src/{except,intersect,union,compact,position}.rs
- ✅ Hash-based set semantics in src/daft-functions-list/src/series.rs (with proper NULL handling matching Spark)
- ✅ Python expressions: daft.Expression.list.except_/intersect/union/compact/position
- ✅ Top-level functions: daft.functions.list_except / list_intersect / list_union / list_compact / list_position
- ✅ SQL aliases: array_except, array_intersect, array_union, array_compact, array_position (plus list_* equivalents)
- ✅ Tests under tests/recordbatch/list/test_list_set_ops.py covering Python API + SQL paths, NULL handling, dedup, type promotion, empty arrays, and not-found cases.
- ✅ Update SPARK_FUNCTION_COMPARISON.md to mark these as implemented.
Type semantics
- All five functions promote element types via the existing try_supertype machinery so array_union(int_list, float_list) works correctly.
- Output element nullability is preserved consistently with other list functions.
Implementation note
Set operations are implemented with a hashed probe table over the right-hand list (mirroring the existing list_distinct approach), giving O(N + M) behavior per row instead of O(N·M).
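To illustrate the complexity claim, here is a minimal Python sketch contrasting a nested-loop set difference with a hashed-probe version for a single row. It is not the Rust kernel: except_nested and except_hashed are hypothetical names, a Python set stands in for the probe table, and elements are assumed to be hashable.

```python
from typing import Any, List


def except_nested(a: List[Any], b: List[Any]) -> List[Any]:
    """Naive version: every element of a is compared against all of b."""
    out: List[Any] = []
    for x in a:
        if x is None or x in out:
            continue
        if any(x == y for y in b):  # O(M) scan per element of a
            continue
        out.append(x)
    return out


def except_hashed(a: List[Any], b: List[Any]) -> List[Any]:
    """Hashed version: build a probe table over b once, then probe per element."""
    probe = {y for y in b if y is not None}  # built once, O(M)
    seen = set()                             # handles dedup of a in O(1) per element
    out: List[Any] = []
    for x in a:                              # O(N) expected-constant-time probes
        if x is None or x in probe or x in seen:
            continue
        seen.add(x)
        out.append(x)
    return out
```

Both functions return the same result for the same inputs; only the per-row cost differs. Intersection and union follow the same shape, with the probe condition inverted or the table seeded from a.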
PR: https://github.com/Eventual-Inc/Daft/pull/new/feat/spark-array-functions
Branch: feat/spark-array-functions
Describe alternatives you've considered
- Pure Python UDFs: users could compose list_distinct, list_value_counts, is_in, list_filter, etc. to emulate these. Rejected because:
  - It produces verbose, error-prone code (especially for NULL semantics, which differ subtly between Spark and naive set operations).
  - It bypasses native vectorized execution, costing significant performance.
  - It does not solve the SQL-compatibility gap: array_* SQL identifiers still need to be resolved by the planner.
- Implement only the SQL-side aliases that map to existing list_* functions. Rejected because Daft does not yet have native equivalents for except / intersect / union / compact / position semantics as Spark defines them (set-deduplicated with NULL drop). Aliasing alone is insufficient.
- Wait for Substrait / external library integration to provide these. Rejected because these are simple, foundational list primitives that should live in daft-functions-list next to their existing siblings (list_distinct, list_contains, …) for consistency and discoverability.
The chosen approach (native Rust kernels + Python + SQL) is the same pattern Daft already uses for its existing list functions, so it integrates cleanly with no new abstractions.
Component(s)
No response
Additional Context
No response
Would you like to implement a fix?
Yes