feat: Add Spark-compatible array functions: array_except, array_intersect, array_union, array_compact, array_position #7082

Description

@XuQianJin-Stars

Is your feature request related to a problem?

Daft already provides a rich set of list_* functions, but several commonly used array operations from Apache Spark / Databricks SQL are still missing. This makes it harder for users to:

  • Migrate existing Spark / PySpark workloads to Daft.
  • Write portable SQL that works across both engines.
  • Express common list manipulations (set difference, intersection, and union; NULL removal; and finding an element's position) without resorting to verbose UDFs or workarounds.

Specifically, the following Spark functions currently have no direct equivalent in Daft:

| Spark function | Behavior |
| --- | --- |
| array_except | Returns elements in array1 that are not in array2 (set difference, deduped) |
| array_intersect | Returns the set intersection of two arrays (deduped) |
| array_union | Returns the set union of two arrays (deduped) |
| array_compact | Removes NULL elements from an array |
| array_position | Returns the 1-based position of the first occurrence of an element, 0 if not found |

These appear in the SPARK_FUNCTION_COMPARISON.md matrix as unimplemented.

No related issue ID — this is a fresh proposal, but it follows the ongoing effort tracked in SPARK_FUNCTION_COMPARISON.md to close the Spark-compatibility gap.

Describe the solution you'd like

Implement five new list/array functions in daft-functions-list and expose them through the Python expression API, the top-level daft.functions module, and the SQL layer, with Spark-compatible aliases.

Behavior (Spark-compatible)

  • array_except(a, b) / list_except — set difference, deduplicated, NULLs dropped, preserves first-seen order from a.
  • array_intersect(a, b) / list_intersect — set intersection, deduplicated, NULLs dropped, preserves first-seen order from a.
  • array_union(a, b) / list_union — set union, deduplicated, NULLs dropped, order = elements of a followed by new elements of b.
  • array_compact(a) / list_compact — returns the input list with all NULL elements removed.
  • array_position(a, item) / list_position — returns the 1-based index of the first occurrence of item in a, 0 if not found, NULL if either input is NULL.
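To make the rules above concrete, here is a minimal pure-Python reference model of the five behaviors as this issue describes them. It is only a sketch for reasoning about expected results, not the Rust kernels; the function names reuse the proposed list_* names for readability, and returning None when an entire input list is NULL for the set operations and list_compact is an assumption, since the issue states that rule only for array_position.

```python
from typing import Hashable, Iterable, List, Optional

Elem = Optional[Hashable]


def _dedup_keep_order(items: Iterable[Elem]) -> List[Hashable]:
    """Drop NULLs (None) and duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for x in items:
        if x is None or x in seen:
            continue
        seen.add(x)
        out.append(x)
    return out


def list_except(a, b):
    if a is None or b is None:  # assumption: NULL list in -> NULL out
        return None
    exclude = {x for x in b if x is not None}
    return _dedup_keep_order(x for x in a if x not in exclude)


def list_intersect(a, b):
    if a is None or b is None:  # assumption: NULL list in -> NULL out
        return None
    keep = {x for x in b if x is not None}
    return _dedup_keep_order(x for x in a if x in keep)


def list_union(a, b):
    if a is None or b is None:  # assumption: NULL list in -> NULL out
        return None
    # Elements of a first, then elements of b not already emitted.
    return _dedup_keep_order(list(a) + list(b))


def list_compact(a):
    if a is None:  # assumption: NULL list in -> NULL out
        return None
    # Only NULLs are removed; duplicates are kept.
    return [x for x in a if x is not None]


def list_position(a, item):
    if a is None or item is None:
        return None
    for i, x in enumerate(a, start=1):  # 1-based index
        if x is not None and x == item:
            return i
    return 0  # not found
```

For example, under this model list_union([1, None, 2], [2, 3, 3]) yields [1, 2, 3], and list_position([10, 20, 20], 20) yields 2.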

Surfaces to update

  • ✅ Rust kernels in src/daft-functions-list/src/{except,intersect,union,compact,position}.rs
  • ✅ Hash-based set semantics in src/daft-functions-list/src/series.rs (with proper NULL handling matching Spark)
  • ✅ Python expressions: daft.Expression.list.except_/intersect/union/compact/position
  • ✅ Top-level functions: daft.functions.list_except / list_intersect / list_union / list_compact / list_position
  • ✅ SQL aliases: array_except, array_intersect, array_union, array_compact, array_position (plus list_* equivalents)
  • ✅ Tests under tests/recordbatch/list/test_list_set_ops.py covering Python API + SQL paths, NULL handling, dedup, type promotion, empty arrays, and not-found cases.
  • ✅ Update SPARK_FUNCTION_COMPARISON.md to mark these as implemented.

Type semantics

  • All five functions promote element types via the existing try_supertype machinery so array_union(int_list, float_list) works correctly.
  • Output element nullability is preserved consistently with other list functions.
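As an illustration of the promotion rule, the following pure-Python sketch casts both inputs to a common element type before applying the union. The promote helper is a hypothetical stand-in for supertype resolution that handles only int and float; it does not reflect how try_supertype is implemented.

```python
from typing import List, Optional, Tuple, Union

Number = Union[int, float]
NumList = List[Optional[Number]]


def promote(a: NumList, b: NumList) -> Tuple[NumList, NumList]:
    """Hypothetical supertype step: if any element is a float, cast all to float."""
    has_float = any(isinstance(x, float) for x in a + b)
    target = float if has_float else int

    def cast(xs: NumList) -> NumList:
        return [None if x is None else target(x) for x in xs]

    return cast(a), cast(b)


def union_promoted(a: NumList, b: NumList) -> List[Number]:
    """Union after promotion: deduplicated, NULLs dropped, a's elements first."""
    a2, b2 = promote(a, b)
    seen = set()
    out = []
    for x in a2 + b2:
        if x is not None and x not in seen:
            seen.add(x)
            out.append(x)
    return out


result = union_promoted([1, 2], [2.0, 3.5])
```

Here result holds three float elements: the integers from the first list are cast to float before deduplication, so 2 and 2.0 collapse into a single entry.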

Implementation note

Set operations are implemented with a hashed probe table over the right-hand list (mirroring the existing list_distinct approach), giving expected O(N + M) behavior per row instead of O(N·M), where N and M are the lengths of the left-hand and right-hand lists.
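The difference between the two strategies can be sketched in plain Python for the intersect case. This is an illustrative model only, with hypothetical function names; the actual kernels are written in Rust and operate on columnar data rather than Python lists.

```python
from typing import Hashable, List, Optional, Tuple

Row = List[Optional[Hashable]]


def intersect_nested(a: Row, b: Row) -> Row:
    """Quadratic baseline: every membership test scans a list."""
    out: Row = []
    for x in a:
        if x is not None and x in b and x not in out:
            out.append(x)
    return out


def intersect_hashed(a: Row, b: Row) -> Row:
    """Build a hash set over the right-hand list once, then probe it
    while scanning the left-hand list a single time."""
    probe = {x for x in b if x is not None}  # one pass over b
    emitted = set()                          # tracks values already output
    out: Row = []
    for x in a:                              # one pass over a
        if x is not None and x in probe and x not in emitted:
            emitted.add(x)
            out.append(x)
    return out


# Each tuple models one row holding a left-hand and a right-hand list.
rows: List[Tuple[Row, Row]] = [
    ([1, 2, 2, 3], [2, 3, 4]),
    ([None, 5], [5, None]),
    ([], [1]),
]
hashed = [intersect_hashed(a, b) for a, b in rows]
nested = [intersect_nested(a, b) for a, b in rows]
```

Both versions return the same lists for these rows; the hashed version simply replaces repeated list scans with constant-time expected lookups.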

PR: https://github.com/Eventual-Inc/Daft/pull/new/feat/spark-array-functions
Branch: feat/spark-array-functions

Describe alternatives you've considered

  1. Pure Python UDFs — users could compose list_distinct, list_value_counts, is_in, list_filter, etc. to emulate these. Rejected because:

    • It produces verbose, error-prone code (especially for NULL semantics, which differ subtly between Spark and naive set operations).
    • It bypasses native vectorized execution, costing significant performance.
    • It does not solve the SQL-compatibility gap — array_* SQL identifiers still need to be resolved by the planner.
  2. Implement only the SQL-side aliases that map to existing list_* functions — Rejected because Daft does not yet have native equivalents for except / intersect / union / compact / position semantics as Spark defines them (set-deduplicated with NULL drop). Aliasing alone is insufficient.

  3. Wait for Substrait / external library integration to provide these — Rejected because these are simple, foundational list primitives that should live in daft-functions-list next to their existing siblings (list_distinct, list_contains, …) for consistency and discoverability.

The chosen approach (native Rust kernels + Python + SQL) is the same pattern Daft already uses for its existing list functions, so it integrates cleanly with no new abstractions.

Component(s)

No response

Additional Context

No response

Would you like to implement a fix?

Yes
