[PySpark] feat: add explode and explode_outer to experimental PySpark API #415

Open
tinovyatkin wants to merge 1 commit into duckdb:main from tinovyatkin:feat/spark-explode

@tinovyatkin tinovyatkin commented Apr 3, 2026

Summary

  • Implement explode(col) and explode_outer(col) collection functions in the experimental PySpark-compatible API
  • explode maps to DuckDB's unnest() function, which natively drops rows with NULL/empty arrays (matching PySpark semantics)
  • explode_outer preserves rows with NULL/empty arrays by substituting [NULL] via a CaseExpression before unnesting
  • Add 5 tests covering basic usage, NULL/empty handling, Column object input, and the outer variant

Implementation details

explode(col) is a one-liner wrapping FunctionExpression("unnest", ...), since DuckDB's unnest already matches PySpark's explode behavior for arrays (drops NULL/empty).

explode_outer(col) builds a CaseExpression that replaces NULL or empty arrays with [NULL] before passing to unnest, so those rows appear in the output with a NULL value instead of being dropped.
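To make the difference between the two functions concrete, here is a minimal pure-Python model of the row-level semantics described above. It is only an illustrative sketch, not the code in this PR: the helper names `explode_rows` and `explode_outer_rows` are hypothetical, rows are modelled as `(key, array)` tuples, and `None` stands in for SQL NULL.

```python
from typing import Any, Iterable, Iterator, List, Optional, Tuple

# Hypothetical row shape for illustration: (key, array-or-NULL).
Row = Tuple[Any, Optional[List[Any]]]


def explode_rows(rows: Iterable[Row]) -> Iterator[Tuple[Any, Any]]:
    """Model of explode: one output row per element.

    Rows whose array is NULL (None) or empty produce no output rows.
    """
    for key, arr in rows:
        for item in arr or []:
            yield key, item


def explode_outer_rows(rows: Iterable[Row]) -> Iterator[Tuple[Any, Any]]:
    """Model of explode_outer: NULL/empty arrays yield one row with NULL."""
    for key, arr in rows:
        # Mirrors the CASE substitution: NULL or empty array -> [NULL].
        effective = arr if arr else [None]
        for item in effective:
            yield key, item


rows = [(1, [10, 20]), (2, []), (3, None)]
exploded = list(explode_rows(rows))
exploded_outer = list(explode_outer_rows(rows))
```

With this input, `exploded` keeps only the two rows for key 1, while `exploded_outer` also keeps keys 2 and 3, each paired with `None`.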

Both functions follow the existing patterns used by other collection functions like flatten, array_compact, and array_remove in functions.py.

Not included (future work)

  • posexplode / posexplode_outer — these produce multiple output columns (pos + value), which requires multi-column generator support beyond the current Column abstraction
  • Map input support — DuckDB's unnest doesn't accept MAP types directly; this would require map_entries() wrapping and struct field extraction
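For reference, the two deferred features can be modelled in plain Python as below. This is a sketch of the PySpark semantics being targeted, not a proposed implementation: the helper names are hypothetical, `None` stands in for SQL NULL, and `dict.items()` stands in for what `map_entries()` plus struct field extraction would do on the DuckDB side.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def posexplode_rows(
    rows: Iterable[Tuple[Any, Optional[List[Any]]]]
) -> Iterator[Tuple[Any, int, Any]]:
    """Model of posexplode: emits (key, pos, value) with 0-based positions.

    Each input row fans out into two generated columns (pos and value),
    which is why a single-column abstraction is not enough.
    """
    for key, arr in rows:
        for pos, item in enumerate(arr or []):
            yield key, pos, item


def explode_map_rows(
    rows: Iterable[Tuple[Any, Optional[Dict[Any, Any]]]]
) -> Iterator[Tuple[Any, Any, Any]]:
    """Model of explode on a map: emits (row key, map key, map value)."""
    for key, mapping in rows:
        # items() plays the role of map_entries(): a list of key/value pairs.
        for map_key, map_value in (mapping or {}).items():
            yield key, map_key, map_value


positions = list(posexplode_rows([(1, ["a", "b"]), (2, None)]))
map_rows = list(explode_map_rows([(1, {"x": 10, "y": 20}), (2, {})]))
```

Both helpers drop NULL/empty inputs, matching the non-outer variants; an outer variant would substitute a single all-NULL row instead.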

Implement the `explode` and `explode_outer` collection functions for
DuckDB's experimental PySpark-compatible API. These are commonly used
PySpark functions for flattening array columns into individual rows.

- `explode(col)` maps to DuckDB's `unnest()`, dropping NULL/empty arrays
- `explode_outer(col)` preserves NULL/empty array rows by substituting
  `[NULL]` before unnesting via a CASE expression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tinovyatkin tinovyatkin changed the title Add explode and explode_outer to experimental PySpark API [PySpark]: feat: add explode and explode_outer to experimental PySpark API Apr 3, 2026
@tinovyatkin tinovyatkin changed the title [PySpark]: feat: add explode and explode_outer to experimental PySpark API [PySpark] feat: add explode and explode_outer to experimental PySpark API Apr 3, 2026
@tinovyatkin tinovyatkin marked this pull request as ready for review April 3, 2026 08:41