[PySpark] feat: add explode and explode_outer to experimental PySpark API #415
Open
tinovyatkin wants to merge 1 commit into duckdb:main from
Conversation
Implement the `explode` and `explode_outer` collection functions for DuckDB's experimental PySpark-compatible API. These are commonly used PySpark functions for flattening array columns into individual rows.

- `explode(col)` maps to DuckDB's `unnest()`, dropping NULL/empty arrays
- `explode_outer(col)` preserves NULL/empty array rows by substituting `[NULL]` before unnesting via a CASE expression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
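To make the semantic difference concrete, here is an illustrative pure-Python model of the two behaviors described above (not the PR's actual code, which goes through DuckDB expressions):

```python
# Sketch of PySpark explode vs. explode_outer semantics for an array column,
# modeled over plain dicts. None and [] stand in for NULL/empty arrays.

def explode(rows, key):
    """Emit one row per array element; rows with NULL/empty arrays are dropped."""
    out = []
    for row in rows:
        arr = row.get(key)
        if arr:  # None and [] are skipped, matching explode
            for item in arr:
                out.append({**row, key: item})
    return out

def explode_outer(rows, key):
    """Like explode, but NULL/empty arrays yield one row with a NULL value."""
    out = []
    for row in rows:
        arr = row.get(key)
        for item in (arr if arr else [None]):  # the [NULL] substitution
            out.append({**row, key: item})
    return out

rows = [
    {"id": 1, "xs": [10, 20]},
    {"id": 2, "xs": []},
    {"id": 3, "xs": None},
]
print(explode(rows, "xs"))        # ids 2 and 3 are dropped
print(explode_outer(rows, "xs"))  # ids 2 and 3 survive with xs=None
```

This mirrors the mapping the PR relies on: DuckDB's `unnest()` already behaves like `explode` for arrays, so only `explode_outer` needs the extra `[NULL]` rewrite.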
Summary
Adds the `explode(col)` and `explode_outer(col)` collection functions to the experimental PySpark-compatible API.

- `explode` maps to DuckDB's `unnest()` function, which natively drops rows with NULL/empty arrays (matching PySpark semantics)
- `explode_outer` preserves rows with NULL/empty arrays by substituting `[NULL]` via a `CaseExpression` before unnesting

Implementation details
- `explode(col)` is a one-liner wrapping `FunctionExpression("unnest", ...)`, since DuckDB's `unnest` already matches PySpark's `explode` behavior for arrays (drops NULL/empty).
- `explode_outer(col)` builds a `CaseExpression` that replaces NULL or empty arrays with `[NULL]` before passing to `unnest`, so those rows appear in the output with a NULL value instead of being dropped.
- Both functions follow the existing patterns used by other collection functions like `flatten`, `array_compact`, and `array_remove` in `functions.py`.

Not included (future work)
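As a rough sketch of the rewrite `explode_outer` performs, the CASE expression it builds is equivalent to SQL along these lines. The helper name `render_explode_outer_sql` is hypothetical, used only to show the shape of the generated expression, not the PR's actual API:

```python
# Hypothetical sketch: render the SQL a CASE-based explode_outer rewrite
# corresponds to. NULL or empty arrays become [NULL] so unnest keeps the row.

def render_explode_outer_sql(col: str) -> str:
    case = (
        f"CASE WHEN ({col} IS NULL) OR (len({col}) = 0) "
        f"THEN [NULL] ELSE {col} END"
    )
    return f"unnest({case})"

print(render_explode_outer_sql("tags"))
# unnest(CASE WHEN (tags IS NULL) OR (len(tags) = 0) THEN [NULL] ELSE tags END)
```

Running the resulting expression through DuckDB would emit a NULL-valued row for NULL/empty arrays instead of dropping them, which matches PySpark's `explode_outer`.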
- `posexplode`/`posexplode_outer`: these produce multiple output columns (pos + value), which requires multi-column generator support beyond the current `Column` abstraction
- MAP support: `unnest` doesn't accept MAP types directly; this would require `map_entries()` wrapping and struct field extraction
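To illustrate why `posexplode` doesn't fit the current single-`Column` abstraction, here is a small pure-Python sketch of its semantics (assumed from PySpark's documented behavior, not taken from this PR): each array element produces *two* values, a position and the element itself.

```python
# Sketch of posexplode semantics: every element yields a (pos, value) pair,
# i.e. two output columns, which a single-Column generator cannot express.

def posexplode(rows, key):
    out = []
    for row in rows:
        for pos, item in enumerate(row.get(key) or []):
            out.append({**row, "pos": pos, key: item})
    return out

print(posexplode([{"id": 1, "xs": ["a", "b"]}], "xs"))
# [{'id': 1, 'pos': 0, 'xs': 'a'}, {'id': 1, 'pos': 1, 'xs': 'b'}]
```

The need to emit the extra `pos` column alongside the value is exactly the multi-column generator support called out as future work above.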