fix: rewrite concat(array, ...) to array_concat #21689

Open
hcrosse wants to merge 4 commits into apache:main from hcrosse:fix/concat-array-rewrite

Conversation

@hcrosse
Contributor

@hcrosse hcrosse commented Apr 17, 2026

Which issue does this PR close?

Rationale for this change

concat(array, array, ...) in SQL dispatches to the string concat UDF, which only handles string and binary types. With array arguments the call is coerced to a string form and concatenated textually, so concat([1,2,3], [4,5]) returns [1,2,3][4,5] instead of [1,2,3,4,5]. We want concat to work on arrays rather than rejecting the call as a type error, to match DuckDB's behavior.
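The difference between the two behaviors can be sketched with a toy model. This is only an illustration, assuming plain `Vec<i64>` values stand in for Arrow lists; the helper names `render`, `concat_as_text`, and `concat_as_arrays` are hypothetical and are not DataFusion functions.

```rust
// Toy model only: plain Vec<i64> stands in for an Arrow list value.

/// Render a list in the compact form used in the description, e.g. `[1,2,3]`.
fn render(list: &[i64]) -> String {
    let items: Vec<String> = list.iter().map(|v| v.to_string()).collect();
    format!("[{}]", items.join(","))
}

/// Roughly what the string `concat` path does with array arguments:
/// each argument is turned into text, and the texts are joined.
fn concat_as_text(lists: &[Vec<i64>]) -> String {
    lists.iter().map(|l| render(l)).collect()
}

/// What an array concat is expected to do: append the elements in order.
fn concat_as_arrays(lists: &[Vec<i64>]) -> Vec<i64> {
    lists.iter().flatten().copied().collect()
}

fn main() {
    let args = vec![vec![1, 2, 3], vec![4, 5]];
    println!("textual:      {}", concat_as_text(&args));
    println!("element-wise: {}", render(&concat_as_arrays(&args)));
}
```

The first helper reproduces the `[1,2,3][4,5]` shape reported above; the second produces the intended `[1,2,3,4,5]`.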

Two earlier attempts were rejected. #18137 changed ConcatFunc's signature to accept arrays and broke simplify_expressions. #18105 duplicated array_concat logic inside ConcatFunc.

This PR rewrites concat(array, ...) to array_concat(array, ...) at the analyzer phase. Every logical plan gets the corrected behavior regardless of frontend, the string concat signature stays untouched, and no array logic is duplicated.

What changes are included in this PR?

A new ConcatArrayRewrite FunctionRewrite lives in datafusion-functions-nested. It detects calls to ConcatFunc by identity check via Any::is::<ConcatFunc>, so user-level shadowing of concat, such as Spark's variant, is unaffected. When all args resolve to List, LargeList, or FixedSizeList, it rewrites the call to array_concat_udf(). A mix of array and non-array arguments returns a plan_err.
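The decision logic described above can be sketched without any DataFusion types. This is a rough model under stated assumptions: `ConcatFunc` and `SparkConcat` here are empty stand-in structs, and `ArgType`, `Outcome`, and `rewrite_concat` are hypothetical names invented for the illustration. Only `std::any::Any::is` is a real API; the actual rewrite in the PR may be structured differently.

```rust
use std::any::Any;

// Hypothetical stand-ins; none of these are DataFusion's real types.
struct ConcatFunc; // models the built-in string concat
struct SparkConcat; // models a user-registered function also named "concat"

#[derive(Clone, Copy, Debug, PartialEq)]
enum ArgType {
    Utf8,
    Int64,
    List,
    LargeList,
    FixedSizeList,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Unchanged,
    RewrittenToArrayConcat,
}

fn is_list(t: ArgType) -> bool {
    matches!(t, ArgType::List | ArgType::LargeList | ArgType::FixedSizeList)
}

/// Decide what to do with a call, given the function object and its argument types.
fn rewrite_concat(func: &dyn Any, args: &[ArgType]) -> Result<Outcome, String> {
    // Identity check on the concrete type, not on the function name,
    // so a shadowing function called "concat" is left alone.
    if !func.is::<ConcatFunc>() {
        return Ok(Outcome::Unchanged);
    }
    let lists = args.iter().filter(|t| is_list(**t)).count();
    if lists == 0 {
        // Ordinary string concat: nothing to rewrite.
        Ok(Outcome::Unchanged)
    } else if lists == args.len() {
        Ok(Outcome::RewrittenToArrayConcat)
    } else {
        Err("concat cannot mix array and non-array arguments".to_string())
    }
}

fn main() {
    use ArgType::*;
    println!("{:?}", rewrite_concat(&ConcatFunc, &[List, FixedSizeList]));
    println!("{:?}", rewrite_concat(&ConcatFunc, &[LargeList, LargeList]));
    println!("{:?}", rewrite_concat(&ConcatFunc, &[Utf8, Utf8]));
    println!("{:?}", rewrite_concat(&ConcatFunc, &[List, Int64]));
    println!("{:?}", rewrite_concat(&SparkConcat, &[List, List]));
}
```

In this model the last call is left unchanged even though every argument is a list, because the type check fails for the shadowing function.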

The rewrite is wired into SessionStateDefaults::default_function_rewrites() and registered on the analyzer in SessionStateBuilder::with_default_features(), which is the actual default-init path. It's also registered via FunctionRegistry::register_function_rewrite in functions_nested::register_all as a fallback for custom registries.

Known limitation: concat(List, LargeList) hits an existing array_concat coercion bug (#21702). FSL + List works.

Are these changes tested?

New SLT coverage in array/array_concat.slt:

  • 2- and 3-argument array concat, table-column concat, arrays with NULLs, string arrays
  • LargeList, FixedSizeList, and FSL + List mixed inputs
  • NULL::integer[] at either position, all-NULL case
  • Two error cases for mixed array and non-array

explain.slt is updated to reflect the new apply_function_rewrites line that now appears in EXPLAIN VERBOSE output.

Are there any user-facing changes?

concat(array, ...) now returns correct array_concat results instead of the prior wrong output. No public API changes.

@github-actions (bot) added the labels core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and functions (Changes to functions implementation) on Apr 17, 2026
Contributor

@neilconway neilconway left a comment

Overall looks good!

Can we add a brief note to the PR description explaining why we want concat to work on arrays to begin with (e.g., rather than rejecting it as an error)? Is the reason DuckDB compatibility?

It's a bit unfortunate that we can't tighten the signature of concat to reject non-string arguments, but that seems non-trivial.

It seems like concat_ws has problems similar to the ones this PR addresses for concat. I'm not sure offhand what the best fix is (reject it?), but it may be worth filing a separate issue.

Comment thread datafusion/sqllogictest/test_files/array/array_concat.slt
Comment thread datafusion/functions-nested/src/concat_rewrite.rs Outdated
- use array_concat expr_fn instead of hand-constructing ScalarFunction
- add FSL + List SLT case to cover mixed list-variant coercion
@Jefffrey
Contributor

I think I had a similar thought in that we should try to solve this in the optimizer/simplify stage, but I believe one of the drivers for this was Spark compatibility. See this comment:

cc @comphead

@comphead
Contributor

> I think I had a similar thought in that we should try to solve this in the optimizer/simplify stage, but I believe one of the drivers for this was Spark compatibility. See this comment:
>
> cc @comphead

Thanks @Jefffrey
My concern with this approach is that downstream projects are used to calling built-in functions, or Spark built-in functions, without going through the full DataFusion pipeline, and I suspect that in that case the downstream users would not see the benefit of this PR.

Comment thread datafusion/functions-nested/src/concat_rewrite.rs
@hcrosse
Contributor Author

hcrosse commented Apr 21, 2026

> My concern with this approach is that downstream projects are used to calling built-in functions, or Spark built-in functions, without going through the full DataFusion pipeline, and I suspect that in that case the downstream users would not see the benefit of this PR.

@comphead I checked Comet's native planner and it looks like it builds PhysicalExprs directly from its own protobuf serde rather than going through SessionState::create_physical_expr, so I don't think FunctionRewrites apply there. Comet's Scala-side CometConcat serde also already gates to all-string children and falls back to Spark JVM for any array-typed concat, so concat(array, ...) doesn't cross into native in Comet today. So this PR shouldn't be a regression for Comet as it stands (I think 😅)

If native array concat support gets added later, there are two hooks: on the Comet side, having CometConcat detect array children and emit array_concat in the protobuf; and on the DataFusion side, execution-layer dispatch inside ConcatFunc::invoke_with_args, which needs an element-wise list concat kernel that arrow-rs doesn't currently expose. apache/arrow-rs#1772 tracks adding that kernel. Open to other approaches if there's one I'm missing - still trying to learn how everything pieces together!
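For readers unfamiliar with the term, here is a minimal sketch of what an element-wise list concat computes, assuming a simplified offsets-plus-values layout. `ListColumn` and `concat_elementwise` are hypothetical names, nulls are ignored entirely, and nothing here reflects the arrow-rs API or the design tracked in apache/arrow-rs#1772.

```rust
/// Toy list column: row i holds values[offsets[i]..offsets[i + 1]].
/// Null handling is deliberately left out of this sketch.
#[derive(Debug, PartialEq)]
struct ListColumn {
    offsets: Vec<usize>,
    values: Vec<i64>,
}

impl ListColumn {
    fn from_rows(rows: &[Vec<i64>]) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for row in rows {
            values.extend_from_slice(row);
            offsets.push(values.len());
        }
        ListColumn { offsets, values }
    }

    /// Number of rows (one fewer than the number of offsets).
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn row(&self, i: usize) -> &[i64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Row i of the result is row i of `left` followed by row i of `right`.
fn concat_elementwise(left: &ListColumn, right: &ListColumn) -> Result<ListColumn, String> {
    if left.len() != right.len() {
        return Err(format!("row count mismatch: {} vs {}", left.len(), right.len()));
    }
    let mut offsets = Vec::with_capacity(left.len() + 1);
    offsets.push(0);
    let mut values = Vec::with_capacity(left.values.len() + right.values.len());
    for i in 0..left.len() {
        values.extend_from_slice(left.row(i));
        values.extend_from_slice(right.row(i));
        offsets.push(values.len());
    }
    Ok(ListColumn { offsets, values })
}

fn main() {
    let left = ListColumn::from_rows(&[vec![1, 2, 3], vec![10]]);
    let right = ListColumn::from_rows(&[vec![4, 5], vec![]]);
    let out = concat_elementwise(&left, &right).unwrap();
    for i in 0..out.len() {
        println!("row {}: {:?}", i, out.row(i));
    }
}
```

The point of the sketch is that the operation works row by row across two columns, which differs from appending one whole column after another.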

Contributor

@Jefffrey Jefffrey left a comment

Makes sense to me 👍

Comment thread datafusion/functions-nested/src/concat_rewrite.rs Outdated
Comment thread datafusion/functions-nested/src/concat_rewrite.rs
@comphead
Contributor

> If native array concat support gets added later, there are two hooks: on the Comet side, having CometConcat detect array children and emit array_concat in the protobuf; and on the DataFusion side, execution-layer dispatch inside ConcatFunc::invoke_with_args, which needs an element-wise list concat kernel that arrow-rs doesn't currently expose. apache/arrow-rs#1772 tracks adding that kernel. Open to other approaches if there's one I'm missing - still trying to learn how everything pieces together!

Sorry folks, I'm still not very confident about this change.

From Comet's perspective you're totally right: Comet typically carries patches to be Spark compliant, sometimes going ahead and then backporting changes to DataFusion. But there are also other Spark-based projects like LakeSail, and other folks who directly replace functions with their DataFusion counterparts, and my expectation is that they would also benefit from this function.

My understanding is that the implementation should be pretty straightforward: just delegate the call to array_concat if the incoming arguments for concat are lists. However, there is a fundamental issue with dependencies, namely:

The problem: ConcatFunc (string) and ArrayConcat live in separate crates with a one-way dependency (functions-nested → functions), so adding a reverse dependency would be circular. That means ConcatFunc::invoke_with_args can't call array_concat_inner directly.

I think we need to investigate how to overcome this issue.

@hcrosse
Contributor Author

hcrosse commented Apr 22, 2026

> Sorry folks, I'm still not very confident about this change.
>
> ...
>
> My understanding is that the implementation should be pretty straightforward: just delegate the call to array_concat if the incoming arguments for concat are lists. However, there is a fundamental issue with dependencies, namely:
>
> The problem: ConcatFunc (string) and ArrayConcat live in separate crates with a one-way dependency (functions-nested → functions), so adding a reverse dependency would be circular. That means ConcatFunc::invoke_with_args can't call array_concat_inner directly.
>
> I think we need to investigate how to overcome this issue.

Yeah that's fair. I think that call probably takes more familiarity with the project than I currently have, so it might be best to close this PR for now.

I'll take a look at implementing the list parts of apache/arrow-rs#1772 in the meantime - if I can land that in a reasonable timeframe it may be the cleanest solution.

Labels

  • core: Core DataFusion crate
  • functions: Changes to functions implementation
  • sqllogictest: SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

unexpected output for concat for arrays

4 participants