fix: rewrite concat(array, ...) to array_concat#21689
fix: rewrite concat(array, ...) to array_concat#21689hcrosse wants to merge 4 commits intoapache:mainfrom
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Overall looks good!
Can we add a brief note to the PR description for why we want concat to work on arrays to begin with (e.g., rather than rejecting it as an error). Is the reason DuckDB compatibility?
It's a bit unfortunate that we can't tighten the signature of concat to reject non-string arguments, but that seems non-trivial.
Seems like concat_ws has similar problems to what this PR is addressing for concat. Not sure offhand the best fix (reject it?), but maybe worth filing a separate issue.
- use array_concat expr_fn instead of hand-constructing ScalarFunction - add FSL + List SLT case to cover mixed list-variant coercion
|
I think I had a similar thought in that we should try to solve this in the optimizer/simplify stage, but I believe one of the drivers for this was Spark compatibility. See this comment: cc @comphead |
Thanks @Jefffrey |
@comphead I checked Comet's native planner and it looks like it builds If native array concat support gets added later, there are two hooks: on the Comet side, having CometConcat detect array children and emit array_concat in the protobuf and on the DataFusion side, execution-layer dispatch inside |
Sorry folks, I still not very confident with this change. From Comet perspective you totally right, Comet typically has patches to be Spark compliant, some times going ahead and then backport changes to DataFusion, but there are also other Spark based project like LakeSail and other folks that directly replace functions with DF counterparties and my expectation would be them also benefit on this function. My understanding the implementation should be pretty straightforward, just delegate the call to array_concat if incoming arguments for The problem: I think we need to investigate how to overcome this issue. |
Yeah that's fair. I think that call probably takes more familiarity with the project than I currently have, so it might be best to close this PR for now. I'll take a look at implementing the list parts of apache/arrow-rs#1772 in the meantime - if I can land that in a reasonable timeframe it may be the cleanest solution. |
Which issue does this PR close?
concatfor arrays #18020Rationale for this change
concat(array, array, ...)in SQL dispatches to the stringconcatUDF, which only handles string and binary types. With array arguments the call is coerced to a string form and concatenated textually, soconcat([1,2,3], [4,5])returns[1,2,3][4,5]instead of[1,2,3,4,5]. We wantconcatto work on arrays rather than rejecting the call as a type error, to match DuckDB's behavior.Two earlier attempts were rejected. #18137 changed
ConcatFunc's signature to accept arrays and brokesimplify_expressions. #18105 duplicatedarray_concatlogic insideConcatFunc.This PR rewrites
concat(array, ...)toarray_concat(array, ...)at the analyzer phase. Every logical plan gets the corrected behavior regardless of frontend, the stringconcatsignature stays untouched, and no array logic is duplicated.What changes are included in this PR?
A new
ConcatArrayRewriteFunctionRewritelives indatafusion-functions-nested. It detects calls toConcatFuncby identity check viaAny::is::<ConcatFunc>, so user-level shadowing ofconcatsuch as Spark's variant is unaffected. When all args resolve toList,LargeList, orFixedSizeListit rewrites toarray_concat_udf(). Mixed array and non-array returns aplan_err.The rewrite is wired into
SessionStateDefaults::default_function_rewrites()and registered on the analyzer inSessionStateBuilder::with_default_features(), which is the actual default-init path. It's also registered viaFunctionRegistry::register_function_rewriteinfunctions_nested::register_allas a fallback for custom registries.Known limitation:
concat(List, LargeList)hits an existingarray_concatcoercion bug (#21702). FSL + List works.Are these changes tested?
New SLT coverage in
array/array_concat.slt:NULL::integer[]at either position, all-NULL caseexplain.sltis updated to reflect the newapply_function_rewritesline that now appears inEXPLAIN VERBOSEoutput.Are there any user-facing changes?
concat(array, ...)now returns correctarray_concatresults instead of the prior wrong output. No public API changes.