fix(topk): avoid overflow panic in interleave emission for Utf8#20494
fix(topk): avoid overflow panic in interleave emission for Utf8#20494aviralgarg05 wants to merge 7 commits intoapache:mainfrom
Utf8#20494Conversation
kosiew
left a comment
There was a problem hiding this comment.
@aviralgarg05
Thanks for working on this
|
FYI, I encountered a similar case in sort merge: #20922, and there is a PR changing panic to error for interleavehttps://github.com/apache/arrow-rs/pull/9549 |
|
run benchmarks |
|
🤖 Benchmark completed (GKE) | trigger Details
Resource Usagetpch — base (merge-base)
tpch — branch
|
|
🤖 Benchmark completed (GKE) | trigger Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
|
|
🤖 Benchmark completed (GKE) | trigger Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
|
alamb
left a comment
There was a problem hiding this comment.
Thank you for this contirbution @aviralgarg05 and @xudong963 -- I am a little worried about the complexity added in this patch for two reasons:
- It doesn't appear to be adequately tested
- It may introduce a performance regression when generating output
Typically when people hit offset errors related to using Utf8 (because of offset overflows) our advice is either:
- Use Utf8View ("String View")
- Use LargeUtf8
I am not sure it is a good tradeoff to slow down / make the Utf8 path more complex
Utf8
Yes, the changes are in the hot path, for all cases, it'll do some checks. I encountered a similar overflow in the SPM path #20922, and introduced a different way to fix it: if we get the panic info, capture it, then progressively retry until no overflow, which only affects the overflow path. |
Which issue does this PR close?
Rationale for this change
While testing, I found that
ORDER BY ... LIMITcould panic in TopK when interleaving very largeUtf8values. The problem was that TopK tried to interleave all selected rows in a single call, which can overflow Arrow’s i32 string offsets in extreme cases. This PR removes that panic path so the operator behaves safely under large-string workloads.What changes are included in this PR?
TopK.batch_sizefield.test_topk_heap_emit_with_state_respects_batch_size.Are these changes tested?
Yes.
datafusion/physical-plan/src/topk/mod.rsfor chunked TopK emission behavior.cargo fmt --all --check.cargo clippy -p datafusion-physical-plan --lib --tests -- -D warnings.cargo test -p datafusion-physical-plan topk::tests:: -- --nocapturethree times after final changes.Are there any user-facing changes?
Yes. For extreme large-string cases in
ORDER BY ... LIMIT, TopK no longer panics with overflow; it now follows normal error/execution handling.