Cherry-pick sort-merge fixes to branch-52 #21106

Closed
xudong963 wants to merge 4 commits into apache:branch-52 from massive-com:cherry-pick-sort-merge-fixes-52

@xudong963
Member

Summary

Test plan

  • `cargo build -p datafusion-physical-plan` compiles cleanly
  • CI passes

🤖 Generated with Claude Code

zhuqi-lucas and others added 4 commits March 16, 2026 18:09

- Closes #.

This PR fixes memory reservation starvation in sort-merge when multiple
sort partitions share a `GreedyMemoryPool`.

When multiple `ExternalSorter` instances run concurrently and share a
single memory pool, the merge phase starves:

1. Each partition pre-reserves `sort_spill_reservation_bytes` via
`merge_reservation`.
2. On entering the merge phase, `new_empty()` was used to create a new
reservation starting at 0 bytes, while the pre-reserved bytes sat idle
in `ExternalSorter.merge_reservation`.
3. Those freed bytes were immediately consumed by other partitions
racing for memory.
4. The merge could no longer allocate memory from the pool, leading to
OOM / starvation.
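
The steps above can be sketched with a toy model. This is not DataFusion code: `ToyPool`, `ToyReservation`, `contended_pool`, `merge_from_empty`, and `merge_from_taken` are hypothetical names invented for illustration, and the byte counts are arbitrary. The method names `new_empty`, `try_grow`, and `take` only loosely mirror the reservation API. Handing the pre-reserved bytes to the merge via `take` is an assumption about how the starvation can be avoided, not a description of the actual patch.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy stand-in for a greedy pool: requests are granted first-come,
/// first-served until `limit` is reached. Hypothetical, not DataFusion's type.
struct ToyPool {
    limit: usize,
    used: usize,
}

/// Toy reservation tracking bytes held against a shared pool.
/// Unlike a real reservation, it does not return bytes on drop.
struct ToyReservation {
    pool: Rc<RefCell<ToyPool>>,
    size: usize,
}

impl ToyReservation {
    /// New reservation holding 0 bytes.
    fn new_empty(pool: &Rc<RefCell<ToyPool>>) -> Self {
        Self { pool: Rc::clone(pool), size: 0 }
    }

    /// Try to reserve `bytes` more; fails if the pool limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let mut pool = self.pool.borrow_mut();
        if pool.used + bytes > pool.limit {
            return Err(format!(
                "cannot reserve {bytes} bytes: {} of {} in use",
                pool.used, pool.limit
            ));
        }
        pool.used += bytes;
        self.size += bytes;
        Ok(())
    }

    /// Move all held bytes into a new reservation without touching the pool,
    /// so no other consumer can grab them in between.
    fn take(&mut self) -> ToyReservation {
        let size = std::mem::replace(&mut self.size, 0);
        ToyReservation { pool: Rc::clone(&self.pool), size }
    }
}

/// Contended state: one partition has pre-reserved `merge_bytes`, and another
/// partition has greedily taken everything that was left in the pool.
fn contended_pool(limit: usize, merge_bytes: usize) -> (ToyReservation, ToyReservation) {
    let pool = Rc::new(RefCell::new(ToyPool { limit, used: 0 }));
    let mut merge_reservation = ToyReservation::new_empty(&pool);
    merge_reservation.try_grow(merge_bytes).expect("pool starts empty");
    let mut other = ToyReservation::new_empty(&pool);
    other.try_grow(limit - merge_bytes).expect("remaining bytes are free");
    (merge_reservation, other)
}

/// Merge starts from an empty reservation; the pre-reserved bytes sit idle.
fn merge_from_empty(limit: usize, merge_bytes: usize) -> Result<usize, String> {
    let (merge_reservation, _other) = contended_pool(limit, merge_bytes);
    let mut merge = ToyReservation::new_empty(&merge_reservation.pool);
    merge.try_grow(merge_bytes)?; // fails: the pool is already full
    Ok(merge.size)
}

/// Merge starts from the pre-reserved bytes, so it never competes for them.
fn merge_from_taken(limit: usize, merge_bytes: usize) -> Result<usize, String> {
    let (mut merge_reservation, _other) = contended_pool(limit, merge_bytes);
    let merge = merge_reservation.take();
    Ok(merge.size)
}
```

With a 100-byte pool and a 40-byte pre-reservation, `merge_from_empty(100, 40)` returns an error because the other partition already holds the remaining 60 bytes, while `merge_from_taken(100, 40)` begins the merge already holding its 40 bytes.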

~~I can't find a deterministic way to reproduce the bug, but it occurs
in our production.~~ Added an end-to-end test to verify the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cherry-picked commit from branch-51 used `get_reserved_byte_for_record_batch_size`
(1 param), but branch-52 has `get_reserved_bytes_for_record_batch_size` (2 params).
Update the call site to use the branch-52 function signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the following labels on Mar 23, 2026: documentation, development-process, logical-expr, physical-expr, optimizer, core, sqllogictest, common, execution, proto, datasource, ffi, physical-plan
@xudong963 xudong963 closed this Mar 23, 2026