Skip to content

refactor: Simplify NLJ re-scans with ReplayableStreamSource#21742

Open
2010YOUY01 wants to merge 10 commits intoapache:mainfrom
2010YOUY01:nlj-refactor
Open

refactor: Simplify NLJ re-scans with ReplayableStreamSource#21742
2010YOUY01 wants to merge 10 commits intoapache:mainfrom
2010YOUY01:nlj-refactor

Conversation

@2010YOUY01
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

Background

#21448 introduced memory-limited execution for NLJ (thanks to @viirya).

The idea is:

  1. Load the build (left) side of the NLJ until the memory limit is reached.
  2. Probe the right side and complete the join for the current buffered build-side chunk.
  3. Load the next chunk of the build side and repeat the right-side scan until all data is processed.

To support repeated probing of the right side, input batches are spilled to disk during the first pass. In subsequent passes, input is read directly from the spill. This avoids re-evaluating potentially expensive pipelines (e.g., Parquet decoding + filtering), making repeated probes both memory-efficient and fast.

This PR

This PR extracts the spill-backed replayable stream into a separate module to simplify the NLJ implementation.

Although the lines of code increase, the new module provides a clearer interface and I think it's easier to maintain.

Additionally, this utility may be useful elsewhere. I have seen a similar pattern in SedonaDB for memory-limited spatial joins.

What changes are included in this PR?

  • Introducing ReplayableStreamSource for the above purpose
  • Refactor NLJ logic with ReplayableStreamSource

Are these changes tested?

UTs, also covered by existing memory-limited NLJ test

Are there any user-facing changes?

No

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 20, 2026
Comment thread datafusion/physical-plan/src/joins/nested_loop_join.rs Outdated
Co-authored-by: Yongting You <2010youy01@gmail.com>
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Comment thread datafusion/physical-plan/src/joins/nested_loop_join.rs
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Copy link
Copy Markdown
Member

@martin-g martin-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Few nitpicks which you could just ignore!

Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs Outdated
Comment thread datafusion/physical-plan/src/spill/replayable_spill_input.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants