refactor: Simplify NLJ re-scans with `ReplayableStreamSource` by 2010YOUY01 · Pull Request #21742 · apache/datafusion

2010YOUY01 · 2026-04-20T07:10:47Z

Which issue does this PR close?

Closes #.

Rationale for this change

Background

#21448 introduced memory-limited execution for NLJ (thanks to @viirya).

The idea is:

1. Load the build (left) side of the NLJ until the memory limit is reached.
2. Probe the right side and complete the join for the current buffered build-side chunk.
3. Load the next chunk of the build side and repeat the right-side scan until all data is processed.
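
The loop described above can be sketched in plain Rust. This is an illustration only, not the code from #21448: the function name `chunked_nested_loop_join`, the row-count limit `max_buffered_rows` (standing in for a real memory budget), and the in-memory `right` slice are all assumptions made for the example.

```rust
/// Hypothetical sketch of a memory-limited nested loop join.
/// `max_buffered_rows` stands in for the memory limit; a real operator
/// would track bytes reserved from a memory pool instead.
/// Returns the matching pairs and the number of passes over the right side.
fn chunked_nested_loop_join<L, R>(
    left: impl IntoIterator<Item = L>,
    right: &[R],
    max_buffered_rows: usize,
    pred: impl Fn(&L, &R) -> bool,
) -> (Vec<(L, R)>, usize)
where
    L: Clone,
    R: Clone,
{
    assert!(max_buffered_rows > 0, "buffer must hold at least one row");
    let mut left = left.into_iter().peekable();
    let mut matches = Vec::new();
    let mut right_passes = 0;

    while left.peek().is_some() {
        // Step 1: buffer build-side rows until the limit is reached.
        let chunk: Vec<L> = left.by_ref().take(max_buffered_rows).collect();

        // Step 2: one full scan of the right side for this chunk.
        right_passes += 1;
        for r in right {
            for l in &chunk {
                if pred(l, r) {
                    matches.push((l.clone(), r.clone()));
                }
            }
        }
        // Step 3: the loop continues with the next build-side chunk.
    }
    (matches, right_passes)
}
```

With five left rows and a limit of two buffered rows, the right side is scanned three times, which is why making each re-scan cheap matters.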

To support repeated probing of the right side, input batches are spilled to disk during the first pass. In subsequent passes, input is read directly from the spill. This avoids re-evaluating potentially expensive pipelines (e.g., Parquet decoding + filtering), making repeated probes both memory-efficient and fast.
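
A minimal sketch of the spill-and-replay idea, using only the standard library. It is not the `ReplayableStreamSource` API from this PR: the type `ReplayableInput`, its callback-based `scan` method, the `u64` row type, and the fixed-width file encoding are all hypothetical simplifications (the real code works with asynchronous record batch streams and DataFusion's spill facilities).

```rust
use std::fs::{self, File};
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::PathBuf;

/// Hypothetical stand-in for a spill-backed replayable input.
/// The first scan drains `upstream` and tees every value to `spill_path`;
/// later scans read the spill file and never touch `upstream` again.
struct ReplayableInput<I: Iterator<Item = u64>> {
    upstream: Option<I>,
    spill_path: PathBuf,
}

impl<I: Iterator<Item = u64>> ReplayableInput<I> {
    fn new(upstream: I, spill_path: PathBuf) -> Self {
        Self { upstream: Some(upstream), spill_path }
    }

    /// Runs one full pass over the input, calling `visit` for each value.
    fn scan(&mut self, mut visit: impl FnMut(u64)) -> io::Result<()> {
        if let Some(upstream) = self.upstream.take() {
            // First pass: evaluate the upstream pipeline and spill as we go.
            let mut writer = BufWriter::new(File::create(&self.spill_path)?);
            for value in upstream {
                writer.write_all(&value.to_le_bytes())?;
                visit(value);
            }
            writer.flush()?;
        } else {
            // Later passes: replay from disk, 8 bytes per value.
            let mut reader = BufReader::new(File::open(&self.spill_path)?);
            let mut buf = [0u8; 8];
            loop {
                match reader.read_exact(&mut buf) {
                    Ok(()) => visit(u64::from_le_bytes(buf)),
                    // End of file (the sketch also stops on a truncated record).
                    Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
                    Err(e) => return Err(e),
                }
            }
        }
        Ok(())
    }
}

impl<I: Iterator<Item = u64>> Drop for ReplayableInput<I> {
    fn drop(&mut self) {
        // Best-effort cleanup of the spill file.
        let _ = fs::remove_file(&self.spill_path);
    }
}
```

The sketch does not handle a first pass that fails or is abandoned halfway; a production version would need to decide whether the partial spill can be reused.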

This PR

This PR extracts the spill-backed replayable stream into a separate module to simplify the NLJ implementation.

Although the line count increases, the new module provides a clearer interface, and I think it is easier to maintain.

Additionally, this utility may be useful elsewhere. I have seen a similar pattern in SedonaDB for memory-limited spatial joins.

What changes are included in this PR?

- Introduce `ReplayableStreamSource` for the above purpose
- Refactor the NLJ logic to use `ReplayableStreamSource`

Are these changes tested?

Unit tests; the change is also covered by the existing memory-limited NLJ test.

Are there any user-facing changes?

No

Co-authored-by: Yongting You <2010youy01@gmail.com>

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>

martin-g

Thanks!

A few nitpicks, which you can feel free to ignore!

refactor: Simplify NLJ re-scans with ReplayableStreamSource

65ff524

github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 20, 2026

more comments

a8f702d

2010YOUY01 commented Apr 20, 2026

Comment thread datafusion/physical-plan/src/joins/nested_loop_join.rs Outdated

fix typo

f97f5bc

Co-authored-by: Yongting You <2010youy01@gmail.com>

martin-g reviewed Apr 20, 2026

2010YOUY01 and others added 4 commits April 21, 2026 16:09

review: refactor to make shared state management more clear

19494c2

fix lint

cf8a5d2

Update datafusion/physical-plan/src/spill/replayable_spill_input.rs

ecac3b8

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>

fix ci

b08d2e2

martin-g approved these changes Apr 21, 2026

2010YOUY01 added 3 commits April 21, 2026 17:34

review: cleanup

8403dae

rename struct

d7e81d9

ci fix

3eba174

2010YOUY01 wants to merge 10 commits into apache:main from 2010YOUY01:nlj-refactor