Rewrite FileStream in terms of Morsel API #21342
Conversation
Force-pushed from 816d243 to 3346af7
```rust
/// This groups together ready planners, ready morsels, the active reader,
/// pending planner I/O, the remaining files and limit, and the metrics
/// associated with processing that work.
pub(super) struct ScanState {
```
This is the new inner state machine for FileStream
I think some more diagrams in the docstring of the struct and/or fields could help. I'm trying to wrap my head around how the IO queue and such work.
I have added a diagram - let me know if that helps or if there is something else I can do
```rust
use std::sync::Arc;
use std::sync::mpsc::{self, Receiver, TryRecvError};

/// Adapt a legacy [`FileOpener`] to the morsel API.
```
This is an adapter so that existing FileOpeners continue to have the same behavior
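The adapter idea can be sketched in miniature. The following is a hypothetical, heavily simplified illustration of wrapping a legacy opener-style trait in a type that implements a morsel-style trait; none of these names or signatures are the real DataFusion `FileOpener`, `Morselizer`, or `FileOpenerMorselizer` types, and batches are reduced to `String`s for clarity.

```rust
/// Stand-in for the legacy API: opens a file and yields its batches.
/// (Hypothetical; the real `FileOpener` is async and returns streams.)
trait LegacyOpener {
    fn open(&self, path: &str) -> Vec<String>;
}

/// Stand-in for the new morsel-oriented API.
trait Morselizer {
    fn next_morsel(&mut self) -> Option<Vec<String>>;
}

/// Adapter: drives a `LegacyOpener` through the `Morselizer` interface,
/// producing one morsel per remaining file, so existing openers keep
/// their current behavior unchanged.
struct OpenerMorselizer<O: LegacyOpener> {
    opener: O,
    files: Vec<String>,
}

impl<O: LegacyOpener> Morselizer for OpenerMorselizer<O> {
    fn next_morsel(&mut self) -> Option<Vec<String>> {
        // Take the next file; `None` once all files are consumed.
        let path = self.files.pop()?;
        Some(self.opener.open(&path))
    }
}

/// Trivial opener used to exercise the adapter.
struct EchoOpener;
impl LegacyOpener for EchoOpener {
    fn open(&self, path: &str) -> Vec<String> {
        vec![format!("batch from {path}")]
    }
}
```

The point of the shape is that callers written against the morsel API never see the legacy trait, which is what lets the two APIs coexist during the migration.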
```diff
@@ -0,0 +1,556 @@
// Licensed to the Apache Software Foundation (ASF) under one
```
This is testing infrastructure to write the snapshot tests
Basically it makes a mock morselizer that records its steps so that the control flow of FileStream can be tested / verified
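The recording-mock pattern described here can be sketched as follows. This is a hypothetical illustration, not the actual test harness: a test double logs every call made to it, so the driver's control flow can later be asserted against (or snapshotted).

```rust
/// Hypothetical mock in the spirit described above: it records each call
/// so a test can verify the exact sequence of operations the driver made.
struct RecordingMorselizer {
    /// Human-readable log of every call, in order.
    calls: Vec<String>,
    /// How many morsels this mock will produce before reporting exhaustion.
    morsels_left: usize,
}

impl RecordingMorselizer {
    fn new(morsels: usize) -> Self {
        Self { calls: Vec::new(), morsels_left: morsels }
    }

    /// Record the call, then either hand out a morsel or signal "done".
    fn next_morsel(&mut self) -> Option<usize> {
        self.calls
            .push(format!("next_morsel (remaining: {})", self.morsels_left));
        if self.morsels_left == 0 {
            None
        } else {
            self.morsels_left -= 1;
            Some(self.morsels_left)
        }
    }
}
```

A snapshot test would then compare `calls` against a checked-in expected sequence, which is what documents the driver's I/O and CPU ordering.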
```rust
        return Poll::Ready(Some(Err(err)));
    }
}
FileStreamState::Scan { scan_state: queue } => {
```
Moved the inner state machine into a separate module/struct to try to keep indentation under control and encapsulate the complexity somewhat
```rust
    assert!(err.contains("FileStreamBuilder invalid partition index: 1"));
}

/// Verifies the simplest morsel-driven flow: one planner produces one
```
Here are tests showing the sequence of calls to the various morsel APIs. I intend to use this framework to show how work can migrate from one stream to the other
Force-pushed from b5c452a to d5a1f74
```toml
all-features = true

[features]
backtrace = ["datafusion-common/backtrace"]
```
I added this while debugging why the tests failed on CI but not locally (when this feature flag was on, the error messages got mangled).
I added a crate-level feature to enable the feature in datafusion-common so I could reproduce the error locally.
Force-pushed from d5a1f74 to b2c9bd6
adriangb left a comment
Ran out of time for the last couple of files. A lot of the comments are just tracking my thought process; I plan to go over them again to clarify my own understanding, but maybe they're helpful as input on how the code reads top to bottom for a first-time reader.
```rust
/// Creates a `dyn Morselizer` based on given parameters.
///
/// The default implementation preserves existing behavior by adapting the
/// legacy [`FileOpener`] API into a [`Morselizer`].
///
/// It is preferred to implement the [`Morselizer`] API directly by
/// implementing this method.
fn create_morselizer(
    &self,
    object_store: Arc<dyn ObjectStore>,
    base_config: &FileScanConfig,
    partition: usize,
) -> Result<Box<dyn Morselizer>> {
    let opener = self.create_file_opener(object_store, base_config, partition)?;
    Ok(Box::new(FileOpenerMorselizer::new(opener)))
}
```
```rust
/// Configure the [`FileOpener`] used to open files.
///
/// This will overwrite any setting from [`Self::with_morselizer`]
pub fn with_file_opener(mut self, file_opener: Arc<dyn FileOpener>) -> Self {
```
While I think it could make sense to keep FileOpener as a public API for building data sources (if we consider it simpler, for folks who don't care about perf), this method in particular seems like a mostly internal method (even if it is pub) that we might as well deprecate / remove.
This method is the way we could keep using FileOpener (as it is simpler)
I am not sure how we could still allow using FileOpener but not keep this method
```rust
/// The active reader, if any.
reader: Option<BoxStream<'static, Result<RecordBatch>>>,
```
Is there one ScanState across all partitions or one per partition? I'm guessing the latter: `file_iter: VecDeque<PartitionedFile>` is the files for this partition, and we pump all of the files into one output stream of RecordBatch (`reader`). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.
One per partition
> we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.
My initial proposal (following @Dandandan's original design) is that, when possible, the files are put into a shared queue so that when a FileStream is ready it gets the next file.

I think once we get that structure in place, we can contemplate more sophisticated designs (like one FileStream preparing a parquet file and then divvying up the record batches between other cores)
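The shared-queue idea described here can be sketched with plain std types. This is a hypothetical illustration (the real implementation lives in the stacked PRs and holds `PartitionedFile`s, not `String`s): sibling streams clone a handle to one queue, so whichever stream becomes ready first pulls the next file instead of idling on an empty per-partition list.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical shared work queue. Each sibling FileStream holds a clone;
/// the `Arc<Mutex<..>>` makes the underlying deque a single shared pool.
#[derive(Clone)]
struct SharedFileQueue {
    files: Arc<Mutex<VecDeque<String>>>,
}

impl SharedFileQueue {
    fn new(files: impl IntoIterator<Item = String>) -> Self {
        Self {
            files: Arc::new(Mutex::new(files.into_iter().collect())),
        }
    }

    /// Called by whichever stream becomes ready first: pops the next
    /// file from the front of the shared pool, or `None` when drained.
    fn next_file(&self) -> Option<String> {
        self.files.lock().unwrap().pop_front()
    }
}
```

This is what gives the dynamic load balancing: a partition that finishes its file early immediately takes the next one from the pool, rather than each partition being pinned to a fixed up-front file assignment.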
> Is there one ScanState across all partitions or one per partition? I'm guessing the latter: file_iter: VecDeque is the files for this partition
yes, it is one ScanState per partition
> we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.
Yes this is right
> My initial proposal (following @Dandandan's original design) is that when possible the files are put into a shared queue so that when a FileStream is ready it gets the next file

> yes, it is one ScanState per partition
I'm a bit confused then: if there is one ScanState per partition then there is one VecDeque<PartitionedFile>, which means it's not shared between partitions. But that would contradict "files are put into a shared queue so that when a FileStream is ready it gets the next file"?
You can see how cross-stream sharing works in the next stacked PR: the ScanState is not shared across partitions, but it has a new work queue that is (potentially) shared. The relevant change is to replace the `file_iter` with this `work_source` thing and then handle setting up the `work_source` in the DataSource exec:

```rust
pub(super) struct ScanState {
    /// Files that still need to be planned.
    file_iter: VecDeque<PartitionedFile>,
    ...
```

With:

```rust
pub(super) struct ScanState {
    /// Unopened files that still need to be planned for this stream.
    work_source: WorkSource,
    ...
```
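One plausible shape for such a `WorkSource` is an enum over "this stream owns its files" versus "files come from a queue shared with sibling streams". This is a hypothetical sketch under that assumption (simplified to `String` file names); the real `WorkSource` in the stacked PR may be structured differently.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical abstraction over where a stream's remaining files come from.
enum WorkSource {
    /// This partition owns its files outright (the current behavior,
    /// equivalent to the old `file_iter`).
    Own(VecDeque<String>),
    /// Files sit in a pool shared with sibling streams; whichever
    /// stream asks first gets the next file.
    Shared(Arc<Mutex<VecDeque<String>>>),
}

impl WorkSource {
    /// Uniform entry point: the scan loop doesn't care which variant
    /// it is driving.
    fn next_file(&mut self) -> Option<String> {
        match self {
            WorkSource::Own(files) => files.pop_front(),
            WorkSource::Shared(queue) => queue.lock().unwrap().pop_front(),
        }
    }
}
```

The design benefit is that the scan loop stays identical in both modes, and the decision of whether work can be reordered (and therefore shared) is made once, when the DataSource exec sets up the `WorkSource`.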
OK, the first PR in the chain is ready for review (that is basically 50% of this PR).
```rust
// Morsels should ideally only expose ready-to-decode streams,
// but tolerate pending readers here.
```
That is a good question... I think we would have to change the inner API to use something other than Stream (perhaps just an iterator). I'll see what I can come up with
```rust
self.ready_morsels.extend(plan.take_morsels());
self.ready_planners.extend(plan.take_ready_planners());
if let Some(pending_planner) = plan.take_pending_planner() {
    self.pending_planner = Some(pending_planner);
```
This is a good call -- I think a queue of pending planners is best. Will do
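The suggested change amounts to replacing a single `pending_planner: Option<P>` slot with a FIFO queue. A minimal sketch of that shape (hypothetical names, generic over whatever the planner type ends up being):

```rust
use std::collections::VecDeque;

/// Hypothetical replacement for `pending_planner: Option<P>`: a FIFO
/// queue so several planners can be pending at once instead of at most one.
struct PendingPlanners<P> {
    queue: VecDeque<P>,
}

impl<P> PendingPlanners<P> {
    fn new() -> Self {
        Self { queue: VecDeque::new() }
    }

    /// Enqueue a planner whose I/O is still outstanding.
    fn push(&mut self, planner: P) {
        self.queue.push_back(planner);
    }

    /// Take the oldest pending planner, preserving submission order.
    fn pop(&mut self) -> Option<P> {
        self.queue.pop_front()
    }
}
```

FIFO order keeps planners completing roughly in the order their I/O was issued, which matters if output ordering is to be preserved.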
I have also updated the testing mocks to more closely follow the morsel API (so that I can test the suggestions from @adriangb). I ran out of time for now, but hopefully soon I'll move on to trying to explore:

Let me know when you want me to do the next (which feels like it might be the final) round of review.
Thanks -- I haven't explored this more yet. But I can also do it as part of a follow-on PR (I think it would need to change the Morsel API); I don't think it really changes this PR per se.
adriangb left a comment
Makes sense to me. We can iterate on the APIs as a follow-up, but I think we should keep that in mind; it feels like there are some improvements we can make.
Yes, it is an excellent point actually -- and one I think we can resolve.

FYI @Dandandan @zhuliquan and @xudong963 in case you would like to review.

Update: I filed a ticket to explain what I found here.

I'll merge this one in and get the final one ready for review.

Thank you for helping this along @adriangb

Woah merged!!! 🥳 🥳 🥳

Well, this is just one of the refactors -- the real change in behavior (benefit) comes in
## Which issue does this PR close?

- Closes #20529
- Closes #20820

## Rationale for this change

This PR finally enables dynamic work scheduling in the FileStream (so that if a task is done it can look at any remaining work). This improves performance on queries that scan multiple files where the work is not balanced evenly across partitions in the plan (e.g. we have dynamic filtering that reduces work significantly).

It is the last of a sequence of several PRs:

- #21342
- #21327
- #21340

## What changes are included in this PR?

1. Add shared state across sibling FileStreams and the wiring to connect them
2. Sibling streams put their file work into a shared queue when it can be reordered
3. Add a bunch of tests

Note there are a bunch of other things that are NOT included in this PR, including:

1. Trying to limit concurrent IO (this PR has the same properties as main -- up to one outstanding IO per partition)
2. Trying to issue multiple IOs from the same partition (aka to interleave IO and CPU work more)
3. Splitting files into smaller units (e.g. across row groups)

As @Dandandan proposes below, I expect we can work on those changes as follow-on PRs.

## Are these changes tested?

Yes, by existing functional and benchmark tests, as well as new functional tests.

## Are there any user-facing changes?

Yes, faster performance (see benchmarks): #21351 (comment)

---

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
…er` (apache#21327)

~(Draft until I am sure I can use this API to make FileStream behave better)~

## Which issue does this PR close?

- Part of apache#20529
- Needed for apache#21351
- Broken out of apache#20820
- Closes apache#21427

## Rationale for this change

I can get 10% faster on many ClickBench queries by reordering files at runtime. You can see it all working together here: apache#21351

To do so, I need to rework the FileStream so that it can reorder operations at runtime. Eventually that will include both CPU and IO. This PR is a step in that direction by introducing the main Morsel API and implementing it for Parquet. The next PR (apache#21342) rewrites FileStream in terms of the Morsel API.

## What changes are included in this PR?

1. Add the proposed `Morsel` API
2. Rewrite the Parquet opener in terms of that API
3. Add an adapter layer (back to FileOpener, so I don't have to rewrite FileStream in the same PR)

My next PR will rewrite the FileStream to use the Morsel API.

## Are these changes tested?

Yes, by existing CI. I will work on adding additional tests for just the Parquet opener in a follow-on PR.

## Are there any user-facing changes?

No
Stacked on

- apache#21327
- apache#21340

## Which issue does this PR close?

- Part of apache#20529
- Broken out of apache#20820

## Rationale for this change

The Morsel API allows for finer-grained parallelism (and IO). It is important to have the FileStream work in terms of the Morsel API to allow future features (like work stealing, etc.)

## What changes are included in this PR?

I apologize for the large diff; note that about half of this PR is tests and a test framework to test the calling sequence of FileStream.

1. Rewrite FileStream in terms of the Morsel API
2. Add snapshot-driven tests to document the I/O and CPU patterns in FileStream
3. Add snapshot-based tests that show the ordering of files

## Are these changes tested?

Yes, by existing functional and benchmark tests, as well as new functional snapshot-based tests.

## Are there any user-facing changes?

No (not yet)

---

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Stacked on

- `ParquetOpener` to `ParquetMorselizer` #21327

## Which issue does this PR close?

## Rationale for this change

The Morsel API allows for finer-grained parallelism (and IO). It is important to have the FileStream work in terms of the Morsel API to allow future features (like work stealing, etc.)

## What changes are included in this PR?

I apologize for the large diff; note that about half of this PR is tests and a test framework to test the calling sequence of FileStream.

## Are these changes tested?

Yes, by existing functional and benchmark tests, as well as new functional snapshot-based tests.

## Are there any user-facing changes?

No (not yet)