fix: improve sort pushdown benchmark data and add DESC LIMIT queries #21711
zhuqi-lucas merged 3 commits into apache:main
Conversation
Pull request overview
Updates the sort pushdown Inexact overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).
Changes:
- Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered l_orderkey, producing scrambled, overlapping RGs.
- Updates benchmark script messaging/comments to reflect the new generation strategy.
Both Copilot comments were about the old ORDER BY + jitter/scramble approach, which has been completely replaced. The new approach uses pyarrow to redistribute row groups from a sorted temp file into multiple output files with scrambled RG order. Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file, which properly tests row-group-level reorder (reorder_by_statistics within each file), TopK threshold initialization from RG statistics, and file-level ordering effects. The single-file ORDER BY approach was fundamentally flawed, as described in the commit message below.
Both the inexact and overlap benchmark data generation had problems:

1. The original single-file ORDER BY approach caused the parquet writer to merge rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M and making reorder_by_statistics a no-op.
2. The per-file split fix (one RG per file) meant reorder_by_statistics had nothing to reorder within each file, since each had only 1 RG. RG reorder is an intra-file optimization.

Fix by generating multiple files where each file has MULTIPLE row groups in scrambled order:

- inexact: 3 files x ~20 RGs each (scrambled within each file)
- overlap: 5 files x ~12 RGs each (different permutation)

Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file. This properly tests:

- Row-group-level reorder (reorder_by_statistics within each file)
- TopK threshold initialization from RG statistics
- File-level ordering effects

Uses pyarrow to read RGs from a sorted temp file and redistribute them into multiple output files with scrambled RG order.
Add q5-q8: ORDER BY l_orderkey DESC LIMIT queries on sorted data. These test the reverse scan + TopK optimization path (Inexact sort pushdown), which benefits from RG reorder, stats init, and cumulative pruning.

- q5: DESC LIMIT 100 (narrow projection)
- q6: DESC LIMIT 1000 (narrow projection)
- q7: DESC LIMIT 100 (wide projection)
- q8: DESC LIMIT 1000 (wide projection)
…s PR:

- benchmarks/bench.sh: data generation fix belongs in apache#21711
- listing_table_partitions.slt: unrelated change from another branch

Fuzz test tiebreaker retained (needed for RG reorder on all TopK).
alamb left a comment
I ran this like

./benchmarks/bench.sh data sort_pushdown_inexact

and then verified the parquet layout like this:
> select filename, row_group_id, row_group_num_rows from parquet_metadata('/Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet') group by 1,2,3 order by 1,2,3;
+--------------------------------------------------------------------------------------------------------+--------------+--------------------+
| filename | row_group_id | row_group_num_rows |
+--------------------------------------------------------------------------------------------------------+--------------+--------------------+
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 0 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 1 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 2 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 3 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 4 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 5 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 6 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 7 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 8 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 9 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 10 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 11 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 12 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 13 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 14 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 15 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 16 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 17 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 18 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact/lineitem/part_001.parquet | 19 | 100000 |
+--------------------------------------------------------------------------------------------------------+--------------+--------------------+
20 row(s) fetched.
Elapsed 0.004 seconds.

> select filename, row_group_id, row_group_num_rows from parquet_metadata('/Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet') group by 1,2,3 order by 1,2,3;
+----------------------------------------------------------------------------------------------------------------+--------------+--------------------+
| filename | row_group_id | row_group_num_rows |
+----------------------------------------------------------------------------------------------------------------+--------------+--------------------+
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 0 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 1 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 2 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 3 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 4 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 5 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 6 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 7 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 8 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 9 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 10 | 100000 |
| /Users/andrewlamb/Software/datafusion2/benchmarks/data/sort_pushdown_inexact_overlap/lineitem/part_004.parquet | 11 | 100000 |
+----------------------------------------------------------------------------------------------------------------+--------------+--------------------+
12 row(s) fetched.
Elapsed 0.007 seconds.
As an aside, I wonder if it would be easier to understand these benchmarks if we could consolidate them somehow -- we have a bunch of cases that may be somewhat confusing:
# Sort Pushdown Benchmarks
sort_pushdown: Sort pushdown baseline (no WITH ORDER) on TPC-H data (SF=1)
sort_pushdown_sorted: Sort pushdown with WITH ORDER — tests sort elimination on non-overlapping files
sort_pushdown_inexact: Sort pushdown Inexact path (--sorted DESC) — multi-file with scrambled RGs, tests reverse scan + RG reorder
sort_pushdown_inexact_unsorted: Sort pushdown Inexact path (no WITH ORDER) — same data, tests Unsupported path + RG reorder
sort_pushdown_inexact_overlap: Sort pushdown Inexact path — multi-file scrambled RGs (streaming data scenario)
Thanks @alamb for the review; I am addressing the review comments.

I will update the benchmarks README to document the new q5-q8 DESC LIMIT queries.

I tried using pure datafusion-cli initially, but DataFusion's COPY writes rows sequentially. When two adjacent chunks have different l_orderkey ranges, the RG boundary merges rows from both, widening the min/max range to ~6M instead of ~100K. This defeats the narrow per-RG ranges the benchmark needs.

Will add a check before the python block:

if ! python3 -c "import pyarrow" 2>/dev/null; then
    echo "Error: pyarrow is required. Install with: pip install pyarrow"
    exit 1
fi

Both Copilot comments were about the old ORDER BY + jitter approach, which has been completely replaced with the pyarrow split approach.
- Add pyarrow dependency check with helpful error message before python3 blocks in data_sort_pushdown_inexact()
- Add Sort Pushdown section to benchmarks/README.md documenting all variants, queries (q1-q8), data generation requirements, and run commands
Merged to main, thanks @alamb for the review!
Which issue does this PR close?
Related to #21580
Rationale for this change
The sort pushdown benchmark had two problems:

1. Broken data generation: The single-file ORDER BY approach caused the parquet writer to merge rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M. The per-file split fix gave each file only 1 RG, so reorder_by_statistics (an intra-file optimization) had nothing to reorder.
2. Missing DESC LIMIT queries: The sort_pushdown benchmark only had ASC queries (sort elimination). No queries tested the reverse scan + TopK path (Inexact sort pushdown), which is where RG reorder, stats init, and cumulative pruning provide 20-58x improvement.

What changes are included in this PR?
1. Fix benchmark data generation
Generate multiple files with multiple scrambled RGs each:

- inexact: 3 files x ~20 RGs each
- overlap: 5 files x ~12 RGs each

Uses pyarrow to redistribute RGs from a sorted temp file into multiple output files with scrambled RG order. Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file.

2. Add DESC LIMIT queries to sort_pushdown benchmark

New q5-q8 for sort_pushdown (sorted data, WITH ORDER):

- q5: ORDER BY l_orderkey DESC LIMIT 100 (narrow projection)
- q6: ORDER BY l_orderkey DESC LIMIT 1000 (narrow projection)
- q7: SELECT * ORDER BY l_orderkey DESC LIMIT 100 (wide projection)
- q8: SELECT * ORDER BY l_orderkey DESC LIMIT 1000 (wide projection)

These test the Inexact sort pushdown path: reverse scan + TopK + dynamic filter, which benefits from the optimizations in #21580.
Are these changes tested?
Benchmark changes only. Verified locally:
Are there any user-facing changes?
No. Adds pyarrow as a dependency for generating benchmark datasets (pip install pyarrow).