Fix elapsed_compute metric for Parquet DataSourceExec #20767
alamb merged 3 commits into apache:main
alamb
left a comment
Looks reasonable to me -- thank you
I wonder if there is any way to test this 🤔
Looks like CI has some failures
I am somewhat concerned about per-batch timing overhead. We've had to relax these timers in some places in Comet due to their overhead, especially since the standard library doesn't give you easy access to coarse timers.
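For context on that concern, here is a quick std-only micro-check (not from this PR; the function name is illustrative) of what one start/stop pair of `std::time::Instant` costs, which is roughly the per-batch price of the new timer:

```rust
use std::time::Instant;

// Illustrative micro-benchmark: average cost, in ns, of one start/stop
// timer pair, i.e. what a per-batch elapsed_compute timer adds per batch.
fn timer_pair_cost_ns(iters: u32) -> u128 {
    let outer = Instant::now();
    let mut sink = 0u128;
    for _ in 0..iters {
        let start = Instant::now();
        // Accumulate into `sink` so the inner timers aren't optimized away.
        sink = sink.wrapping_add(start.elapsed().as_nanos());
    }
    std::hint::black_box(sink);
    outer.elapsed().as_nanos() / iters as u128
}

fn main() {
    println!("~{} ns per timer pair", timer_pair_cost_ns(100_000));
}
```

On typical hardware this lands in the tens of nanoseconds, which is negligible once per decoded batch but can matter if placed per row.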
Adapt to new PushDecoderStreamState architecture: add BaselineMetrics to the stream state and wrap the compute path (projection + metrics copy) with an elapsed_compute timer. Update .slt test to use <slt:ignore> for the now-populated elapsed_compute value. Closes apache#18195
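The wrapping described in the commit can be sketched with std-only stand-ins (`ElapsedCompute`, the `Vec<i64>` batches, and `project_batches` are illustrative, not DataFusion's real `BaselineMetrics` API):

```rust
use std::time::{Duration, Instant};

// Stand-in for BaselineMetrics' elapsed_compute accumulator.
#[derive(Default)]
struct ElapsedCompute(Duration);

impl ElapsedCompute {
    fn add(&mut self, d: Duration) {
        self.0 += d;
    }
}

// The PR wraps the per-batch map() closure so that only compute work
// (projection, schema replacement, metrics copy) is counted, not I/O waits.
fn project_batches(batches: Vec<Vec<i64>>, metrics: &mut ElapsedCompute) -> Vec<Vec<i64>> {
    batches
        .into_iter()
        .map(|batch| {
            let timer = Instant::now(); // start elapsed_compute timer
            let projected: Vec<i64> = batch.iter().map(|v| v * 2).collect(); // stand-in "projection"
            metrics.add(timer.elapsed()); // stop timer: only compute is measured
            projected
        })
        .collect()
}

fn main() {
    let mut metrics = ElapsedCompute::default();
    let out = project_batches(vec![vec![1, 2], vec![3]], &mut metrics);
    assert_eq!(out, vec![vec![2, 4], vec![6]]);
    println!("elapsed_compute: {:?}", metrics.0);
}
```

The key design point is where the timer starts and stops: the closure runs once per decoded batch, so the timer excludes the async wait for the next batch and measures only CPU time.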
force-pushed from ccafa6b to a0a80db
rebased on latest main — the stream construction was refactored to use the new `PushDecoderStreamState`, so I also updated the `.slt` test. @mbutrovich — good point on timer overhead. The timer here fires once per decoded batch (the same granularity as every other operator using `BaselineMetrics`)
@mbutrovich can you give this one another look?
run benchmarks
I merged up from main and kicked off some benchmarks
🤖 Benchmark completed (GKE): tpch — base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds — base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs. branch
I don't see any noticeable difference in the benchmarks, so let's get this one in. Thank you @ernestprovo23 and @mbutrovich
Which issue does this PR close?
Closes part of #18195 — specifically the `elapsed_compute` baseline metric sub-item for Parquet scans.
Rationale
`EXPLAIN ANALYZE` on Parquet scans reports `elapsed_compute` values like `14ns` for full table scans, which is misleading. The metric was never being populated because no timer wrapped the per-batch compute work in the Parquet scan path.
What changes are included in this PR?
Follows the same pattern established in PR #18901 (CSV fix):
- `BaselineMetrics` instantiation in `ParquetOpener::open()` using the existing `metrics` and `partition_index` fields
- wrapping the `map()` closure with an `elapsed_compute` timer that measures projection, schema replacement, and metrics copy work

Single file changed: `datafusion/datasource-parquet/src/opener.rs` (+7, -3 lines)
Are these changes tested?
- existing tests in `datafusion-datasource-parquet` pass
- verified `elapsed_compute` values in `EXPLAIN ANALYZE` output (no longer showing nanosecond-level values for real scans)
Are there any user-facing changes?
`EXPLAIN ANALYZE` output for Parquet scans will now show accurate `elapsed_compute` values reflecting actual CPU time spent on per-batch processing.
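Since `elapsed_compute` now carries a run-dependent value, sqllogictest expectations that match `EXPLAIN ANALYZE` output must mask it, which is what the commit's `<slt:ignore>` change does. A hypothetical `.slt` fragment illustrating the idea (the actual test file, table name, and plan output are not shown in this thread):

```
query TT
EXPLAIN ANALYZE SELECT count(*) FROM parquet_table;
----
...metrics=[..., elapsed_compute=<slt:ignore>, ...]
```

The `<slt:ignore>` placeholder lets the test assert the metric is present without pinning its value across machines and runs.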