Conversation
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Merging this PR will degrade performance by 25.07%
Performance Changes
Polar Signals Profiling Results (Latest Run)
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.978x ➖
datafusion / vortex-file-compressed (0.978x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.997x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.008x ➖, 0↑ 1↓)
datafusion / parquet (0.949x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.976x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.121x ❌, 0↑ 6↓)
duckdb / parquet (0.929x ➖, 2↑ 0↓)
Full attributed analysis

File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.149x ❌, 0↑ 18↓)
datafusion / vortex-compact (1.086x ➖, 0↑ 6↓)
datafusion / parquet (1.083x ➖, 0↑ 9↓)
datafusion / arrow (1.089x ➖, 0↑ 5↓)
duckdb / vortex-file-compressed (1.118x ❌, 0↑ 13↓)
duckdb / vortex-compact (1.115x ❌, 0↑ 13↓)
duckdb / parquet (1.041x ➖, 0↑ 2↓)
duckdb / duckdb (1.098x ➖, 0↑ 10↓)
Full attributed analysis

File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.975x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.966x ➖, 1↑ 0↓)
datafusion / parquet (0.977x ➖, 2↑ 2↓)
duckdb / vortex-file-compressed (0.951x ➖, 16↑ 1↓)
duckdb / vortex-compact (0.958x ➖, 5↑ 1↓)
duckdb / parquet (0.971x ➖, 5↑ 1↓)
duckdb / duckdb (0.936x ➖, 18↑ 0↓)
Full attributed analysis

File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.039x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.148x ➖, 0↑ 2↓)
datafusion / parquet (1.237x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (0.999x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.021x ➖, 0↑ 0↓)
duckdb / parquet (1.052x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Random Access
Vortex (geomean): 0.911x ➖
unknown / unknown (0.924x ➖, 15↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.994x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.976x ➖, 0↑ 0↓)
duckdb / parquet (0.980x ➖, 0↑ 0↓)
Full attributed analysis

File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.977x ➖, 3↑ 0↓)
datafusion / parquet (1.004x ➖, 0↑ 0↓)
datafusion / arrow (1.014x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.989x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.996x ➖, 0↑ 0↓)
duckdb / parquet (0.991x ➖, 0↑ 0↓)
duckdb / duckdb (0.988x ➖, 0↑ 0↓)
Full attributed analysis

File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.291x ➖, 0↑ 5↓)
datafusion / vortex-compact (1.002x ➖, 1↑ 2↓)
datafusion / parquet (1.209x ➖, 0↑ 9↓)
duckdb / vortex-file-compressed (1.031x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.027x ➖, 0↑ 0↓)
duckdb / parquet (1.020x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.005x ➖
unknown / unknown (0.996x ➖, 1↑ 3↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.082x ➖, 1↑ 3↓)
datafusion / vortex-compact (1.130x ➖, 0↑ 6↓)
datafusion / parquet (1.099x ➖, 0↑ 5↓)
duckdb / vortex-file-compressed (0.923x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.905x ➖, 0↑ 0↓)
duckdb / parquet (0.908x ➖, 0↑ 0↓)
Full attributed analysis
```rust
/// Merged stats options from all eligible schemes at this compression site.
merged_stats_options: GenerateStatsOptions,
```

```rust
// TODO(connor): Replace this with an `im::Vector`
```
`im::Vector` is a copy-on-write vector from a third-party library.
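As a minimal illustration of the copy-on-write behavior being referred to, here is a sketch using the standard library's `Rc::make_mut` as an analogue (`im::Vector` achieves the same effect structurally, sharing most of its tree between clones instead of copying everything):

```rust
use std::rc::Rc;

/// Returns (original, modified) to show copy-on-write: cloning the handle
/// is cheap, and the underlying data is copied only when a shared handle
/// is written through.
fn copy_on_write_demo() -> (Vec<i32>, Vec<i32>) {
    let a: Rc<Vec<i32>> = Rc::new(vec![1, 2, 3]);
    let mut b = Rc::clone(&a); // cheap: bumps a refcount, no data copy

    // `make_mut` sees the data is shared and clones it before mutating,
    // so `a` is left untouched.
    Rc::make_mut(&mut b).push(4);

    ((*a).clone(), (*b).clone())
}
```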
```rust
if sequence_encode(data.array_as_primitive(), exec_ctx)?.is_none() {
    return Ok(EstimateVerdict::Skip);
}
// TODO(connor): Should we get the actual ratio here?
```
We don't need to if you do the TODO above, since that avoids creating the array.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.227x ❌, 0↑ 42↓)
datafusion / parquet (1.198x ❌, 0↑ 39↓)
duckdb / vortex-file-compressed (1.157x ❌, 1↑ 37↓)
duckdb / parquet (1.136x ❌, 0↑ 32↓)
duckdb / duckdb (1.131x ❌, 0↑ 28↓)
Full attributed analysis

File Sizes: Clickbench on NVME
File Size Changes (3 files changed, -0.0% overall, 2↑ 1↓)
Summary
Tracking issue: #7216
Right now, we have a very naive sampling strategy in the compressor: we run the estimation functions in the order of the `Scheme` declaration, regardless of how expensive each estimation is. This changes the sampling algorithm to first run all of the cheap estimations, and only when all of them fail do we fall back to the more expensive sampling / callbacks.
Edit: the benchmarks show no real change, but if someone later adds a compression scheme that is really slow to estimate, this should lessen its impact.
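The cheap-first ordering described above can be sketched roughly like this (the `Scheme` and `Cost` types and the estimator signature here are illustrative, not the actual Vortex compressor API):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Cost {
    Cheap,
    Expensive,
}

/// A compression scheme with a cost-tagged estimator. The estimator returns
/// an estimated compression ratio, or `None` if the scheme declines.
struct Scheme {
    name: &'static str,
    cost: Cost,
    estimate: fn(&[u8]) -> Option<f64>,
}

/// Try every cheap estimator first; only if none of them produces a
/// candidate do we pay for the expensive estimators.
fn choose_scheme(schemes: &[Scheme], sample: &[u8]) -> Option<(&'static str, f64)> {
    for &phase in &[Cost::Cheap, Cost::Expensive] {
        let best = schemes
            .iter()
            .filter(|s| s.cost == phase)
            .filter_map(|s| (s.estimate)(sample).map(|ratio| (s.name, ratio)))
            .min_by(|a, b| a.1.total_cmp(&b.1));
        if best.is_some() {
            // A cheap scheme worked; never run the expensive phase.
            return best;
        }
    }
    None
}
```

With this shape, a slow estimator only ever runs on inputs where every cheap scheme has already declined, which is exactly the property the Edit above is after.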
Testing
Will run benchmarks.