Add configurable UNION DISTINCT to FILTER rewrite optimization#21075
Add configurable UNION DISTINCT to FILTER rewrite optimization#21075xiedeyantu wants to merge 1 commit intoapache:mainfrom
Conversation
fef8ec7 to
80adf8b
Compare
4b113f8 to
3b11b5d
Compare
|
Hi @alamb , here's another PR related to plan optimization. Do we need it? I'd also like to know what aspects of optimization we accept. |
|
Hi @Dandandan , I noticed that you’re very knowledgeable about SQL optimization. It would be great if you could help me review this! |
|
@xiedeyantu -- can you please file a ticket explaining what usecase you are targeting with this optimization? Your explanation in
Mostly focuses on "what" is changed, not the "why" In terms of usecase I think what is important:
Then for this optimization it would be great to have some benchmark numbers showing that the query of interest iindeed gets faster with this optimization compared than without it |
|
|
||
| query TT | ||
| EXPLAIN SELECT id, name FROM t1 WHERE id = 1 UNION SELECT id, name FROM t1 WHERE id = 2 | ||
| ---- |
There was a problem hiding this comment.
@alamb Here is an example: There is a rule for eliminating UNION under specific conditions. It applies when the branches of the UNION come from the same table and only differ in their WHERE conditions. This rule allows us to avoid an extra table scan — we only need to perform a single combined conditional filter.
There was a problem hiding this comment.
I haven't looked at the proposed implementation but the rewrite can surely help in case of repeated costly union branches (especially when coming from possibly complex data sources powered via TableProvider).
It seems also particularly relevant until #8777 gets addressed (broader scope, CTE materialization), as currently there is no other way to mutualize repeated reads AFAIK.
There was a problem hiding this comment.
Thank you @asolimando , for connecting me to such a valuable discussion. I think the idea in #8777 is excellent, but it seems we might need to perform a union operation once more. It looks like a final conclusion hasn't been reached yet? However, this approach can support UNION ALL. It seems the two might complement each other?
|
I tested the execution times, and in reality, the difference between the two was not significant. My understanding is that the multiple branches within a The test SQL script is as follows: |
|
@alamb I have refined the PR description and added a test scenario; could you please take a look and let me know if you find it valuable? |
|
@alamb If you have a moment, could you please take another look? Thank you! |
|
@comphead I'm not sure if you can help me review this PR? |
|
I would agree with @alamb to create initial ticket stating the problem. PR description is nice but it is a solution whereas ticket is a problem statement and some people could also participate in problem discussion |
@comphead Apologies—I may not have fully understood the process earlier; I thought simply describing the issue within the PR itself (including both the problem and the proposed solution) would suffice. I have now created a new issue #21310 for everyone to discuss. Please take a look and let me know if it looks appropriate. |
|
Thanks @xiedeyantu I'll take a look this week, would be super useful for users and also for regression to have internal microbenchmarks, similar to |
comphead
left a comment
There was a problem hiding this comment.
Thanks @xiedeyantu for this PR, I think we need to run current tests with this optimizer rule on. The easiest way to proceed if you enable the rule by default in this PR so we can run CI and also benchmarks to get some data points and then disable the rule
@comphead Thank you for your detailed review. I will immediately address these issues based on your comments and temporarily set the parameter to |
|
@comphead Could you tell me how to run benchmark? I have changed |
|
run benchmark |
|
Hi @comphead, Supported benchmarks:
Usage: Per-side configuration ( env:
SHARED_SETTING: enabled
baseline:
ref: v45.0.0
env:
DATAFUSION_RUNTIME_MEMORY_LIMIT: 1G
changed:
ref: v46.0.0
env:
DATAFUSION_RUNTIME_MEMORY_LIMIT: 2GFile an issue against this benchmark runner |
|
run benchmark tpch tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing union-filter (056e734) to 9885f4b (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
@alamb Could you please help me rerun the benchmark? |
|
run benchmark tpcds |
|
Hi @xiedeyantu, thanks for the request (#21075 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
|
run benchmark tpcds |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
@haohuaijin Thanks for rerunning the benchmark, I don't know why the benchmark can not be ran. Could you tell me the reason? |
i also do not know the reason... |
Thanks all the same. |
@comphead I've already set the default value of the parameter to true, but I don't know why the benchmark won't run. Can you help me take a look? |
3d746ae to
74692d9
Compare
|
run benchmark tpcds tpch |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing union-filter (74692d9) to 29f1acd (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing union-filter (74692d9) to 29f1acd (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
@comphead It seems some SQL queries were affected. Could you please run the process again to rule out environmental factors? Also, I’d like to confirm: is our goal to prevent all SQL queries from being marked as "slower"? If a few individual queries are flagged as "slower," would we consider merging them? Or, if no SQL is affected, do we still need the "false" parameter for control? |
|
run benchmark tpcds tpch |
|
@xiedeyantu just started the benches bench results are prone to some kind of noise, so usually 1.0x in any direction is considered to be noise. |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing union-filter (74692d9) to 29f1acd (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing union-filter (74692d9) to 29f1acd (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
Benchmark for this request hit the 7200s job deadline before finishing. Benchmarks requested: Kubernetes messageFile an issue against this benchmark runner |
I think neither TPC-DS nor TPC-H may have SQL queries suitable for this specific rule. As long as the rule doesn't slow down the query planning of other SQL in the tests, that should be acceptable. |
|
Agree, it is good there are no performance regressions. Let me go through it one final time. Another thing: let's have the rule disabled by default to avoid optimization false positives, and potential user query degradation. We would need to add description of this change to Generally speaking, going forward we need to document optimization rules and technics in a separate doc, so the users are able to check all optimization rules DataFusion is capable of and also how and when to use/enable/disable rules |
@comphead Thank you for the guidance. Could you let me know a specific location where I can mimic the previous style to add the upgrade documentation you mentioned, and then change the configuration parameters to false. |
Please refer to |
74692d9 to
c046597
Compare
|
The security_audit failed appears unrelated to my PR; the main branch has the same issue. |
5626d11 to
bf17571
Compare
Which issue does this PR close?
Rationale for this change
This PR adds a configurable optimizer rewrite for
UNION DISTINCTqueries. The goal is to allow the optimizer to collapse eligible union branches into a single filtered scan when the branches read from the same source and differ only by filter predicates.This optimization can reduce duplicated work and avoid scanning the same input multiple times. Keeping it behind a configuration flag makes the behavior explicit and safe to enable only when desired.
What changes are included in this PR?
datafusion.optimizer.enable_unions_to_filter, which is disabled by default.UnionsToFilteroptimization rule in the logical optimizer pipeline.datafusion/sqllogictest/test_files/union.sltto cover both the disabled and enabled cases.UNION DISTINCTqueries.Example rewrite
When the rule is enabled, a query such as:
SQL
may be rewritten into an equivalent plan that scans
t1once and applies a combined filter such as:SQL
This keeps the results unchanged while avoiding repeated reads from the same source.
Are these changes tested?
Yes.
The new behavior is covered by sqllogictest cases that validate both plan variants:
UNION DISTINCTexecution path when the option is disabledAre there any user-facing changes?
Yes.
A new configuration option is introduced:
datafusion.optimizer.enable_unions_to_filterWhen enabled, some
UNION DISTINCTqueries may be optimized into a different plan shape. Query results remain the same, but the execution plan may change.