Remove fewer Storage calls in CopyProp and GVN #142531
rust-bors[bot] merged 1 commit into rust-lang:main
Conversation
Some changes occurred to MIR optimizations cc @rust-lang/wg-mir-opt
@bors try @rust-timer queue
Remove fewer Storage calls in `copy_prop`
Modify the `copy_prop` MIR optimization pass to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.
### Details
This is my attempt to fix the mentioned issue (this is the first part, I also implemented a similar solution for GVN in [this branch](https://github.com/rust-lang/rust/compare/master...ohadravid:rust:better-storage-calls-gvn-v2?expand=1)).
The idea is to use the `MaybeStorageDead` analysis and remove only the storage calls of `head`s that are maybe-storage-dead when the associated `local` is accessed (or, conversely, keep the storage of `head`s that are for-sure alive in _every_ relevant access).
When combined with the GVN change, the final example in the issue (#141649 (comment)) is optimized as expected by LLVM. I also measured the effect on a few functions in `rav1d` (where I originally saw the issue) and observed reduced stack usage in several of them.
This is my first attempt at working with MIR optimizations, so it's possible this isn't the right approach — but all tests pass, and the resulting diffs appear correct.
r? tmiasko
since he commented on the issue and pointed to these passes.
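The storage-retention rule described above can be sketched with a toy model. Everything below is a hypothetical stand-in, not rustc's real MIR or dataflow API: a body is a flat statement list (so a single forward scan plays the role of the `MaybeStorageDead` analysis on straight-line code), and `copy_class_head[l]` maps each local to its copy-class head.

```rust
/// Toy MIR statement: storage markers plus "any access to a local".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Stmt {
    StorageLive(usize),
    StorageDead(usize),
    Use(usize),
}

/// For each access to a local, look up its copy-class head and record heads
/// that are maybe-storage-dead at that access. Only those heads need their
/// Storage{Live,Dead} statements removed; heads that are for-sure alive at
/// every relevant access keep their storage markers.
fn storage_to_remove(body: &[Stmt], copy_class_head: &[usize]) -> Vec<bool> {
    let n = copy_class_head.len();
    // Locals start maybe-dead until their StorageLive is seen.
    let mut maybe_dead = vec![true; n];
    let mut remove = vec![false; n];
    for stmt in body {
        match *stmt {
            Stmt::StorageLive(l) => maybe_dead[l] = false,
            Stmt::StorageDead(l) => maybe_dead[l] = true,
            Stmt::Use(l) => {
                let head = copy_class_head[l];
                if maybe_dead[head] {
                    // The head's storage may be dead here, so after unifying
                    // `l` with `head` the storage markers would be wrong:
                    // drop them for this head.
                    remove[head] = true;
                }
            }
        }
    }
    remove
}
```

For example, if local 1 is a copy of head 0 and is used after `StorageDead(0)`, head 0's storage markers must be removed; if every use falls inside head 0's live range, they can stay.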
☀️ Try build successful - checks-actions
Finished benchmarking commit (ef7d206): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never

- Instruction count: our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
- Max RSS (memory usage): primary 0.7%, secondary 3.4%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Cycles: primary -0.6%, secondary -0.1%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Binary size: primary 0.0%, secondary 0.0%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Bootstrap: 757.399s -> 756.065s (-0.18%)
@matthiaskrgr - I updated the impl to stop re-checking once a head is found to be maybe-dead, which should be a bit better
@bors try @rust-timer queue
Should this check happen in …
☀️ Try build successful - checks-actions
I'm not sure how to make this work: using … Is there a different way to do this?
Finished benchmarking commit (c0a2949): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never

- Instruction count: our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
- Max RSS (memory usage): primary -0.1%, secondary -1.3%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Cycles: secondary -1.0%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Binary size: primary -0.0%, secondary 0.0%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Bootstrap: 756.494s -> 757.685s (0.16%)
💔 Test for af9ddc6 failed: CI. Failed job: …
Hi @saethlin, looks like the …
Codegen test annotations support revisions, and the revision name can be used instead of the …
Force-pushed d18b665 to 5632001.
@saethlin done, split checks to …
@bors try jobs=x86_64-msvc-1
Remove fewer Storage calls in CopyProp and GVN

try-job: x86_64-msvc-1
@bors r=tmiasko,cjgillot,saethlin rollup=never
What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR. Comparing 7a38981 (parent) -> 8da2d28 (this PR).

Test differences: 33 test diffs (Stage 1 and Stage 2). Additionally, 2 doctest diffs were found. These are ignored, as they are noisy.

Test dashboard: run

```shell
cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard 8da2d28cbd5a4e2b93e028e709afe09541671663 --output-dir test-dashboard
```

and then open …

Job duration changes: job durations can vary a lot, based on the actual runner instance.
Finished benchmarking commit (8da2d28): comparison URL.

Overall result: ❌✅ regressions and improvements - please read: our benchmarks found a performance regression caused by this PR. Next Steps: @rustbot label: +perf-regression

- Instruction count: our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
- Max RSS (memory usage): primary 0.4%, secondary -5.0%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Cycles: primary 9.5%, secondary 5.2%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Binary size: primary 0.4%, secondary 0.7%. A less reliable metric. May be of interest, but not used to determine the overall result above.
- Bootstrap: 492.147s -> 493.313s (0.24%)
I think the opt regressions are expected since we do more work, but the debug/check results are bad, and I think that's because I introduced some unintended overhead even in non-opt builds, when the full analysis doesn't run:

```rust
// in copy_prop
let mut storage_to_remove = DenseBitSet::new_empty(body.local_decls.len()); // extra DenseBitSet
..
for (local, &head) in ssa.copy_classes().iter_enumerated() {
    storage_to_remove.insert(head); // + filling it
}
Replacer { tcx, copy_classes: ssa.copy_classes(), unified, storage_to_remove }
// vs before:
Replacer { tcx, copy_classes: ssa.copy_classes(), unified }

// in gvn
let storage_to_remove = state.reused_locals.clone(); // extra clone
StorageRemover { tcx, reused_locals: state.reused_locals, storage_to_remove }
// vs before:
StorageRemover { tcx, reused_locals: state.reused_locals }
```

I should be able to fix this by switching to …
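One way to avoid that unconditional per-body cost (a sketch only, not necessarily the fix the author had in mind, which is cut off above) is to make the removal set optional and only materialize it when some head actually needs its storage removed. `Vec<bool>` stands in for rustc's `DenseBitSet` here, and the types are simplified stand-ins:

```rust
/// Simplified stand-in for the pass's replacer; only the field relevant to
/// the overhead discussion is modeled.
struct Replacer {
    storage_to_remove: Option<Vec<bool>>, // None => remove nothing
}

impl Replacer {
    fn should_remove(&self, local: usize) -> bool {
        self.storage_to_remove
            .as_ref()
            .map_or(false, |set| set[local])
    }
}

/// Only allocate and fill the set when some head is maybe-dead; the common
/// case (nothing to remove) pays no allocation and no fill loop.
fn build_storage_to_remove(
    n_locals: usize,
    maybe_dead_heads: &[usize],
) -> Option<Vec<bool>> {
    if maybe_dead_heads.is_empty() {
        return None;
    }
    let mut set = vec![false; n_locals];
    for &head in maybe_dead_heads {
        set[head] = true;
    }
    Some(set)
}
```

The point of the `Option` is that bodies where the analysis finds nothing to remove (the overwhelming majority in debug/check builds) take the `None` path and skip both the allocation and the fill loop.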
The regressions are nearly all in opt, and of those that are in debug, the query breakdown doesn't point at CopyProp as the cause. Opt regressions of this magnitude are rather surprising.
In the perf report it looks like we added CGUs. Are StorageLive/Dead included in the cgu size estimate?
Looks like it?

```rust
// compiler/rustc_monomorphize/src/partitioning.rs
providers.queries.size_estimate = |tcx, instance| {
    match instance.def {
        // "Normal" functions size estimate: the number of
        // statements, plus one for the terminator.
        InstanceKind::Item(..)
        | InstanceKind::DropGlue(..)
        | InstanceKind::AsyncDropGlueCtorShim(..) => {
            let mir = tcx.instance_mir(instance.def);
            mir.basic_blocks.iter().map(|bb| bb.statements.len() + 1).sum()
        }
        // Other compiler-generated shims size estimate: 1
        _ => 1,
    }
};
```

Do you think this throws off the CGU calculation and causes the opt regressions? See results in #155491 (comment)
Some of them, yes. The key is that in an optimized build, some items will get InstantiationMode::LocalCopy, which puts a copy of the item into every CGU that references it. So in general, more CGUs means more items that get optimized more times. The minimum number of instructions executed to compile the program would probably be at 1 CGU, but that would significantly compromise wall time.

Debug builds can suffer a slightly different problem, where CGUs get merged based on the CGU size estimate. If the item that's dirtied by the patch applied by the benchmark suite is in a very tiny module, but the CGU merging process adds it to a larger module, then the compile time of that incr-patched scenario is driven by whatever else is in the CGU that the patched code is merged into.

I'm starting a Zulip topic on this: #t-compiler/performance > Perf regression from retaining more StorageLive/StorageDead
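The interaction described above can be illustrated with a toy model: since the size estimate is just a statement count, retained StorageLive/StorageDead statements make items look bigger, which can change which CGUs get merged. The merge rule here (repeatedly fuse the two smallest CGUs until a target count remains) is a deliberate simplification, not rustc's actual partitioning algorithm:

```rust
/// Each CGU is (item names, size estimate). Repeatedly fuse the two smallest
/// CGUs until only `target` remain - a crude model of size-driven merging.
fn merge_smallest(
    mut cgus: Vec<(Vec<&'static str>, usize)>,
    target: usize,
) -> Vec<(Vec<&'static str>, usize)> {
    while cgus.len() > target {
        // Sort descending by size so the two smallest are at the end.
        cgus.sort_by_key(|&(_, size)| std::cmp::Reverse(size));
        let (mut a_items, a_size) = cgus.pop().unwrap();
        let (b_items, b_size) = cgus.pop().unwrap();
        a_items.extend(b_items);
        cgus.push((a_items, a_size + b_size));
    }
    cgus
}
```

With hypothetical modules `a` (size 10), `b` (size 3), and `c` (size 4), merging down to two CGUs fuses `b` and `c`. If `b`'s estimate is inflated to 12 (e.g. by extra storage statements), the same merge instead fuses `a` and `c`, so the patched code ends up grouped with different neighbors.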
Modify the CopyProp and GVN MIR optimization passes to remove fewer `Storage{Live,Dead}` calls, allowing for better optimizations by LLVM - see #141649.

Details

The idea is to use a new `MaybeUninitializedLocals` analysis and remove only the storage calls of locals that are maybe-uninit when accessed in a new location.