Hoist loop-variant affine index ops out of scf.for during codegen #1283
Open
willghatch wants to merge 6 commits into main from
Conversation
When dimensions are dynamic, the linearized index expressions passed to affine.apply contain the loop induction variable, preventing LICM from moving them out of the loop body. Split each such expression into `base + iv * stride`, generate the base before the loop with gen_sympy_index_hoisted, and emit the IV-dependent part as plain arith ops that downstream LICM can reason about independently. The key test is `lit_tests/kernel/wave/linearize_loop_affine_maps.py`, which shows that no `affine.apply` ops occur inside a loop body.

Key changes:

- Add gen_sympy_index_hoisted (emitter.py): decomposes a sympy expression into a loop-invariant base (hoisted) and an IV-scaled offset (emitted in-place), with reconstruction guards that fall back to the original gen_sympy_index when the split is not provably safe.
- Refactor read_write.py: extract _get_enclosing_scf_for (walks parent regions), _hoist_before_loop, _iv_context, _gen_linear_index_offset, _hoist_dst_indices, and _linearize_memref_hoisted to replace ad-hoc hoisting scattered across _handle_read_linear_index and handle_gather_to_lds.
- Rework _build_mask to lower each bound condition individually with hoisting support, replacing the previous functools.reduce over a single sympy And expression.
- Add linearize_loop_affine_maps.py lit test verifying no affine.apply ops remain inside the pipelined scf.for body for two wave shapes.
- Update CHECK patterns in 5 existing lit tests to reflect the new code shape (fewer affine_map parameters, hoisted base ops before the loop, arith.muli/addi inside the loop).

Made-with: Cursor
Signed-off-by: William G Hatch <william@hatch.uno>
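The `base + iv * stride` split described above can be illustrated with a small sympy sketch. This is a hedged illustration, not the actual emitter code: the symbols `iv`, `m`, `k`, `K` and the substitution-based probing are assumptions standing in for gen_sympy_index_hoisted's real decomposition logic.

```python
import sympy

# iv is the loop induction variable; m, k, K are loop-invariant values.
iv, m, k, K = sympy.symbols("iv m k K", integer=True, nonnegative=True)

# A linearized index mixing loop-invariant terms with the induction variable.
expr = m * K + k + iv * 64

# Probe the expression at iv = 0 and iv = 1 to recover base and stride.
base = expr.subs(iv, 0)                       # loop-invariant part: m*K + k
stride = sympy.simplify(expr.subs(iv, 1) - base)  # coefficient of iv: 64

# Reconstruction guard: the split is only safe if the expression is affine
# in iv, i.e. expr == base + iv * stride identically. With floor/Mod terms
# this check can fail, which is when a fallback to the unsplit path is needed.
assert sympy.simplify(expr - (base + iv * stride)) == 0
```

The `base` computation can then be emitted once before the scf.for, while `iv * stride` becomes a plain arith.muli/arith.addi pair inside the loop body.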
Add iv_offset caching to deduplicate iv*stride products across reads that share the same stride coefficient within a loop body. Add gen_sympy_index_no_affine for computing expressions in-place using arith ops (no affine.apply) inside loop bodies. Use no-affine mode for the guard condition in _compute_branchless_valid_bytes. Also add a use_affine parameter to gen_sympy_index for local override of the global affine expression mode.

Made-with: Cursor
Signed-off-by: William G Hatch <william@hatch.uno>
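The stride-keyed deduplication can be sketched as below. The cache structure and names (`IVOffsetCache`, the string-tagged "emitted op") are illustrative assumptions, not the actual implementation; the point is that two reads sharing a stride coefficient should reuse one iv*stride product.

```python
class IVOffsetCache:
    """Toy model of iv_offset caching: one muli per distinct stride."""

    def __init__(self):
        self._cache = {}   # stride coefficient -> previously emitted value
        self.emitted = 0   # count of arith.muli ops actually generated

    def get(self, stride):
        if stride not in self._cache:
            self.emitted += 1
            # Stand-in for emitting `arith.muli %iv, %c<stride>` in the loop.
            self._cache[stride] = f"%iv_times_{stride}"
        return self._cache[stride]

cache = IVOffsetCache()
a = cache.get(64)   # first read with stride 64: emits a muli
b = cache.get(64)   # second read, same stride: reuses the cached value
c = cache.get(128)  # different stride: emits a second muli
assert a is b and cache.emitted == 2
```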
ftynse (Contributor) reviewed on Apr 9, 2026 and left a comment:
Arguably, this should just be an "advanced LICM" MLIR pass rather than extra ad hoc logic in whatever emits MLIR. It is also unclear to me what the value of having affine.apply is to start with; expression simplification should already have happened sufficiently in sympy.
willghatch (Contributor, Author) replied:

This approach was mostly taken as an expedient quick path to getting the mxfp4 assembly output that we wanted, working around limitations of our LICM pass. I agree that it is not the optimal way to do this.
_verify_stride_on_original was rejecting valid stride decompositions for B-data and B-scale reads because it probed with arbitrary values that violated divisibility assumptions (e.g. K not a multiple of 256). This made floor/Mod terms produce different results per IV step, causing the verifier to think the stride was non-constant.

Pass div_fwd substitutions into the verifier so probed values always satisfy the constraints. This eliminates all residual affine.apply ops for B-data and B-scale index offsets in the loop body (12 per test for 1x4, 28 for 2x2). The only remaining affine.apply in the loop body is a single bounds-check computation unrelated to data/scale index offsets.

Made-with: Cursor
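The failure mode can be reproduced in a few lines of sympy. This is a hedged sketch: the expression `floor(iv*K/256)` and the probe values are illustrative stand-ins for what the verifier sees, not its actual inputs.

```python
import sympy

iv, K = sympy.symbols("iv K", integer=True, nonnegative=True)

# A floor term of the kind that appears in the linearized index.
expr = sympy.floor(iv * K / 256)

def per_iv_deltas(k_val, steps=3):
    """Per-step differences of the expression for a concrete K."""
    e = expr.subs(K, k_val)
    return [e.subs(iv, i + 1) - e.subs(iv, i) for i in range(steps)]

# Arbitrary probe (K = 100, not a multiple of 256): the per-IV delta is not
# constant, so a naive verifier concludes the stride is non-constant.
assert len(set(per_iv_deltas(100))) > 1

# Probe that satisfies the divisibility constraint (K = 512 = 2 * 256):
# every step advances by exactly 2, i.e. a constant stride.
assert set(per_iv_deltas(512)) == {2}
```

Threading the div_fwd substitutions into the probe corresponds to only ever evaluating with constraint-satisfying values like the second case.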
Force-pushed from bd8d704 to eb5b1f8.
The affine-apply hoisting optimization reduces register counts with a locally-built LLVM, but CI's pinned LLVM version produces different register allocation. The water backend waitcounts changed as expected, but register counts remain at main-branch levels (172 VGPRs, 61 SGPRs) in CI.

Change Details:

- Revert water backend vgpr_count from 164 back to 172 and sgpr_count from 57 back to 61 to match CI-observed values

Made-with: Cursor
Change Details:

- Skip _build_mask and arith_d.select on gather_to_lds LINEAR_INDEX and N-D paths when valid_bytes_override is set from g2s_guard, so hardware num_records clamping handles OOB and VGPR pressure drops.
- Align no_masked_load_store_ops with ogsplit for buffer ops + asm backend so masked loads are allowed and lowering can use hardware OOB where appropriate.

Made-with: Cursor
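Why dropping the select is sound here can be modeled abstractly. The sketch below is a toy equivalence argument, not the real lowering: `masked_load` stands in for the software mask + arith_d.select path, and `buffer_load` for a buffer access whose descriptor num_records clamps out-of-bounds lanes to zero in hardware.

```python
def masked_load(data, idx, valid):
    # Software path: explicit bounds mask plus a select of a zero fill value.
    return data[idx] if idx < valid else 0

def buffer_load(data, idx, num_records):
    # Hardware path: the buffer descriptor's num_records field clamps
    # out-of-bounds accesses, which read back as zero.
    return data[idx] if idx < num_records else 0

# With num_records set to the valid extent, both paths agree on every lane,
# so the mask/select can be skipped without changing observable results.
data = list(range(16))
valid = 10
assert all(masked_load(data, i, valid) == buffer_load(data, i, valid)
           for i in range(16))
```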
harsh-nod approved these changes on Apr 16, 2026.