fix(codegen): Reuse matmul acc buffer #1216

Open
lwDavid wants to merge 1 commit into hw-native-sys:main from lwDavid:issue-1213-matmul-acc-buffer

Conversation


@lwDavid lwDavid commented Apr 29, 2026

Fixes #1213

Lower in-place tile accumulator ops with the accumulator SSA as both ins and outs, and keep the assignment result bound to that same buffer so the final store reads the updated accumulator.
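
A minimal sketch of the lowering described above, assuming a toy emitter rather than the real PTO codegen classes. `Emitter`, `lower_matmul_acc`, and `lower_store` are hypothetical names, and the emitted op text is simplified (no types):

```python
from dataclasses import dataclass, field


@dataclass
class Emitter:
    """Toy stand-in for a codegen context; not the real PTO codegen class."""

    lines: list = field(default_factory=list)
    var_to_ssa: dict = field(default_factory=dict)

    def lower_matmul_acc(self, result_var, acc_var, lhs_var, rhs_var):
        acc = self.var_to_ssa[acc_var]
        lhs = self.var_to_ssa[lhs_var]
        rhs = self.var_to_ssa[rhs_var]
        # The accumulator SSA appears in both ins and outs (in-place update).
        self.lines.append(f"pto.tmatmul.acc ins({acc}, {lhs}, {rhs}) outs({acc})")
        # Bind the assignment result to the same buffer so later readers
        # (for example the final store) see the updated accumulator.
        self.var_to_ssa[result_var] = acc

    def lower_store(self, src_var):
        self.lines.append(f"pto.tstore ins({self.var_to_ssa[src_var]})")


em = Emitter(var_to_ssa={"acc0": "%acc", "a": "%a", "b": "%b"})
em.lower_matmul_acc("acc1", "acc0", "a", "b")
em.lower_store("acc1")
print("\n".join(em.lines))
```

In this sketch the store reads `%acc` because `acc1` was rebound to the accumulator's SSA instead of receiving a fresh buffer.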

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Ensure in-place accumulator codegen uses the accumulator operand's tile-buffer SSA and type for tile.matmul_acc / tile.gemv_acc, bind the assignment result to that SSA, and prevent duplicate pto.alloc_tile emissions; add unit and runtime tests validating accumulator SSA preservation across loop-carried matmul_acc.
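
The allocation deduplication mentioned above can be sketched as follows. This is an illustration only: `emit_tile_allocs` and its input shape are assumptions, and the local set `emitted_ssas` merely plays the role the walkthrough assigns to `emitted_tile_alloc_ssas`:

```python
def emit_tile_allocs(bindings):
    """Emit one alloc line per distinct SSA name.

    bindings: list of (var_name, ssa_name) pairs, in emission order.
    Several IR variables may be bound to the same SSA name.
    """
    emitted_ssas = set()  # SSA names that already have an alloc line
    lines = []
    for var_name, ssa in bindings:
        if ssa in emitted_ssas:
            # A second variable sharing the buffer must not re-allocate it.
            continue
        emitted_ssas.add(ssa)
        lines.append(f"{ssa} = pto.alloc_tile  // first bound by {var_name}")
    return lines


allocs = emit_tile_allocs(
    [("acc_init", "%acc"), ("acc_next", "%acc"), ("lhs", "%lhs")]
)
print("\n".join(allocs))
```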

Changes

| Cohort / File(s) | Summary |
|---|---|
| Backend in-place accumulation codegen: src/backend/common/pto_ops_common.cpp | Extract accumulator destination SSA/type from the IR accumulator input (op->args_[0]), validate presence of a tile-buffer SSA, call codegen.SetCurrentResultBuf(dst) to bind result to that SSA, and emit pto.tmatmul.acc/pto.tgemv.acc with matching ins(acc, ...) and outs(acc). |
| PTO codegen allocation & SSA reuse: src/codegen/pto/pto_codegen.cpp, include/pypto/codegen/pto/pto_codegen.h | Suppress automatic tile allocation for in-place accumulator destinations during AssignStmt handling; add conditional reuse of tile-buffer SSA for statically-shaped operand MemRefs keyed by base MemRef + extents; add emitted_tile_alloc_ssas to track and deduplicate emitted pto.alloc_tile by SSA. |
| Unit & runtime tests: tests/ut/codegen/test_pto_codegen.py, tests/st/runtime/test_matmul.py | New MLIR helper for single-function output and tests: unit test checks SSA consistency/reuse across pto.tmatmul → pto.tmatmul.acc → pto.tstore; runtime regression adds matmul_acc case with b_trans=True splitting K to validate numerical correctness on a2a3 (FP32/BF16). |
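
The arithmetic behind the split-K regression can be sketched in pure Python, assuming nested lists in place of tiles. `matmul_bt` and `matmul_bt_split_k` are hypothetical helper names, not functions from this repository:

```python
def matmul_bt(a, b, k_start, k_end):
    """Partial product a[:, k_start:k_end] @ b[:, k_start:k_end].T."""
    return [
        [sum(a[i][k] * b[j][k] for k in range(k_start, k_end)) for j in range(len(b))]
        for i in range(len(a))
    ]


def matmul_bt_split_k(a, b, k_tile):
    """Compute a @ b.T by accumulating K-chunks into one buffer."""
    k_total = len(a[0])
    # First chunk plays the role of the initial matmul.
    acc = matmul_bt(a, b, 0, min(k_tile, k_total))
    # Remaining chunks play the role of matmul_acc: same buffer in and out.
    for k0 in range(k_tile, k_total, k_tile):
        part = matmul_bt(a, b, k0, min(k0 + k_tile, k_total))
        for i, row in enumerate(part):
            for j, value in enumerate(row):
                acc[i][j] += value
    return acc


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
full = matmul_bt(a, b, 0, 4)
split = matmul_bt_split_k(a, b, k_tile=2)
print(full == split)
```

If each chunk wrote to a fresh buffer and the final result were read from the first one, only the first K-chunk would contribute, which is the kind of mismatch a regression of this shape is meant to catch.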

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • lyfne123
  • zhangqi-chen

Poem

🐰 I hop through SSA fields at dawn,
I bind the acc where it was drawn,
No duped tiles to crowd the way,
Loop-carried buffers save the day;
Tests nibble green — the numbers sing! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(codegen): Reuse matmul acc buffer' directly and concisely summarizes the main change: fixing the codegen to properly reuse the accumulator buffer in matmul operations. |
| Description check | ✅ Passed | The description explains the fix (using accumulator SSA as both ins and outs) and its purpose (ensuring the final store reads the updated accumulator), and references issue #1213 which matches the problem addressed in the PR. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #1213 by fixing accumulator buffer handling in tile.matmul_acc and tile.gemv_acc codegen. Changes to pto_ops_common.cpp and pto_codegen.cpp implement proper accumulator SSA reuse, and new tests verify the fix works correctly. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on fixing the matmul_acc accumulator buffer issue: internal codegen logic, accumulator SSA tracking, tile allocation deduplication, and related tests. No extraneous changes detected. |


@lwDavid lwDavid self-assigned this Apr 29, 2026
@lwDavid lwDavid added the bug Something isn't working label Apr 29, 2026
@lwDavid lwDavid moved this to In Progress in pto project Apr 29, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the PTO codegen to handle in-place accumulation operations (matmul_acc and gemv_acc) by reusing the accumulator's SSA value for both input and output. It updates make_acc_codegen to bind the result to the input accumulator's buffer and modifies the assignment visitor to skip redundant tile allocations for these operations. A new unit test verifies that loop-carried accumulators correctly share buffers in the generated MLIR. I have no feedback to provide.


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/ut/codegen/test_pto_codegen.py (1)

1565-1605: Please add the tile.gemv_acc sibling case too.

This test locks down the tile.matmul_acc fix well, but the production change also updates tile.gemv_acc through the same custom codegen and alloc-suppression path. A small parametrized variant here would keep that sibling path from regressing silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ut/codegen/test_pto_codegen.py` around lines 1565 - 1605, Extend the
existing test_pto_codegen_matmul_acc_uses_loop_carried_accumulator_buffer to
also cover the sibling gemv path: add a parametrized variant (or a second
similar test) that uses pl.gemv_acc instead of pl.matmul_acc (and initial accum
via pl.gemv or matching matvec op), then generate MLIR and assert the
accumulator buffer is loop-carried and used by the final store by searching for
"pto.tgemv.acc" (parallel to the existing "pto.tmatmul.acc") and verifying the
same ins/outs operand identity checks as done for matmul_acc; update the MLIR
line-match regexes to look for "pto.tgemv.acc" and the store accordingly so the
gemv_acc codegen path is locked down.
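
One way to express the operand identity check for both ops is a small text-level helper. This is a sketch under assumptions: the op line shape `ins(%x, ... : types) outs(%y : types)` is guessed rather than taken from the real MLIR output, the pattern assumes the type lists contain no parentheses, and `acc_ops_are_in_place` is a hypothetical name:

```python
import re

# Captures: op family, first ins operand, first outs operand.
ACC_OP_RE = re.compile(
    r"pto\.(tmatmul|tgemv)\.acc\s+ins\((%[\w.]+)[^)]*\)\s+outs\((%[\w.]+)[^)]*\)"
)


def acc_ops_are_in_place(mlir_text, op_family):
    """True if at least one <op_family>.acc op exists and every one of them
    uses the same SSA as its first ins operand and its outs operand."""
    matches = [m for m in ACC_OP_RE.finditer(mlir_text) if m.group(1) == op_family]
    return bool(matches) and all(m.group(2) == m.group(3) for m in matches)


good = "pto.tgemv.acc ins(%acc, %a, %b : !t, !t, !t) outs(%acc : !t)"
bad = "pto.tgemv.acc ins(%acc, %a, %b : !t, !t, !t) outs(%tmp : !t)"
print(acc_ops_are_in_place(good, "tgemv"), acc_ops_are_in_place(bad, "tgemv"))
```

A parametrized test could call the helper once with `"tmatmul"` and once with `"tgemv"`, so both in-place paths share the same assertion.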

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 899cae86-5ec2-4643-829a-9fcf7fe00523

📥 Commits

Reviewing files that changed from the base of the PR and between e2c409d and 42efdf7.

📒 Files selected for processing (3)
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/pto/pto_codegen.cpp
  • tests/ut/codegen/test_pto_codegen.py

@lwDavid lwDavid force-pushed the issue-1213-matmul-acc-buffer branch 2 times, most recently from 93df1be to 5a1bf01 Compare April 29, 2026 09:03
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/codegen/pto/pto_codegen.cpp (1)

69-71: Add coverage for tile.gemv_acc as well.

This helper opts tile.gemv_acc into the same in-place accumulator path as tile.matmul_acc, but the new regressions only assert the matmul branch. A small MLIR test that checks pto.tgemv.acc keeps ins(acc) and outs(acc) on the same SSA would keep the sibling path from drifting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/pto/pto_codegen.cpp` around lines 69 - 71, the helper
IsInPlaceAccumulatorCall opts both "tile.matmul_acc" and "tile.gemv_acc" into
the in-place accumulator path, but the new regression tests only assert the
matmul branch; add a small MLIR test that generates a tile.gemv_acc kernel and
checks that "pto.tgemv.acc" keeps ins(acc) and outs(acc) on the same SSA, so the
gemv branch cannot drift from the matmul one. Leave the helper's existing guards
for null call and op_ unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around line 244-267: The reuse key currently collapses distinct sub-views by
using memref->base_ plus the tile signature (via GetStaticMatmulOperandReuseKey)
but ignores per-view memref->byte_offset_, causing different views into the same
allocation to be aliased; update the key used by matmul_operand_reuse to include
memref->byte_offset_ (e.g., use std::make_pair(std::tuple(base_ptr,
memref->byte_offset_), *reuse_key) or otherwise append byte_offset to the key)
and also adjust the alloc_tile emission/suppression logic that skips emitting a
second pto.alloc_tile when a reuse hit occurs (the code around the NewNamedTemp/
matmul_operand_reuse lookup and the later alloc_tile suppression) so that
suppression only happens when base and byte_offset (and signature) truly match.
Ensure you reference/modify GetStaticMatmulOperandReuseKey usage,
matmul_operand_reuse, memref->base_, memref->byte_offset_, and the alloc_tile
emission path to apply the fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 506e27c2-395c-402f-9e73-2689f9bca268

📥 Commits

Reviewing files that changed from the base of the PR and between 42efdf7 and 5a1bf01.

📒 Files selected for processing (5)
  • include/pypto/codegen/pto/pto_codegen.h
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/pto/pto_codegen.cpp
  • tests/st/runtime/test_matmul.py
  • tests/ut/codegen/test_pto_codegen.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/backend/common/pto_ops_common.cpp

Comment on lines +244 to +267
std::map<std::pair<const ir::Var*, std::string>, std::string> matmul_operand_reuse;
for (const auto& [tile_var, tile_type] : fs_.tile_var_allocs) {
  std::string ssa_name = NewNamedTemp(tile_var->name_hint_);
  BindVarToMlir(tile_var, ssa_name);

  // Pre-populate type so body visitors (e.g., tile.reshape no-op check)
  // can query it before per-variable alloc_tile emission runs. Tile types
  // are always emitted with `v_row=?, v_col=?`; the actual extents flow
  // through the alloc_tile valid_row/valid_col operands.
  std::string type_str = GetTileBufTypeStringFromTileType(tile_type);
  fs_.ssa_to_tile_buf_type[ssa_name] = type_str;

  auto memref = ir::GetDefinedMemRef(tile_type);
  const ir::Var* base_ptr = memref->base_.get();

  std::string ssa_name;
  auto reuse_key = GetStaticMatmulOperandReuseKey(tile_type, type_str);
  if (reuse_key.has_value() && fs_.tpop_result_vars.count(tile_var.get()) == 0) {
    auto key = std::make_pair(base_ptr, *reuse_key);
    auto reuse_it = matmul_operand_reuse.find(key);
    if (reuse_it != matmul_operand_reuse.end()) {
      ssa_name = reuse_it->second;
    } else {
      ssa_name = NewNamedTemp(tile_var->name_hint_);
      matmul_operand_reuse.emplace(std::move(key), ssa_name);
    }
  } else {
    ssa_name = NewNamedTemp(tile_var->name_hint_);
  }
⚠️ Potential issue | 🟠 Major

Reuse key collapses distinct sub-buffers that only share the same allocation base.

Line 252 keys matmul_operand_reuse by memref->base_ plus the tile signature, but not by memref->byte_offset_. In this IR model, multiple views can intentionally share the same base_ while still pointing at different regions. Once that happens, Lines 718-720 suppress the second pto.alloc_tile, so both operands end up bound to the first address and later matmul/gemv ops can read the wrong L0A/L0B buffer.

💡 Suggested fix
-  std::map<std::pair<const ir::Var*, std::string>, std::string> matmul_operand_reuse;
+  std::map<std::tuple<const ir::Var*, int64_t, std::string>, std::string> matmul_operand_reuse;
   for (const auto& [tile_var, tile_type] : fs_.tile_var_allocs) {
     std::string type_str = GetTileBufTypeStringFromTileType(tile_type);
     auto memref = ir::GetDefinedMemRef(tile_type);
     const ir::Var* base_ptr = memref->base_.get();
+    auto const_offset = As<ir::ConstInt>(memref->byte_offset_);
+    INTERNAL_CHECK_SPAN(const_offset != nullptr, tile_var->span_)
+        << "Expected static on-chip byte_offset for matmul operand reuse";

     std::string ssa_name;
     auto reuse_key = GetStaticMatmulOperandReuseKey(tile_type, type_str);
     if (reuse_key.has_value() && fs_.tpop_result_vars.count(tile_var.get()) == 0) {
-      auto key = std::make_pair(base_ptr, *reuse_key);
+      auto key = std::make_tuple(base_ptr, const_offset->value_, *reuse_key);
       auto reuse_it = matmul_operand_reuse.find(key);
       if (reuse_it != matmul_operand_reuse.end()) {
         ssa_name = reuse_it->second;
       } else {
         ssa_name = NewNamedTemp(tile_var->name_hint_);

Based on learnings, this codebase uses MemRef::base_ as allocation identity while preserving per-view byte_offset_/size_ separately after memory reuse.

Also applies to: 718-720
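
The aliasing concern can be illustrated with a small model of the reuse map. This is a sketch, not the reviewed C++ code: `assign_ssa_names`, the tuple layout, and the sample offsets and signatures are all made up for illustration:

```python
def assign_ssa_names(tiles, include_offset):
    """Assign an SSA name per tile variable, reusing names on key hits.

    tiles: list of (var_name, base, byte_offset, signature) tuples.
    include_offset: whether byte_offset participates in the reuse key.
    """
    reuse = {}
    names = {}
    for var_name, base, byte_offset, signature in tiles:
        if include_offset:
            key = (base, byte_offset, signature)
        else:
            key = (base, signature)
        # setdefault returns the existing name on a hit, else stores a new one.
        names[var_name] = reuse.setdefault(key, f"%{var_name}")
    return names


tiles = [
    ("lhs0", "buf", 0, "16x64xf32"),
    ("lhs1", "buf", 4096, "16x64xf32"),  # same base, different region
    ("lhs0_again", "buf", 0, "16x64xf32"),  # genuinely the same view
]
print(assign_ssa_names(tiles, include_offset=False))
print(assign_ssa_names(tiles, include_offset=True))
```

Without the offset in the key, `lhs1` is bound to the SSA of `lhs0` even though it points at a different region; with it, only the genuinely identical view is reused.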


@lwDavid lwDavid force-pushed the issue-1213-matmul-acc-buffer branch from 5a1bf01 to d4c6cf9 Compare April 29, 2026 09:50

Labels

bug Something isn't working

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[Bug] a2a3 FP32 matmul with b_trans=True gives wrong results for [16,16384] x [32,16384].T

1 participant