Perf(spmd_paged_attention): prefetch K/V and reuse block ids in AIC#704
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 30, 2026
Conversation
Code Review
This pull request introduces a pre-fetching mechanism in the aic_qk_step and aic_pv_step kernels to overlap data loading with computation. By passing block loading status and next block IDs, the implementation aims to improve pipeline efficiency. Feedback suggests optimizing the loop in aic_process_blocks by carrying the next block ID forward to avoid redundant global memory reads from the block table array.
Force-pushed from 81de7e4 to 68760aa
- AIC QK: issue the TLOAD for the next block's K before the current QK's MTE1->M sync, so the next-block fetch overlaps the current step's TMOV/TMATMUL/TPUSH path.
- AIC PV: apply the same prefetch pattern to V in the PV step.
- Hoist bt[bt_offset + i] reads to aic_process_blocks and pass the resolved kv_block_id into the step helpers; reuse prev_block_id for PV[i-1] so each block id is read once, not twice.
Force-pushed from 68760aa to 3ef6602
ChaoWao approved these changes on Apr 30, 2026
chenshengxin2026 added a commit to chenshengxin2026/simpler that referenced this pull request on Apr 30, 2026
PR hw-native-sys#704's K/V prefetch reused EVENT_ID0/EVENT_ID1 across both ping-pong L1 slots and across the QK/PV steps. Under multi-round stress this let an MTE2 TLOAD into slot (i+1) race against MTE1 reading slot i, producing intermittent precision failures in tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention.

- Use per-slot ping-pong events for the QK/PV L1 (MTE1<->MTE2) and L0 (MTE1<->M) buffers, so each slot has its own RAW/WAR sync and the prefetch for slot (i+1) cannot collide with the matmul of slot i.
- Drain TPUSH(sij) via PIPE_FIX -> PIPE_S before record(), so AIV TPOP(sij) only observes a fully written GM FIFO.
- Move the next-block TLOAD in the QK step to after record() -- the earliest safe point that avoids reintroducing a coarse PIPE_ALL barrier.

Verified: 20-round stress plus the standard precision pass on a2a3.
Summary
Performance optimization for tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention, targeting the AIC pipeline in paged_attention_parallel.cpp.

- Issue the TLOAD for the next block's K before the current QK's MTE1 -> M sync, so the next-block fetch overlaps the current step's TMOV/TMATMUL/TPUSH path on PIPE_MTE2.
- Hoist bt[bt_offset + i] reads up to aic_process_blocks and pass the resolved kv_block_id into the step helpers. Each block id is read from bt[] once and reused: prev_block_id feeds PV[i-1], the current block_id feeds QK[i], and a one-step next_block_id lookahead drives the prefetch.
- The step helpers gain three optional flags: current_loaded (skip the inline TLOAD because the previous step already prefetched it), has_next (issue the lookahead TLOAD), and next_kv_block_id. The prologue / steady state / epilogue in aic_process_blocks set these flags; for n_blocks == 1 there is no prefetch, just a single QK + PV with their own loads.
- No semantic / numerical change: the loaded data and matmul order are identical; only the issue timing of the next TLOAD shifts earlier within the same step.

Performance
Measured via tools/benchmark_rounds.sh on hardware.

Before (baseline on upstream/main):
After (this PR):
Delta:
Testing
pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3sim
pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3 --device <range>