Perf(spmd_paged_attention): prefetch K/V and reuse block ids in AIC #704

Merged

ChaoWao merged 1 commit into hw-native-sys:main from chenshengxin2026:feat/spmd-pa-aic-kv-prefetch on Apr 30, 2026

Conversation

@chenshengxin2026 (Contributor) commented Apr 30, 2026

Summary

Performance optimization for tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention, targeting the AIC pipeline in paged_attention_parallel.cpp.

  • AIC QK prefetch: issue TLOAD for the next block's K before the current QK's MTE1 -> M sync, so the next-block fetch overlaps the current step's TMOV / TMATMUL / TPUSH path on PIPE_MTE2.
  • AIC PV prefetch: same prefetch pattern for V in the PV step (uses EVENT_ID1 to avoid aliasing with QK's EVENT_ID0).
  • Reuse already-read block ids: hoist bt[bt_offset + i] reads up to aic_process_blocks and pass the resolved kv_block_id into the step helpers. Each block id is read from bt[] once and reused — prev_block_id feeds PV[i-1], current block_id feeds QK[i], and a one-step next_block_id lookahead drives the prefetch.
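The block-id hoisting can be modeled host-side as follows. This is a minimal illustration, not the kernel's actual API: `resolve_block_ids` and `read_bt` are hypothetical stand-ins, and the instrumentation exists only to make the "read once" property checkable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts bt[] accesses so the "each block id is read once" claim is checkable.
static int bt_reads = 0;

static int32_t read_bt(const std::vector<int32_t>& bt, int idx) {
    ++bt_reads;
    return bt[idx];
}

// Hypothetical host-side model of the hoisted loop: one bt[] read per
// block, carried forward so prev/current/next reuse the same value.
std::vector<int32_t> resolve_block_ids(const std::vector<int32_t>& bt,
                                       int bt_offset, int n_blocks) {
    std::vector<int32_t> order;
    int32_t current = read_bt(bt, bt_offset);       // drives QK[0]
    for (int i = 1; i < n_blocks; ++i) {
        int32_t next = read_bt(bt, bt_offset + i);  // one-step lookahead
        order.push_back(current);                   // QK[i-1], reused for PV[i-1]
        current = next;                             // becomes QK[i]'s id
    }
    order.push_back(current);                       // epilogue PV[n-1]
    return order;
}
```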

The step helpers gain three optional flags: current_loaded (skip the inline TLOAD because the previous step already prefetched it), has_next (issue the lookahead TLOAD), and next_kv_block_id. The prologue / steady state / epilogue in aic_process_blocks set these flags as follows:

  • Prologue QK[0]: no current_loaded, prefetches K[1].
  • Steady QK[i] (i>=1): current K[i] already loaded by previous step's prefetch; if i+1 < n, prefetch K[i+1].
  • Steady PV[i-1]: V[i-1] already loaded for i>=2; if i < n, prefetch V[i].
  • Epilogue PV[n-1]: V[n-1] already loaded by the prior PV's prefetch when n_blocks > 1.
  • Degenerate n_blocks == 1: no prefetch, single QK + PV with their own loads.
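The schedule above reduces to a small truth table over the step index and block count. The sketch below is a host-side model under that assumption; `qk_flags` and `pv_flags` are hypothetical names, not the PR's actual helpers.

```cpp
#include <cassert>

struct StepFlags {
    bool current_loaded; // skip the inline TLOAD: the previous step prefetched it
    bool has_next;       // issue the one-step lookahead TLOAD
};

// Flags for QK[i] over n blocks (i == 0 is the prologue).
StepFlags qk_flags(int i, int n) {
    return { i >= 1,      // K[i] was prefetched by QK[i-1] once i >= 1
             i + 1 < n }; // prefetch K[i+1] while it exists
}

// Flags for the PV step that consumes V[i-1]
// (i runs 1..n; i == n is the epilogue PV[n-1]).
StepFlags pv_flags(int i, int n) {
    return { i >= 2,      // V[i-1] was prefetched by the prior PV once i >= 2
             i < n };     // prefetch V[i] while it exists
}
```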

No semantic / numerical change — the loaded data and matmul order are identical, only the issue timing of the next TLOAD shifts earlier within the same step.

Performance

Measured via tools/benchmark_rounds.sh on hardware.

Before (baseline on upstream/main):

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1501.8        1501.7           7.0
  spmd_paged_attention (Case2)                     752.5         752.4           6.7

================================================================
  Benchmark complete (tensormap_and_ringbuffer): 2 passed, 0 failed (2 total)
================================================================

After (this PR):

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1316.5        1316.5           6.2
  spmd_paged_attention (Case2)                     694.4         694.4           6.1

================================================================
  Benchmark complete (tensormap_and_ringbuffer): 2 passed, 0 failed (2 total)
================================================================

Delta:

  Case     Before (us)   After (us)    Δ (us)   Improvement
  -------  -----------  -----------  --------  -----------
  Case1         1501.8       1316.5    -185.3       -12.3%
  Case2          752.5        694.4     -58.1        -7.7%
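As a quick sanity check of the improvement column, the percentages follow from (after - before) / before. A minimal check (hypothetical helper, not part of the benchmark tooling):

```cpp
#include <cassert>
#include <cmath>

// Percentage change relative to the baseline, as reported in the delta table.
double pct_change(double before_us, double after_us) {
    return (after_us - before_us) / before_us * 100.0;
}
```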

Testing

  • Simulation: pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3sim
  • Hardware: pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3 --device <range>

@gemini-code-assist (Bot) left a comment
Code Review

This pull request introduces a pre-fetching mechanism in the aic_qk_step and aic_pv_step kernels to overlap data loading with computation. By passing block loading status and next block IDs, the implementation aims to improve pipeline efficiency. Feedback suggests optimizing the loop in aic_process_blocks by carrying the next block ID forward to avoid redundant global memory reads from the block table array.

@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch from 81de7e4 to 68760aa on April 30, 2026 03:10
@chenshengxin2026 changed the title from "Update: prefetch K/V and reuse block ids in spmd_paged_attention AIC" to "Perf(spmd_paged_attention): prefetch K/V and reuse block ids in AIC" on Apr 30, 2026
- AIC QK: issue TLOAD for the next block's K before the current QK's
  MTE1->M sync so the next-block fetch overlaps the current step's
  TMOV/TMATMUL/TPUSH path.
- AIC PV: same prefetch pattern for V in the PV step.
- Hoist bt[bt_offset + i] reads to aic_process_blocks and pass the
  resolved kv_block_id into the step helpers; reuse prev_block_id for
  PV[i-1] so each block id is read once, not twice.
@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch from 68760aa to 3ef6602 on April 30, 2026 03:22
@ChaoWao ChaoWao merged commit 9ac5d37 into hw-native-sys:main Apr 30, 2026
13 checks passed
chenshengxin2026 added a commit to chenshengxin2026/simpler that referenced this pull request Apr 30, 2026
PR hw-native-sys#704's K/V prefetch reused EVENT_ID0/EVENT_ID1 across both
ping-pong L1 slots and across QK/PV steps. Under multi-round
stress this let an MTE2 TLOAD into slot (i+1) race against
MTE1 reading slot i, producing intermittent precision failures
in tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention.

- Use per-slot ping-pong events for QK/PV L1 (MTE1<->MTE2) and
  L0 (MTE1<->M) so each slot has its own RAW/WAR sync and the
  prefetch for slot (i+1) cannot collide with the matmul of slot i.
- Drain TPUSH(sij) via PIPE_FIX -> PIPE_S before record() so AIV
  TPOP(sij) only observes a fully written GM FIFO.
- Move the next-block TLOAD in the QK step to after record() --
  the earliest safe point that avoids reintroducing a coarse
  PIPE_ALL barrier.

Verified: 20-round stress + standard precision pass on a2a3.
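The per-slot ping-pong scheme from that fix can be sketched as parity-based event selection. This is illustrative only: the real kernel sets/waits hardware EVENT_IDs across PIPE_MTE2/PIPE_MTE1 rather than returning an int.

```cpp
#include <cassert>

// Two events stand in for EVENT_ID0 / EVENT_ID1.
constexpr int kPingPongEvents = 2;

// Pick the sync event by L1 slot parity, so the prefetch writing slot
// (i + 1) and the matmul-side read of slot i never share an event, and
// the RAW/WAR sync of one slot cannot be consumed by its neighbor.
int event_for_slot(int slot) { return slot % kPingPongEvents; }
```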
