Perf(spmd_paged_attention): prefetch K/V and reuse block ids in AIC #704

Merged

ChaoWao merged 1 commit into hw-native-sys:main from chenshengxin2026:feat/spmd-pa-aic-kv-prefetch on Apr 30, 2026

Conversation

@chenshengxin2026 (Contributor) commented Apr 30, 2026

Summary

Performance optimization for tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention, targeting the AIC pipeline in paged_attention_parallel.cpp.

  • AIC QK prefetch: issue TLOAD for the next block's K before the current QK's MTE1 -> M sync, so the next-block fetch overlaps the current step's TMOV / TMATMUL / TPUSH path on PIPE_MTE2.
  • AIC PV prefetch: same prefetch pattern for V in the PV step (uses EVENT_ID1 to avoid aliasing with QK's EVENT_ID0).
  • Reuse already-read block ids: hoist bt[bt_offset + i] reads up to aic_process_blocks and pass the resolved kv_block_id into the step helpers. Each block id is read from bt[] once and reused — prev_block_id feeds PV[i-1], current block_id feeds QK[i], and a one-step next_block_id lookahead drives the prefetch.
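The block-id hoisting can be modeled host-side as follows. This is a minimal illustration, not the kernel's actual API: `resolve_block_ids` and `read_bt` are hypothetical stand-ins, and the instrumentation exists only to make the "read once" property checkable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts bt[] accesses so the "each block id is read once" claim is checkable.
static int bt_reads = 0;

static int32_t read_bt(const std::vector<int32_t>& bt, int idx) {
    ++bt_reads;
    return bt[idx];
}

// Hypothetical host-side model of the hoisted loop: one bt[] read per
// block, carried forward so prev/current/next reuse the same value.
std::vector<int32_t> resolve_block_ids(const std::vector<int32_t>& bt,
                                       int bt_offset, int n_blocks) {
    std::vector<int32_t> order;
    int32_t current = read_bt(bt, bt_offset);       // drives QK[0]
    for (int i = 1; i < n_blocks; ++i) {
        int32_t next = read_bt(bt, bt_offset + i);  // one-step lookahead
        order.push_back(current);                   // QK[i-1], reused for PV[i-1]
        current = next;                             // becomes QK[i]'s id
    }
    order.push_back(current);                       // epilogue PV[n-1]
    return order;
}
```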

The step helpers gain three optional flags: current_loaded (skip the inline TLOAD because the previous step already prefetched it), has_next (issue the lookahead TLOAD), and next_kv_block_id. The prologue / steady state / epilogue in aic_process_blocks set these flags as follows:

  • Prologue QK[0]: no current_loaded, prefetches K[1].
  • Steady QK[i] (i>=1): current K[i] already loaded by previous step's prefetch; if i+1 < n, prefetch K[i+1].
  • Steady PV[i-1]: V[i-1] already loaded for i>=2; if i < n, prefetch V[i].
  • Epilogue PV[n-1]: V[n-1] already loaded by the prior PV's prefetch when n_blocks > 1.
  • Degenerate n_blocks == 1: no prefetch, single QK + PV with their own loads.
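The schedule above reduces to a small truth table over the step index and block count. The sketch below is a host-side model under that assumption; `qk_flags` and `pv_flags` are hypothetical names, not the PR's actual helpers.

```cpp
#include <cassert>

struct StepFlags {
    bool current_loaded; // skip the inline TLOAD: the previous step prefetched it
    bool has_next;       // issue the one-step lookahead TLOAD
};

// Flags for QK[i] over n blocks (i == 0 is the prologue).
StepFlags qk_flags(int i, int n) {
    return { i >= 1,      // K[i] was prefetched by QK[i-1] once i >= 1
             i + 1 < n }; // prefetch K[i+1] while it exists
}

// Flags for the PV step that consumes V[i-1]
// (i runs 1..n; i == n is the epilogue PV[n-1]).
StepFlags pv_flags(int i, int n) {
    return { i >= 2,      // V[i-1] was prefetched by the prior PV once i >= 2
             i < n };     // prefetch V[i] while it exists
}
```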

No semantic / numerical change — the loaded data and matmul order are identical, only the issue timing of the next TLOAD shifts earlier within the same step.

Performance

Measured via tools/benchmark_rounds.sh on hardware.

Before (baseline on upstream/main):

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1501.8        1501.7           7.0
  spmd_paged_attention (Case2)                     752.5         752.4           6.7

================================================================
  Benchmark complete (tensormap_and_ringbuffer): 2 passed, 0 failed (2 total)
================================================================

After (this PR):

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1316.5        1316.5           6.2
  spmd_paged_attention (Case2)                     694.4         694.4           6.1

================================================================
  Benchmark complete (tensormap_and_ringbuffer): 2 passed, 0 failed (2 total)
================================================================

Delta:

  Case     Before (us)   After (us)    Δ (us)   Improvement
  -------  -----------  -----------  --------  -----------
  Case1         1501.8       1316.5    -185.3       -12.3%
  Case2          752.5        694.4     -58.1        -7.7%
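As a quick sanity check of the improvement column, the percentages follow from (after - before) / before. A minimal check (hypothetical helper, not part of the benchmark tooling):

```cpp
#include <cassert>
#include <cmath>

// Percentage change relative to the baseline, as reported in the delta table.
double pct_change(double before_us, double after_us) {
    return (after_us - before_us) / before_us * 100.0;
}
```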

Testing

  • Simulation: pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3sim
  • Hardware: pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3 --device <range>

@gemini-code-assist (Bot) left a comment
Code Review

This pull request introduces a pre-fetching mechanism in the aic_qk_step and aic_pv_step kernels to overlap data loading with computation. By passing block loading status and next block IDs, the implementation aims to improve pipeline efficiency. Feedback suggests optimizing the loop in aic_process_blocks by carrying the next block ID forward to avoid redundant global memory reads from the block table array.

@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch from 81de7e4 to 68760aa on April 30, 2026 03:10
@chenshengxin2026 changed the title from "Update: prefetch K/V and reuse block ids in spmd_paged_attention AIC" to "Perf(spmd_paged_attention): prefetch K/V and reuse block ids in AIC" on Apr 30, 2026
- AIC QK: issue TLOAD for the next block's K before the current QK's
  MTE1->M sync so the next-block fetch overlaps the current step's
  TMOV/TMATMUL/TPUSH path.
- AIC PV: same prefetch pattern for V in the PV step.
- Hoist bt[bt_offset + i] reads to aic_process_blocks and pass the
  resolved kv_block_id into the step helpers; reuse prev_block_id for
  PV[i-1] so each block id is read once, not twice.
@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch from 68760aa to 3ef6602 on April 30, 2026 03:22
@ChaoWao ChaoWao merged commit 9ac5d37 into hw-native-sys:main Apr 30, 2026
13 checks passed
chenshengxin2026 added a commit to chenshengxin2026/simpler that referenced this pull request Apr 30, 2026
PR hw-native-sys#704's K/V prefetch reused EVENT_ID0/EVENT_ID1 across both
ping-pong L1 slots and across QK/PV steps. Under multi-round
stress this let an MTE2 TLOAD into slot (i+1) race against
MTE1 reading slot i, producing intermittent precision failures
in tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention.

- Use per-slot ping-pong events for QK/PV L1 (MTE1<->MTE2) and
  L0 (MTE1<->M) so each slot has its own RAW/WAR sync and the
  prefetch for slot (i+1) cannot collide with the matmul of slot i.
- Drain TPUSH(sij) via PIPE_FIX -> PIPE_S before record() so AIV
  TPOP(sij) only observes a fully written GM FIFO.
- Move the next-block TLOAD in the QK step to after record() --
  the earliest safe point that avoids reintroducing a coarse
  PIPE_ALL barrier.

Verified: 20-round stress + standard precision pass on a2a3.
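The per-slot ping-pong scheme from that fix can be sketched as parity-based event selection. This is illustrative only: the real kernel sets/waits hardware EVENT_IDs across PIPE_MTE2/PIPE_MTE1 rather than returning an int.

```cpp
#include <cassert>

// Two events stand in for EVENT_ID0 / EVENT_ID1.
constexpr int kPingPongEvents = 2;

// Pick the sync event by L1 slot parity, so the prefetch writing slot
// (i + 1) and the matmul-side read of slot i never share an event, and
// the RAW/WAR sync of one slot cannot be consumed by its neighbor.
int event_for_slot(int slot) { return slot % kPingPongEvents; }
```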
