Fix(spmd_paged_attention): per-slot sync events for K/V prefetch #708

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:feat/spmd-pa-aic-kv-prefetch

@chenshengxin2026 (Contributor) commented Apr 30, 2026

Summary

Fixes the intermittent precision failure in
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention.

  • Use per-slot ping-pong events for QK/PV L1 (MTE1<->MTE2) and
    L0 (MTE1<->M) so each ping-pong slot has its own RAW/WAR sync.
  • Drain TPUSH(sij) via PIPE_FIX -> PIPE_S before record() so
    AIV TPOP(sij) only observes a fully written GM FIFO.
  • Move the next-block TLOAD in the QK step to after record() to
    avoid reintroducing a coarse PIPE_ALL barrier.
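
The per-slot idea in the bullets above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it uses Python threads and semaphores in place of the AIC pipes and hardware events, and every name in it (`run_pipeline`, `ready`, `free`) is hypothetical rather than taken from the kernel. Each ping-pong slot gets its own "data ready" (RAW) and "slot free" (WAR) signal, so a wait on one slot can never be satisfied by a signal meant for the other.

```python
import threading

NUM_SLOTS = 2  # ping-pong depth (assumed: two slots, as in a ping-pong buffer)


def run_pipeline(num_blocks):
    """Model a loader and a consumer sharing NUM_SLOTS buffers.

    Hypothetical host-side model, not the kernel's API.
    """
    buffers = [None] * NUM_SLOTS
    # One RAW ("data ready") and one WAR ("slot free") semaphore per slot.
    ready = [threading.Semaphore(0) for _ in range(NUM_SLOTS)]
    free = [threading.Semaphore(1) for _ in range(NUM_SLOTS)]
    consumed = []

    def loader():
        for block in range(num_blocks):
            slot = block % NUM_SLOTS
            free[slot].acquire()    # WAR: wait until this slot was consumed
            buffers[slot] = block   # stands in for loading a K/V block
            ready[slot].release()   # RAW: publish this slot only

    def consumer():
        for block in range(num_blocks):
            slot = block % NUM_SLOTS
            ready[slot].acquire()   # RAW: wait for this slot's data
            consumed.append(buffers[slot])
            free[slot].release()    # WAR: hand this slot back to the loader

    threads = [threading.Thread(target=loader),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

In this model the consumer always observes the blocks in issue order, because the loader cannot overwrite a slot until that slot's own "free" signal has been raised. With a single signal shared by both slots, that guarantee would depend on timing.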

Verification

task-submit --device auto --run "python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention --platform a2a3 --rounds 20"

Performance

After fix:

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1325.3        1325.3           5.8
  spmd_paged_attention (Case2)                     694.4         694.4           6.1

================================================================

Before fix:

================================================================
  Performance Summary (tensormap_and_ringbuffer)
================================================================

  Example                                   Elapsed (us)    Sched (us)     Orch (us)
  ----------------------------------------  ------------  ------------  ------------
  spmd_paged_attention (Case1)                    1316.5        1316.5           6.2
  spmd_paged_attention (Case2)                     694.4         694.4           6.1

================================================================
  Benchmark complete (tensormap_and_ringbuffer): 2 passed, 0 failed (2 total)
================================================================
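
To put the two tables side by side, the relative change in elapsed time can be computed directly from the reported numbers. The helper name `relative_change_pct` is hypothetical; the inputs are the Elapsed (us) values listed above.

```python
def relative_change_pct(before_us, after_us):
    """Percentage change in elapsed time from before to after the fix."""
    return (after_us - before_us) / before_us * 100.0


# Elapsed (us) values copied from the two Performance Summary tables.
elapsed_us = {
    "Case1": {"before": 1316.5, "after": 1325.3},
    "Case2": {"before": 694.4, "after": 694.4},
}

changes = {
    case: relative_change_pct(t["before"], t["after"])
    for case, t in elapsed_us.items()
}
```

From these single-run figures, Case1 is about 0.67% slower after the fix and Case2 is unchanged.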

Related: #704

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements ping-pong buffering and per-slot synchronization for the QK and PV steps in the paged attention kernel. Key changes include the introduction of event base constants, the use of tile arrays for double buffering, and the addition of explicit flag initialization and cleanup to manage pipeline dependencies. I have no feedback to provide as the existing review comments were purely explanatory or validating.

@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch 2 times, most recently from d2fba00 to 8b61747 on April 30, 2026 09:25

- Replace shared EVENT_ID0/EVENT_ID1 with per-slot events for QK/PV
  L1 (MTE1<->MTE2) and L0 (MTE1<->M) so each ping-pong slot has its
  own RAW/WAR sync.
- Split QK/PV L0 left/right-tile addresses into two-entry arrays
  with disjoint offsets so the slot index selects an independent
  L0 region per iteration.
- Add a dedicated PV_PIJ_EVENT for the TPOP(pij) -> TMOV(aTile_PV)
  path, decoupling pij synchronization from the V-load ping-pong.
- Move the next-block K TLOAD in the QK step to after sij record()
  so the prefetch stays outside the C2V notification critical path.
- Set and drain all eight per-slot events at function entry/exit to
  keep AIC pipeline state consistent across calls.

Verification:

    task-submit --device auto --run "python -m pytest \
      tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention \
      --platform a2a3 --rounds 20"
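
The entry/exit handling of the eight per-slot events described in the commit message can be sketched as a simple accounting model. Everything here is an assumption made for illustration: the grouping into four event families of two slots each is inferred from "QK/PV L1" and "QK/PV L0", and `FlagTable`, `event_id`, and `kernel_call` are hypothetical names, not the kernel's constants or API. The point is only that pre-setting each "slot free" event at entry lets the first wait on every slot pass, and draining them at exit leaves no pending event behind for the next call.

```python
from collections import Counter

# Assumed grouping: 2 steps (QK, PV) x 2 levels (L1, L0) x 2 slots = 8 events.
GROUPS = ("QK_L1", "QK_L0", "PV_L1", "PV_L0")
NUM_SLOTS = 2


def event_id(group, slot):
    """Hypothetical 'base + slot' numbering, one base per group."""
    return GROUPS.index(group) * NUM_SLOTS + slot


class FlagTable:
    """Counts pending set-flags per event; a wait consumes one."""

    def __init__(self):
        self.pending = Counter()

    def set(self, ev):
        self.pending[ev] += 1

    def wait(self, ev):
        if self.pending[ev] == 0:
            raise RuntimeError(f"wait on event {ev} would hang")
        self.pending[ev] -= 1


def kernel_call(flags, num_blocks):
    """Return the number of events still pending after one call."""
    all_events = [event_id(g, s) for g in GROUPS for s in range(NUM_SLOTS)]
    for ev in all_events:
        flags.set(ev)            # entry: every slot starts out free
    for block in range(num_blocks):
        slot = block % NUM_SLOTS
        for group in GROUPS:
            ev = event_id(group, slot)
            flags.wait(ev)       # producer side: slot must be free
            flags.set(ev)        # consumer side: slot is free again
    for ev in all_events:
        flags.wait(ev)           # exit: drain the leftover set per slot
    return sum(flags.pending.values())
```

Calling `kernel_call` repeatedly on the same `FlagTable` returns zero pending events each time, which is the "consistent across calls" property the commit message aims for. Dropping the entry loop makes the first `wait` raise in this model, the analogue of a hang on hardware.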

@chenshengxin2026 force-pushed the feat/spmd-pa-aic-kv-prefetch branch from 8b61747 to 2f57f06 on April 30, 2026 09:30