Summary
Add an optional entry_offset: int = 0 (byte offset) parameter to pto.tpush, pto.tpop, and their legacy tpush_to_aiv / tpop_from_aic variants, plus corresponding allocate and record boolean controls. This exposes the hardware's setEntryOffset / setAllocateStatus / setRecordStatus capabilities already used in fa_performance_kernel.cpp, allowing a single FIFO slot to be filled and consumed in multiple narrow sub-tile writes rather than one full-width write.
Motivation / use case
fa_performance_kernel.cpp splits each logical K tile ([HEAD, Tile_S1]) into kTileFactor = Tile_S1 / Cube_S1 narrow sub-matmuls. Each sub-matmul operates on a [HEAD, Cube_S1] K tile and writes a [Cube_S0, Cube_S1] QK result into a different column-offset of the same [Cube_S0, Tile_S1] FIFO slot:
// kTileFactor = 2, Cube_S1 = 128, Tile_S1 = 256
for (int sub_tile = 0; sub_tile < kTileFactor; sub_tile++) {
    GlobalDataK kGlobal(k + s1_index * HEAD_SIZE);  // narrow [HEAD, Cube_S1] load
    TLOAD(kMatTile[pingpong], kGlobal);
    pto_macro_matmul<Cube_S0, HEAD, Cube_S1>(..., AccMode::InitFinalSum);
    qkPipe.prod.setAllocateStatus(sub_tile == 0 && need_backpressure);  // reserve slot on first sub-tile
    qkPipe.prod.setRecordStatus(sub_tile == kTileFactor - 1);           // publish slot on last sub-tile
    qkPipe.prod.setEntryOffset(sub_tile * Cube_S0 * Cube_S1 * sizeof(float));
    TPUSH(qkAccTile, qkPipe);
}
This pattern enables three compounding performance benefits that are impossible to express in the current DSL:
- Smaller tile working set — [HEAD, Cube_S1] L0B tiles fit a larger Cube_K in pto_macro_matmul, reducing the TEXTRACT/TMATMUL segment count. At HEAD=128, Cube_S1=128: Cube_K=128, kSegments=1, vs Cube_K=32, kSegments=4 for a wide [HEAD, 512] tile.
- K tile double-buffering — narrow K tiles (kMatTNBuffers=2) let the MTE2 load of sub-tile i+1 overlap with the matmul of sub-tile i. Not possible with a single wide tile.
- Smaller ACC footprint — the ACC tile is [Cube_S0, Cube_S1] instead of [Cube_S0, Tile_S1], enabling ACC ping-pong between two L0C addresses.
Without this primitive, the DSL is equivalent to kTileFactor=1 (one full-width matmul per tile), which is measurably slower at the reference shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4).
Proposed API / behavior
MLIR op level — add attributes to pto.tpush and pto.tpop:
// sub_tile 0: opens the slot (allocate), nothing published yet
pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 0 : i64,
                           allocate = true, record = false }
// sub_tile 1: final narrow write, publishes the slot (record)
pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 32768 : i64,
                           allocate = false, record = true }
All three new parameters default to their current implicit values (entry_offset=0, allocate=true, record=true) so all existing call sites are unaffected.
Alternatives considered
Column subviews on RIGHT/MAT tile buffers — load a wide [HEAD, Tile_S1] tile and subview it as two [HEAD, Cube_S1] rhs operands for two matmuls. ptoas 0.32 removed the MLIR verifier restriction on these subviews, but the tmatmul lowering still emits incorrect DMA descriptors for strided RIGHT operands at runtime (separate bug report filed). Even if the codegen were fixed, this approach still loads the full wide tile (same DMA cost), cannot double-buffer K, and measured at 0.94× vs 0.97× for the current wide single-matmul path — net negative at our shapes.
Additional context
- Reference file: fa_performance_kernel.cpp, template runTFA<>, function compute_qk.
- The hardware TMPipe already supports all three controls via setAllocateStatus / setRecordStatus / setEntryOffset before each TPUSH/TPOP. This is an MLIR API surface gap only — no new compiler pass or hardware feature is needed.
- The same entry_offset pattern applies to all three FIFO directions in the FA pipeline: QK cube→vec push (column offset by sub_tile_id), P vec→cube push (column offset by row_offset), and PV cube→vec push (always entry_offset=0, included for API consistency).
- This is the primary blocking item for porting fa_performance_kernel.cpp to the DSL at its native shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4), which cannot be expressed at all today.