[Feature] pto.tpush / pto.tpop: entry_offset parameter for sub-slot partial push/pop #587

@MirkoDeVita98

Summary

Add an optional entry_offset: int = 0 (byte offset) parameter to pto.tpush, pto.tpop, and their legacy tpush_to_aiv / tpop_from_aic variants, plus corresponding allocate and record boolean controls. This exposes the hardware's setEntryOffset / setAllocateStatus / setRecordStatus capabilities already used in fa_performance_kernel.cpp, allowing a single FIFO slot to be filled and consumed in multiple narrow sub-tile writes rather than one full-width write.

Motivation / use case

fa_performance_kernel.cpp splits each logical K tile ([HEAD, Tile_S1]) into kTileFactor = Tile_S1 / Cube_S1 narrow sub-matmuls. Each sub-matmul operates on a [HEAD, Cube_S1] K tile and writes a [Cube_S0, Cube_S1] QK result into a different column-offset of the same [Cube_S0, Tile_S1] FIFO slot:

// kTileFactor = 2, Cube_S1 = 128, Tile_S1 = 256
for (int sub_tile = 0; sub_tile < kTileFactor; sub_tile++) {
    GlobalDataK kGlobal(k + s1_index * HEAD_SIZE);   // narrow [HEAD, Cube_S1] load
    TLOAD(kMatTile[pingpong], kGlobal);
    pto_macro_matmul<Cube_S0, HEAD, Cube_S1>(..., AccMode::InitFinalSum);

    qkPipe.prod.setAllocateStatus(sub_tile == 0 && need_backpressure);
    qkPipe.prod.setRecordStatus(sub_tile == kTileFactor - 1);
    qkPipe.prod.setEntryOffset(sub_tile * Cube_S0 * Cube_S1 * sizeof(float));
    TPUSH(qkAccTile, qkPipe);
}
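
The per-sub-tile control values set in the loop above can be tabulated on the host. The following is a minimal sketch, not part of the kernel: `PushControl` and `make_push_schedule` are hypothetical names, and `Cube_S0 = 64` in the usage note is an assumption chosen because it reproduces the 32768-byte offset used in the proposed MLIR example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of the three controls applied before one TPUSH.
struct PushControl {
    bool allocate;             // value passed to setAllocateStatus
    bool record;               // value passed to setRecordStatus
    std::size_t entry_offset;  // value passed to setEntryOffset, in bytes
};

// Builds the control schedule for one FIFO slot filled by k_tile_factor pushes,
// using the same expressions as the kernel loop.
std::vector<PushControl> make_push_schedule(int k_tile_factor,
                                            std::size_t cube_s0,
                                            std::size_t cube_s1,
                                            bool need_backpressure) {
    std::vector<PushControl> schedule;
    for (int sub_tile = 0; sub_tile < k_tile_factor; ++sub_tile) {
        PushControl control;
        control.allocate = (sub_tile == 0) && need_backpressure;
        control.record = (sub_tile == k_tile_factor - 1);
        control.entry_offset =
            static_cast<std::size_t>(sub_tile) * cube_s0 * cube_s1 * sizeof(float);
        schedule.push_back(control);
    }
    return schedule;
}
```

With `make_push_schedule(2, 64, 128, true)`, the first entry allocates without recording at offset 0, and the second records without allocating at offset 64 * 128 * 4 = 32768 bytes.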

This pattern enables three compounding performance benefits that are impossible to express in the current DSL:

  1. Smaller tile working set: [HEAD, Cube_S1] L0B tiles fit a larger Cube_K in pto_macro_matmul, reducing TEXTRACT/TMATMUL segment count. At HEAD=128, Cube_S1=128: Cube_K=128, kSegments=1 vs Cube_K=32, kSegments=4 for a wide [HEAD, 512] tile.
  2. K tile double-buffering: narrow K tiles (kMatTNBuffers=2) let MTE2 load of sub-tile i+1 overlap with the M matmul of sub-tile i. Not possible with a single wide tile.
  3. Smaller ACC footprint: the ACC tile is [Cube_S0, Cube_S1] instead of [Cube_S0, Tile_S1], enabling ACC ping-pong between two L0C addresses.
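
The segment counts quoted in item 1 can be checked with a few lines of arithmetic. This is a sketch under an assumption: it takes kSegments to be the HEAD reduction dimension divided by Cube_K (rounded up), which is consistent with the numbers above but is not confirmed against pto_macro_matmul. The helper names are hypothetical.

```cpp
#include <cassert>

// Assumed relation: number of K-dimension segments when the HEAD reduction
// is processed in chunks of cube_k (ceiling division).
constexpr int k_segments(int head, int cube_k) {
    return (head + cube_k - 1) / cube_k;
}

// Number of narrow sub-matmuls per logical tile (kTileFactor in the kernel).
constexpr int k_tile_factor(int tile_s1, int cube_s1) {
    return tile_s1 / cube_s1;
}

// Narrow [HEAD, Cube_S1] tile: Cube_K = 128 gives a single segment.
static_assert(k_segments(128, 128) == 1, "narrow tile: one segment");
// Wide [HEAD, 512] tile: Cube_K = 32 gives four segments.
static_assert(k_segments(128, 32) == 4, "wide tile: four segments");
// Reference shape: Tile_S1 = 256, Cube_S1 = 128.
static_assert(k_tile_factor(256, 128) == 2, "two sub-matmuls per tile");
```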

Without this primitive, the DSL is equivalent to kTileFactor=1 (one full-width matmul per tile), which is measurably slower at the reference shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4).

Proposed API / behavior

MLIR op level — add attributes to pto.tpush and pto.tpop:

pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 0 : i64,
                             allocate = true, record = false }
pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 32768 : i64,
                             allocate = false, record = true }

All three new parameters default to their current implicit values (entry_offset=0, allocate=true, record=true) so all existing call sites are unaffected.
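
To illustrate why these defaults keep existing call sites unchanged, here is a toy host-side model of a single FIFO slot. It is only a sketch: `PushAttrs` and `SlotModel` are hypothetical names, and the semantics assumed for the flags (allocate acquires the slot, record publishes it to the consumer) are inferred from the description above rather than taken from the TMPipe implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical mirror of the proposed attributes with their default values.
struct PushAttrs {
    std::int64_t entry_offset = 0;  // byte offset into the slot
    bool allocate = true;           // acquire the slot before writing (assumed)
    bool record = true;             // publish the slot after writing (assumed)
};

// Toy model of one FIFO slot; not the real TMPipe.
class SlotModel {
public:
    explicit SlotModel(std::size_t slot_bytes) : bytes_(slot_bytes, 0) {}

    void push(const std::vector<std::uint8_t>& tile, PushAttrs attrs = PushAttrs()) {
        if (attrs.allocate) {
            ready_ = false;
            ++allocations_;
        }
        assert(attrs.entry_offset >= 0);
        const std::size_t offset = static_cast<std::size_t>(attrs.entry_offset);
        assert(offset + tile.size() <= bytes_.size());
        if (!tile.empty()) {
            std::memcpy(bytes_.data() + offset, tile.data(), tile.size());
        }
        if (attrs.record) {
            ready_ = true;
        }
    }

    bool ready() const { return ready_; }
    int allocations() const { return allocations_; }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
    bool ready_ = false;
    int allocations_ = 0;
};

// Today's behavior: one full-width push with default attributes.
SlotModel push_whole(const std::vector<std::uint8_t>& tile) {
    SlotModel slot(tile.size());
    slot.push(tile);
    return slot;
}

// Proposed behavior: two half-width pushes into the same slot.
SlotModel push_in_halves(const std::vector<std::uint8_t>& tile) {
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(tile.size() / 2);
    const std::vector<std::uint8_t> lo(tile.begin(), tile.begin() + half);
    const std::vector<std::uint8_t> hi(tile.begin() + half, tile.end());
    SlotModel slot(tile.size());

    PushAttrs first;           // allocate = true, entry_offset = 0
    first.record = false;      // slot is not yet complete
    slot.push(lo, first);

    PushAttrs second;          // record = true
    second.allocate = false;   // slot was already acquired
    second.entry_offset = half;
    slot.push(hi, second);
    return slot;
}
```

In this model, `push_whole` and `push_in_halves` leave the slot with identical bytes, a single allocation, and the ready flag set, which is the equivalence the defaults are meant to preserve.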

Alternatives considered

Column subviews on RIGHT/MAT tile buffers — load a wide [HEAD, Tile_S1] tile and subview it as two [HEAD, Cube_S1] rhs operands for two matmuls. ptoas 0.32 removed the MLIR verifier restriction on these subviews, but the tmatmul lowering still emits incorrect DMA descriptors for strided RIGHT operands at runtime (separate bug report filed). Even if the codegen were fixed, this approach still loads the full wide tile (same DMA cost), cannot double-buffer K, and measured at 0.94× vs 0.97× for the current wide single-matmul path — net negative at our shapes.

Additional context

  • Reference file: fa_performance_kernel.cpp, template runTFA<>, function compute_qk.
  • The hardware TMPipe already supports all three controls via setAllocateStatus / setRecordStatus / setEntryOffset before each TPUSH/TPOP. This is an MLIR API surface gap only: no new compiler pass or hardware feature is needed.
  • The same entry_offset pattern applies to all three FIFO directions in the FA pipeline: QK cube→vec push (column offset by sub_tile_id), P vec→cube push (column offset by row_offset), and PV cube→vec push (always entry_offset=0, included for API consistency).
  • This is the primary blocking item for porting fa_performance_kernel.cpp to the DSL at its native shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4), which cannot be expressed at all today.

Labels: enhancement
