Summary
Add an optional entry_offset: int = 0 (byte offset) parameter to pto.tpush, pto.tpop, and their legacy tpush_to_aiv / tpop_from_aic variants, plus corresponding allocate and record boolean controls. This exposes the hardware's setEntryOffset / setAllocateStatus / setRecordStatus capabilities already used in fa_performance_kernel.cpp, allowing a single FIFO slot to be filled and consumed in multiple narrow sub-tile writes rather than one full-width write.
Motivation / use case
fa_performance_kernel.cpp splits each logical K tile ([HEAD, Tile_S1]) into kTileFactor = Tile_S1 / Cube_S1 narrow sub-matmuls. Each sub-matmul operates on a [HEAD, Cube_S1] K tile and writes a [Cube_S0, Cube_S1] QK result into a different column-offset of the same [Cube_S0, Tile_S1] FIFO slot:
// kTileFactor = 2, Cube_S1 = 128, Tile_S1 = 256
for (int sub_tile = 0; sub_tile < kTileFactor; sub_tile++) {
    GlobalDataK kGlobal(k + s1_index * HEAD_SIZE);  // narrow [HEAD, Cube_S1] load
    TLOAD(kMatTile[pingpong], kGlobal);
    pto_macro_matmul<Cube_S0, HEAD, Cube_S1>(..., AccMode::InitFinalSum);
    qkPipe.prod.setAllocateStatus(sub_tile == 0 && need_backpressure);  // reserve slot on first sub-tile
    qkPipe.prod.setRecordStatus(sub_tile == kTileFactor - 1);           // publish slot on last sub-tile
    qkPipe.prod.setEntryOffset(sub_tile * Cube_S0 * Cube_S1 * sizeof(float));
    TPUSH(qkAccTile, qkPipe);
}
This pattern enables three compounding performance benefits that are impossible to express in the current DSL:
- Smaller tile working set — [HEAD, Cube_S1] L0B tiles fit a larger Cube_K in pto_macro_matmul, reducing the TEXTRACT/TMATMUL segment count. At HEAD=128, Cube_S1=128: Cube_K=128, kSegments=1, vs Cube_K=32, kSegments=4 for a wide [HEAD, 512] tile.
- K tile double-buffering — narrow K tiles (kMatTNBuffers=2) let the MTE2 load of sub-tile i+1 overlap with the matmul of sub-tile i. Not possible with a single wide tile.
- Smaller ACC footprint — the ACC tile is [Cube_S0, Cube_S1] instead of [Cube_S0, Tile_S1], enabling ACC ping-pong between two L0C addresses.
Without this primitive, the DSL is equivalent to kTileFactor=1 (one full-width matmul per tile), which is measurably slower at the reference shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4).
Proposed API / behavior
MLIR op level — add attributes to pto.tpush and pto.tpop:
// sub_tile 0: opens the slot (allocate), nothing published yet
pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 0 : i64,
                           allocate = true, record = false }
// sub_tile 1: final narrow write, publishes the slot (record)
pto.tpush %qk_acc, %pipe { split = 1 : i8, entry_offset = 32768 : i64,
                           allocate = false, record = true }
All three new parameters default to their current implicit values (entry_offset=0, allocate=true, record=true) so all existing call sites are unaffected.
Alternatives considered
Column subviews on RIGHT/MAT tile buffers — load a wide [HEAD, Tile_S1] tile and subview it as two [HEAD, Cube_S1] rhs operands for two matmuls. ptoas 0.32 removed the MLIR verifier restriction on these subviews, but the tmatmul lowering still emits incorrect DMA descriptors for strided RIGHT operands at runtime (separate bug report filed). Even if the codegen were fixed, this approach still loads the full wide tile (same DMA cost), cannot double-buffer K, and measured at 0.94× vs 0.97× for the current wide single-matmul path — net negative at our shapes.
Additional context
- Reference file: fa_performance_kernel.cpp, template runTFA<>, function compute_qk.
- The hardware TMPipe already supports all three controls via setAllocateStatus / setRecordStatus / setEntryOffset before each TPUSH/TPOP. This is an MLIR API surface gap only — no new compiler pass or hardware feature is needed.
- The same entry_offset pattern applies to all three FIFO directions in the FA pipeline: QK cube→vec push (column offset by sub_tile_id), P vec→cube push (column offset by row_offset), and PV cube→vec push (always entry_offset=0, included for API consistency).
- This is the primary blocking item for porting fa_performance_kernel.cpp to the DSL at its native shape (HEAD=128, Cube_S1=128, Tile_S1=256, QK_PRELOAD=4), which cannot be expressed at all today.