
[WIP] Fix gm-pipe in spmd #1224

Open

yanghaoran29 wants to merge 1 commit into hw-native-sys:main from
yanghaoran29:fix-spmd3

Conversation

@yanghaoran29 Contributor

No description provided.

@yanghaoran29 yanghaoran29 changed the title Fix spmd3 [WIP] Fix gm-pipe in spmd Apr 29, 2026
coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR introduces SPMD (Single Program Multiple Data) support for GM-pipe buffer tensor operations. It adds infrastructure to scale buffer allocation and unpacking based on SPMD core counts, extends the compiler's IR passes to detect and handle disjoint workspace sizing, implements a new codegen API for runtime tensor scaling, and includes example models and tests demonstrating the functionality.

Changes

Cohort / File(s) — Summary

CI and Test Configuration
  .github/workflows/ci.yml
  Updated the pytest invocation to include the new SPMD GM-pipe runtime test module (test_spmd_gm_pipe_buffer.py) alongside existing tests.

Example Models
  examples/models/10_down_proj_residual_spmd_gm_pipe.py, examples/models/__init__.py
  Added a new executable example demonstrating down-projection with SPMD sharding and GM-pipe buffer verification; updated module aliases to expose down_proj_residual_spmd_gm_pipe and paged_attention_spmd.

Backend SPMD Support
  python/pypto/backend/pto_backend.py
  Modified argument unpacking for the SPMD GM-pipe buffer to compute typed GM pointers with per-core offset scaling; removed the distributed (L3+) codegen path.

IR Pass Enhancements
  src/ir/transforms/inject_gm_pipe_buffer_pass.cpp
  Updated buffer sizing logic to resolve the effective SPMD core count and multiply gm_buffer_bytes by this factor when spmd_core_num > 1, for proper per-core allocation.

Codegen API
  include/pypto/codegen/codegen_base.h
  Added a new virtual method GetTensorCreateScaleExpr() to the codegen base class, enabling subclasses to provide per-tensor scale expressions for runtime allocation sizing.

Tile Operations
  src/backend/common/pto_ops_common.cpp
  Updated tile.slice lowering to derive the result buffer type via a new helper and to embed compile-time valid-row/col values when the inferred type contains unknown dimensions.

Orchestration Codegen
  src/codegen/orchestration/orchestration_codegen.cpp, src/codegen/tensor_op_codegen.cpp
  Implemented per-tensor scale expression caching in orchestration codegen (detecting the SPMD function core_num from injected gm_pipe_buffer allocation sites); updated tensor-create codegen to apply the scale factor to first-dimension shape expressions when available (see the sketch after this table).

Runtime and Unit Tests
  tests/st/runtime/test_spmd.py, tests/ut/codegen/test_orchestration_codegen.py
  Added an SPMD golden runtime test for a matmul-with-bias kernel across GM blocks, and a unit test validating orchestration codegen block-num and scaled gm_pipe_buffer allocation matching the SPMD core count.
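To make the first-dimension scaling concrete, here is a minimal Python sketch of the behavior the Codegen API and Orchestration Codegen rows describe: when a per-tensor scale expression (the SPMD core_num) is cached for an injected gm_pipe_buffer, the emitted tensor.create multiplies dimension 0 by it. The helper below is illustrative only; it is not the project's actual GetTensorCreateScaleExpr() API, which returns the expression from a codegen subclass.

# Illustrative sketch, not the real codegen: models scaling the first
# dimension of an emitted tensor.create shape by a cached scale expression.
def emit_tensor_create_shape(shape, scale_expr=None):
    dims = [str(d) for d in shape]
    if scale_expr is not None:
        dims[0] = f"({dims[0]}) * ({scale_expr})"  # scale only dim 0
    return "{" + ", ".join(dims) + "}"

print(emit_tensor_create_shape([128, 256]))                         # {128, 256}
print(emit_tensor_create_shape([128, 256], scale_expr="core_num"))  # {(128) * (core_num), 256}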

Sequence Diagram

sequenceDiagram
    participant Orchestration as Orchestration Layer
    participant Codegen as Codegen (IR→C++)
    participant IRPass as IR Pass
    participant Backend as Backend Unpacking
    participant Runtime as Runtime Execution

    Orchestration->>IRPass: detect gm_pipe_buffer<br/>tensor allocation
    IRPass->>IRPass: resolve SPMD core_num<br/>from function attrs
    IRPass->>IRPass: scale buffer bytes<br/>by core_num
    IRPass->>Codegen: inject scaled<br/>gm_pipe_buffer param
    
    Codegen->>Codegen: cache core_num<br/>as scale expression
    Codegen->>Codegen: emit tensor.create<br/>with scaled shape
    Codegen->>Backend: generate wrapper code<br/>with SPMD unpacking
    
    Backend->>Backend: compute GM pointer offset<br/>= buffer.addr + offset<br/>* block_idx / block_num
    Backend->>Runtime: emit C++ kernel wrapper
    
    Runtime->>Runtime: execute SPMD kernel<br/>over multiple cores
    Runtime->>Runtime: each core accesses<br/>scaled buffer slice

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested reviewers

  • lyfne123

Poem

🐰 Hoppy hops and buffers scaled,
SPMD cores no longer failed!
Per-tile pointers dance with grace,
Orchestration finds its place.
GM-pipe dreams now compile bright,
Tensors flow through sharded night!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 23.91%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive. The title is vague and uses the non-descriptive prefix '[WIP]', making the core fix unclear despite mentioning 'gm-pipe' and 'spmd'. Resolution: replace '[WIP]' with a clearer, more specific title, such as 'Fix gm-pipe buffer scaling for SPMD partitioning'.
  • Description check — ❓ Inconclusive. No pull request description was provided by the author. Resolution: add a detailed description explaining the problem being fixed, the solution approach, and how the changes achieve the stated objectives.
✅ Passed checks (2 passed)
  • Linked Issues check — ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for sharding the gm_pipe_buffer in SPMD mode to prevent memory overlap between logical blocks. Key changes include updating the InjectGMSlotBufferInPlace IR pass to scale buffer sizes by the number of SPMD cores and modifying the Python backend to generate C++ code that offsets the buffer based on the block index. Additionally, a new example and a system test were added to verify this behavior. Feedback indicates that the buffer scaling logic should be moved from the IR transformation pass to the codegen phase to align with project architectural rules.

Comment on lines +475 to +479
int64_t spmd_core_num = ResolveSpmdCoreNumFromCallers(needs_gm_param, callers, func_by_name);
if (spmd_core_num > 1) {
  gm_buffer_bytes *= spmd_core_num;
}
int64_t gm_buffer_elems = (gm_buffer_bytes + 3) / 4;

Severity: high

The logic for scaling gm_buffer_bytes by spmd_core_num and calculating gm_buffer_elems should be removed from this IR transformation pass. While the current implementation attempts to address alignment issues, repository rules specify that the __gm_pipe_buffer parameter is wired to hardware allocations during codegen, not in the IR transformation pass. This pass should only be responsible for adding the parameter to function signatures.

References
  1. The __gm_pipe_buffer parameter is wired to hardware allocations during codegen, not in the IR transformation pass. The pass only needs to add the parameter to function signatures.

@coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/pypto/backend/pto_backend.py (1)

894-1001: ⚠️ Potential issue | 🔴 Critical

Distributed execution will fail: codegen no longer emits required orchestration and task artifacts.

The generate() function no longer produces orchestration/host_orch.py and next_levels/{name}/ directories that distributed_runner.py requires. When DistributedCompiledProgram.__call__() invokes execute_distributed(), it will fail with FileNotFoundError at line 140 and line 128 of distributed_runner.py.

Current state:

  • generate() emits: orchestration/{name}.cpp (C++ only), kernel_config.py
  • distributed_runner.py expects: orchestration/host_orch.py (Python module), next_levels/{name}/ (per-function directories)

Any L3+ distributed program will crash at runtime. Either:

  1. Restore Python orchestration codegen and next_levels artifact emission in generate(), OR
  2. Update distributed_runner.py and DistributedCompiledProgram to work without these artifacts
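To make the missing-artifact failure concrete, here is a minimal sketch of the check that distributed execution effectively performs. The paths come from the comment above; the function itself is illustrative, not code from distributed_runner.py.

from pathlib import Path

def assert_distributed_artifacts(out_dir, kernel_names):
    # Paths distributed_runner.py expects, per the comment above.
    orch = Path(out_dir) / "orchestration" / "host_orch.py"
    if not orch.exists():
        raise FileNotFoundError(orch)  # the failure mode described above
    for name in kernel_names:
        per_kernel = Path(out_dir) / "next_levels" / name
        if not per_kernel.is_dir():
            raise FileNotFoundError(per_kernel)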
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/pypto/backend/pto_backend.py` around lines 894 - 1001, The generate()
function no longer emits the Python orchestration module and per-function
next_levels artifacts required by distributed execution; restore emission inside
generate (after orchestration_codegen succeeds) to produce
orchestration/host_orch.py (containing the Python orchestration entrypoint used
by DistributedCompiledProgram) and create next_levels/{func_name}/ directories
with the expected per-kernel artifacts (e.g., MLIR/.pto and any wrapper files)
for each kernel in orch_func; update the block that currently writes
orchestration/{orch_func.name}.cpp and kernel_config.py (and associated
skip_ptoas logic) to also generate the host_orch.py content and create the
next_levels folders (using orch_result.func_name_to_id/func_name_to_core_type
and the existing units list to locate kernel artifacts) so distributed_runner.py
and DistributedCompiledProgram can find the files they expect.
🧹 Nitpick comments (1)
tests/ut/codegen/test_orchestration_codegen.py (1)

2283-2329: Add a dynamic-core_num variant for this regression test.

This test only locks in the ConstInt path (pl.spmd(4)). Please add a core_num-as-scalar-Var case to guard the dynamic launch-spec path and catch ordering/hoisting regressions around injected gm_pipe_buffer creates (a hedged sketch follows the prompt below).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 2283 - 2329, Add
a second sub-case in test_spmd_gm_pipe_buffer_tensor_create_scales_with_core_num
that exercises the dynamic launch-spec path by using a scalar Var for core_num
instead of the constant pl.spmd(4): create a scalar Var (e.g., core_num =
pl.Var(1, dtype=pl.INT32) or equivalent API used in tests), call with with
pl.spmd(core_num): inside SpmdGMPipeProgram.main, run the same passes and
codegen, and assert the generated code contains the gm_pipe_buffer tensor.create
shape scaled by the dynamic core_num (use a regex that matches multiplication by
core_num rather than the literal 4). Ensure the new case references the same
program/kernel (SpmdGMPipeProgram.kernel and .main) so it catches
ordering/hoisting regressions for injected gm_pipe_buffer creates.
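As a concrete illustration of the assertion shape the suggested sub-case would use: the generated string below is a made-up stand-in for codegen output, and the regex mirrors the prompt's advice to match multiplication by the symbolic core_num rather than the literal 4.

import re

# Hypothetical stand-in for generated orchestration code with a dynamic scale.
generated = "auto gm_pipe_buffer = tensor.create({(128) * (core_num), 256});"
# The suggested assertion: the shape scales by the symbolic core_num,
# not by a constant.
assert re.search(r"\*\s*\(?core_num\)?", generated)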
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/common/pto_ops_common.cpp`:
- Around line 2213-2226: The current code force-replaces the unknown marker
"v_row=?, v_col=?" in result_type using rows_const/cols_const which can produce
incorrect static valid extents; instead, call
InferSubviewTileTypeComponents(...) to compute the proper tile-type components
(or leave the result_type dynamic) and use its output to update result_type only
when it indicates an explicit valid extent; update the branch that checks
result_type.empty() && valid_row.empty() to invoke
InferSubviewTileTypeComponents(result_type, /*other needed args*/) and apply its
returned valid_row/valid_col/type components rather than performing the string
replacement on result_type with rows_const/cols_const.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 896ec4d7-9cc1-4757-ba5e-b19f2304a115

📥 Commits

Reviewing files that changed from the base of the PR and between eeb7178 and ccc2e7d.

📒 Files selected for processing (11)
  • .github/workflows/ci.yml
  • examples/models/10_down_proj_residual_spmd_gm_pipe.py
  • examples/models/__init__.py
  • include/pypto/codegen/codegen_base.h
  • python/pypto/backend/pto_backend.py
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/transforms/inject_gm_pipe_buffer_pass.cpp
  • tests/st/runtime/test_spmd.py
  • tests/ut/codegen/test_orchestration_codegen.py
✅ Files skipped from review due to trivial changes (1)
  • examples/models/__init__.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/ci.yml
  • examples/models/10_down_proj_residual_spmd_gm_pipe.py

Comment thread on src/backend/common/pto_ops_common.cpp (outdated)
@yanghaoran29 force-pushed the fix-spmd3 branch 2 times, most recently from 12e017c to fa52997 on April 30, 2026 at 03:23
@yanghaoran29 Contributor Author

SPMD gm_pipe_buffer Overlap Analysis and Fix Rationale

Background

The following down-projection pattern uses pl.spmd(...) and mixed-kernel
matmul/matmul_acc in one orchestration scope:

# Stage 7 & 8: Down projection + final residual writeback.
for db0 in pl.spmd((HIDDEN // DOWN_N_CHUNK) // 2, name_hint="down_proj_residual_spmd"):
    db = db0 * 2
    for di in pl.range(db, db + 2):
        d0 = di * DOWN_N_CHUNK
        mlp_chunk_0 = mlp_tile[:, 0 : DOWN_K_CHUNK]
        w_down_chunk_0 = w_down[0 : DOWN_K_CHUNK, d0 : d0 + DOWN_N_CHUNK]
        resid1_tile_chunk = resid1_tile[:, d0 : d0 + DOWN_N_CHUNK]
        down_acc = pl.matmul(mlp_chunk_0, w_down_chunk_0, out_dtype=pl.FP32)
        mlp_chunk_1 = mlp_tile[:, DOWN_K_CHUNK : 2 * DOWN_K_CHUNK]
        w_down_chunk_1 = w_down[DOWN_K_CHUNK : 2 * DOWN_K_CHUNK, d0 : d0 + DOWN_N_CHUNK]
        down_acc = pl.matmul_acc(down_acc, mlp_chunk_1, w_down_chunk_1)
        for ob in pl.pipeline(2, INTERMEDIATE // DOWN_K_CHUNK, stage=2):
            o0 = ob * DOWN_K_CHUNK
            down_mlp_chunk = mlp_tile[:, o0 : o0 + DOWN_K_CHUNK]
            w_down_chunk = w_down[o0 : o0 + DOWN_K_CHUNK, d0 : d0 + DOWN_N_CHUNK]
            down_acc = pl.matmul_acc(down_acc, down_mlp_chunk, w_down_chunk)
        out_chunk = pl.add(down_acc, resid1_tile_chunk)
        out = pl.assemble(out, pl.cast(out_chunk, target_type=pl.BF16), [0, d0])

For this pattern, codegen injects and uses __gm_pipe_buffer. The failure mode
appears only in the SPMD form, while the equivalent non-SPMD form passes.
Why non-SPMD can pass while SPMD fails

Key runtime behavior from the Simpler source (with direct code excerpts)

  1. Allocation path: alloc_tensors enters runtime and performs real allocation

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h:

static inline TaskOutputTensors alloc_tensors(const Arg &args) {
    PTO2Runtime *rt = pto2_current_runtime();
    if (rt->ops->is_fatal(rt)) {
        return TaskOutputTensors{};
    }
    return rt->ops->alloc_tensors(rt, args);
}

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp:

static TaskOutputTensors alloc_tensors_impl(PTO2Runtime *rt, const Arg &args) {
    return pto2_alloc_tensors(&rt->orchestrator, args);
}

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp:

TaskOutputTensors pto2_alloc_tensors(PTO2OrchestratorState *orch, const Arg &args) {
    // ...
    PTO2OutputLayout layout = pto2_calculate_output_layout(args);
    PTO2PreparedTask prepared;
    if (!pto2_prepare_task(orch, args, layout.total_output_size, 0, &prepared)) {
        return TaskOutputTensors{};
    }

    PTO2TaskDescriptor &task = *prepared.task;
    PTO2TaskPayload &payload = *prepared.payload;
    // ...
    payload.init(args, outputs, prepared.alloc_result, layout, false);
    // ...
}

This shows that alloc_tensors is not a frontend-only abstraction: it reaches the
runtime allocation code and binds the allocation results into the task payload.

  2. SPMD dispatch is one task split into many logical blocks

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp:

void SchedulerContext::build_payload(
    PTO2DispatchPayload &dispatch_payload, PTO2TaskSlotState &slot_state, PTO2SubtaskSlot subslot,
    PTO2DeferredCompletionIngressBuffer *deferred_ingress
) {
    auto &payload = *slot_state.payload;
    int n = 0;
    for (int32_t i = 0; i < payload.tensor_count; i++) {
        dispatch_payload.args[n++] = reinterpret_cast<uint64_t>(&payload.tensors[i]);
    }
    // ...
    dispatch_payload.local_context.block_idx = slot_state.next_block_idx;
    dispatch_payload.local_context.block_num = slot_state.logical_block_num;
    // ...
}

Tensor argument addresses are reused across the whole task, while each dispatch
updates block_idx/block_num.

  3. Kernel-side block identity is read from per-dispatch local context

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h:

struct LocalContext {
    int32_t block_idx;  // Logical block index within the task [0, block_num)
    int32_t block_num;  // How many logical blocks this task requires.
    // ...
};

static __aicore__ inline int32_t get_block_idx(__gm__ int64_t *args) {
    __gm__ LocalContext *ctx =
        reinterpret_cast<__gm__ LocalContext *>(static_cast<uint64_t>(args[SPMD_LOCAL_CONTEXT_INDEX]));
    return ctx->block_idx;
}

Different logical blocks therefore share one task-level argument layout but run
with different block_idx values from local context.

Consequence

  • In the non-SPMD form (block_num = 1), only one logical block uses the workspace at
    a time, so overlap cannot happen.
  • In the SPMD form (block_num > 1), unless __gm_pipe_buffer is both
    • allocated with enough total capacity for all logical blocks, and
    • indexed to a per-block slice in the kernel wrapper,
    multiple logical blocks will read and write overlapping workspace regions
    (see the sketch below).
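A minimal sketch of the overlap, using NumPy arrays to stand in for GM workspace; sizes and names here are illustrative, not the runtime API:

import numpy as np

elems_per_block, block_num = 8, 2

# Unscaled buffer: capacity for one block only, shared by every block.
unscaled = np.zeros(elems_per_block)
for block_idx in range(block_num):
    unscaled[:] = block_idx            # every block writes the same region

# Scaled buffer plus per-block slicing: disjoint regions per block.
scaled = np.zeros(elems_per_block * block_num)
for block_idx in range(block_num):
    lo = block_idx * elems_per_block
    scaled[lo : lo + elems_per_block] = block_idx

assert unscaled[0] == block_num - 1    # overlap: the last writer won
assert scaled[0] == 0 and scaled[-1] == block_num - 1  # no overlap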

What this change set fixes

  1. Allocation-side scaling (host/orchestration side)
    For the injected gm pipe tensor.create, the allocation shape is scaled by the
    launch core_num during orchestration codegen (instead of changing the
    IR-visible shape).

  2. Wrapper-side sharding (kernel arg unpacking side)
    For SPMD kernels, the generated wrapper maps the __gm_pipe_buffer pointer to
    base + block_idx * elems_per_block, which guarantees each logical block a
    disjoint workspace slice (sketched after this list).

  3. Sample alignment with the block-level sharding contract
    The added ST sample uses SPMD by block index and does not combine the UP_DOWN
    split for this gm pipe case, because the current sharding contract is
    block-level (not lane-level subblock sharding).
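A minimal sketch of the two halves together, mirroring (but not quoting) the generated wrapper's base + block_idx * elems_per_block contract; the function name and the 4-byte element size are illustrative assumptions:

# Illustrative: per-block workspace pointer over a block_num-scaled buffer.
def shard_gm_pipe_buffer(base_addr, total_elems, block_idx, block_num, elem_bytes=4):
    assert total_elems % block_num == 0, "allocation must be scaled by block_num"
    elems_per_block = total_elems // block_num
    return base_addr + block_idx * elems_per_block * elem_bytes

# A 4x-scaled buffer of 4 * 256 elements yields disjoint, adjacent slices.
addrs = [shard_gm_pipe_buffer(0x1000, 4 * 256, i, 4) for i in range(4)]
assert addrs == [0x1000, 0x1400, 0x1800, 0x1C00]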

Evidence from the added sample

Sample test:
tests/st/runtime/test_spmd.py::TestSPMDOperations::test_spmd_gm_pipe_buffer_golden

  • Before the latest sample correction (historical run):
    • result: FAIL
    • symptom: golden mismatch (1024/2048 mismatched elements)
  • After the latest commit and reinstall (current run):
    • result: PASS
    • command mode: task-submit + pytest ST

This delta is consistent with the overlap hypothesis and confirms the necessity
of:

  • per-block capacity guarantee at allocation time, and
  • per-block pointer sharding at kernel entry,
  • plus sample-side execution pattern aligned to current sharding semantics.

- Add GetTensorCreateScaleExpr; orchestration scales injected gm_pipe_buffer
  tensor.create by launch core_num; pto_backend shards __gm_pipe_buffer by block_idx.
- Compute gm_buffer_elems after upward propagation in inject_gm_pipe_buffer.
- Orchestration: skip inner scalar IVs absent from wrapper signature; map tuple
  outputs when SPMD wrappers prefix auxiliary tuple fields.
- Reject non-bijective return permutations in normalize_return_order.
- Add ST golden and UT coverage for SPMD gm_pipe_buffer path.
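On the normalize_return_order bullet above: "non-bijective" means the return permutation must hit every output index exactly once. A hypothetical sketch of such a validity check, not the actual pypto implementation:

def is_bijective_permutation(order, n):
    # Valid iff `order` is a bijection on {0, ..., n-1}.
    return sorted(order) == list(range(n))

assert is_bijective_permutation([2, 0, 1], 3)      # a valid reordering
assert not is_bijective_permutation([0, 0, 2], 3)  # 0 repeated, 1 missing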