[WIP] Fix gm-pipe in spmd #1224
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

This PR introduces SPMD (Single Program Multiple Data) support for GM-pipe buffer tensor operations. It adds infrastructure to scale buffer allocation and unpacking based on SPMD core counts, extends the compiler's IR passes to detect and handle disjoint workspace sizing, implements a new codegen API for runtime tensor scaling, and includes example models and tests demonstrating the functionality.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Orchestration as Orchestration Layer
    participant Codegen as Codegen (IR→C++)
    participant IRPass as IR Pass
    participant Backend as Backend Unpacking
    participant Runtime as Runtime Execution
    Orchestration->>IRPass: detect gm_pipe_buffer<br/>tensor allocation
    IRPass->>IRPass: resolve SPMD core_num<br/>from function attrs
    IRPass->>IRPass: scale buffer bytes<br/>by core_num
    IRPass->>Codegen: inject scaled<br/>gm_pipe_buffer param
    Codegen->>Codegen: cache core_num<br/>as scale expression
    Codegen->>Codegen: emit tensor.create<br/>with scaled shape
    Codegen->>Backend: generate wrapper code<br/>with SPMD unpacking
    Backend->>Backend: compute GM pointer offset<br/>= buffer.addr + offset<br/>* block_idx / block_num
    Backend->>Runtime: emit C++ kernel wrapper
    Runtime->>Runtime: execute SPMD kernel<br/>over multiple cores
    Runtime->>Runtime: each core accesses<br/>scaled buffer slice
```
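The per-core offset step in the diagram can be sketched as follows. This is a minimal standalone Python mirror of the arithmetic, not the generated C++ wrapper; the names `buffer_addr`, `total_bytes`, `block_idx`, and `block_num` are illustrative:

```python
def gm_slice_addr(buffer_addr: int, total_bytes: int,
                  block_idx: int, block_num: int) -> int:
    """Each SPMD core addresses a disjoint slice of the scaled buffer:
    the base address plus block_idx strides of the per-core size."""
    per_core_bytes = total_bytes // block_num
    return buffer_addr + per_core_bytes * block_idx

# 4 cores sharing a 4096-byte buffer: core 2 starts 2048 bytes in.
```

Because the buffer was scaled by `core_num` up front, these slices never overlap between logical blocks.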
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 3 failed

❌ Failed checks (1 warning, 2 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request introduces support for sharding the gm_pipe_buffer in SPMD mode to prevent memory overlap between logical blocks. Key changes include updating the InjectGMSlotBufferInPlace IR pass to scale buffer sizes by the number of SPMD cores and modifying the Python backend to generate C++ code that offsets the buffer based on the block index. Additionally, a new example and a system test were added to verify this behavior. Feedback indicates that the buffer scaling logic should be moved from the IR transformation pass to the codegen phase to align with project architectural rules.
```cpp
int64_t spmd_core_num = ResolveSpmdCoreNumFromCallers(needs_gm_param, callers, func_by_name);
if (spmd_core_num > 1) {
  gm_buffer_bytes *= spmd_core_num;
}
int64_t gm_buffer_elems = (gm_buffer_bytes + 3) / 4;
```
The logic for scaling gm_buffer_bytes by spmd_core_num and calculating gm_buffer_elems should be removed from this IR transformation pass. While the current implementation attempts to address alignment issues, repository rules specify that the __gm_pipe_buffer parameter is wired to hardware allocations during codegen, not in the IR transformation pass. This pass should only be responsible for adding the parameter to function signatures.
References
- The __gm_pipe_buffer parameter is wired to hardware allocations during codegen, not in the IR transformation pass. The pass only needs to add the parameter to function signatures.
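For reference, the arithmetic the flagged snippet performs can be sketched as a standalone Python mirror (not the project's code): scale the per-core byte count by the SPMD core count, then round up to 4-byte elements via ceiling division.

```python
def scaled_gm_buffer_elems(gm_buffer_bytes: int, spmd_core_num: int) -> int:
    """Mirror of the pass logic under review: scale bytes by the
    core count, then ceil-divide into 4-byte elements."""
    if spmd_core_num > 1:
        gm_buffer_bytes *= spmd_core_num
    return (gm_buffer_bytes + 3) // 4

# e.g. 10 bytes on 4 cores -> 40 bytes -> 10 four-byte elements
```

The review's point stands regardless of this arithmetic: per the repository rules, this computation belongs in codegen, not in the IR pass.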
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/pypto/backend/pto_backend.py (1)
Lines 894-1001: ⚠️ Potential issue | 🔴 Critical — Distributed execution will fail: codegen no longer emits required orchestration and task artifacts.

The `generate()` function no longer produces `orchestration/host_orch.py` and `next_levels/{name}/` directories that `distributed_runner.py` requires. When `DistributedCompiledProgram.__call__()` invokes `execute_distributed()`, it will fail with `FileNotFoundError` at line 140 and line 128 of `distributed_runner.py`.

Current state:

- `generate()` emits: `orchestration/{name}.cpp` (C++ only), `kernel_config.py`
- `distributed_runner.py` expects: `orchestration/host_orch.py` (Python module), `next_levels/{name}/` (per-function directories)

Any L3+ distributed program will crash at runtime. Either:

- Restore Python orchestration codegen and next_levels artifact emission in `generate()`, OR
- Update `distributed_runner.py` and `DistributedCompiledProgram` to work without these artifacts

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/pypto/backend/pto_backend.py` around lines 894 - 1001, The generate() function no longer emits the Python orchestration module and per-function next_levels artifacts required by distributed execution; restore emission inside generate (after orchestration_codegen succeeds) to produce orchestration/host_orch.py (containing the Python orchestration entrypoint used by DistributedCompiledProgram) and create next_levels/{func_name}/ directories with the expected per-kernel artifacts (e.g., MLIR/.pto and any wrapper files) for each kernel in orch_func; update the block that currently writes orchestration/{orch_func.name}.cpp and kernel_config.py (and associated skip_ptoas logic) to also generate the host_orch.py content and create the next_levels folders (using orch_result.func_name_to_id/func_name_to_core_type and the existing units list to locate kernel artifacts) so distributed_runner.py and DistributedCompiledProgram can find the files they expect.
🧹 Nitpick comments (1)
tests/ut/codegen/test_orchestration_codegen.py (1)
Lines 2283-2329: Add a dynamic `core_num` variant for this regression test.

This test only locks the ConstInt path (`pl.spmd(4)`). Please add a `core_num`-as-scalar-`Var` case to guard the dynamic launch-spec path and catch ordering/hoisting regressions around injected `gm_pipe_buffer` creates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 2283 - 2329, Add a second sub-case in test_spmd_gm_pipe_buffer_tensor_create_scales_with_core_num that exercises the dynamic launch-spec path by using a scalar Var for core_num instead of the constant pl.spmd(4): create a scalar Var (e.g., core_num = pl.Var(1, dtype=pl.INT32) or equivalent API used in tests), call with with pl.spmd(core_num): inside SpmdGMPipeProgram.main, run the same passes and codegen, and assert the generated code contains the gm_pipe_buffer tensor.create shape scaled by the dynamic core_num (use a regex that matches multiplication by core_num rather than the literal 4). Ensure the new case references the same program/kernel (SpmdGMPipeProgram.kernel and .main) so it catches ordering/hoisting regressions for injected gm_pipe_buffer creates.
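The regex assertion the prompt suggests could look roughly like this. The `generated` string below is a hypothetical stand-in for the codegen output under test (the real string comes from running the orchestration codegen); the point is matching a multiplication by the symbolic `core_num` rather than the literal 4:

```python
import re

# Hypothetical generated C++ line, for illustration only.
generated = "auto gm_pipe_buffer = tensor.create({(gm_buffer_elems) * (core_num)});"

# Match tensor.create(...) whose shape multiplies by a symbolic core_num.
pattern = re.compile(r"tensor\.create\([^;]*\*\s*\(?core_num\)?")
assert pattern.search(generated) is not None
```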
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/common/pto_ops_common.cpp`:
- Around line 2213-2226: The current code force-replaces the unknown marker
"v_row=?, v_col=?" in result_type using rows_const/cols_const which can produce
incorrect static valid extents; instead, call
InferSubviewTileTypeComponents(...) to compute the proper tile-type components
(or leave the result_type dynamic) and use its output to update result_type only
when it indicates an explicit valid extent; update the branch that checks
result_type.empty() && valid_row.empty() to invoke
InferSubviewTileTypeComponents(result_type, /*other needed args*/) and apply its
returned valid_row/valid_col/type components rather than performing the string
replacement on result_type with rows_const/cols_const.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 896ec4d7-9cc1-4757-ba5e-b19f2304a115
📒 Files selected for processing (11)
- `.github/workflows/ci.yml`
- `examples/models/10_down_proj_residual_spmd_gm_pipe.py`
- `examples/models/__init__.py`
- `include/pypto/codegen/codegen_base.h`
- `python/pypto/backend/pto_backend.py`
- `src/backend/common/pto_ops_common.cpp`
- `src/codegen/orchestration/orchestration_codegen.cpp`
- `src/codegen/tensor_op_codegen.cpp`
- `src/ir/transforms/inject_gm_pipe_buffer_pass.cpp`
- `tests/st/runtime/test_spmd.py`
- `tests/ut/codegen/test_orchestration_codegen.py`
✅ Files skipped from review due to trivial changes (1)
- `examples/models/__init__.py`
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/ci.yml
- examples/models/10_down_proj_residual_spmd_gm_pipe.py
`12e017c` to `fa52997` (Compare)
SPMD
- Add GetTensorCreateScaleExpr; orchestration scales injected gm_pipe_buffer tensor.create by launch core_num; pto_backend shards __gm_pipe_buffer by block_idx.
- Compute gm_buffer_elems after upward propagation in inject_gm_pipe_buffer.
- Orchestration: skip inner scalar IVs absent from wrapper signature; map tuple outputs when SPMD wrappers prefix auxiliary tuple fields.
- Reject non-bijective return permutations in normalize_return_order.
- Add ST golden and UT coverage for SPMD gm_pipe_buffer path.
No description provided.