fix: DeepEP env var crash + profiling support for xPyD (on top of PR#150)#151

Open
raviguptaamd wants to merge 6 commits into ROCm:develop from raviguptaamd:ravgupta/deepseek-r1-xpyd-fixes

Conversation

@raviguptaamd
Contributor

Summary

Fixes and enhancements to the xPyD disaggregated inference scripts (PR #150) discovered during DeepSeek-R1 benchmark testing across MoRI EP and DeepEP backends on MI300X:

  • Fix DeepEP crash: MoRI-specific env vars (VLLM_MORIIO_QP_PER_TRANSFER, etc.) were unconditionally passed to Docker, causing ValueError in DeepEP jobs when empty. Now conditionally passed only if set.
  • Add profiling support: New RUN_PROFILE=1 mode that captures torch.profiler traces from all nodes (prefill + decode masters) during a sample inference request.
  • Make gpu-memory-utilization configurable for DeepEP (was hardcoded to 0.8).
  • Export decode master addresses to benchmark subprocess for correct multi-node profiling.
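The conditional env-var forwarding can be sketched as follows. This is a minimal sketch, not the scripts' exact code: the helper name `collect_docker_env_args` is illustrative, and only two of the MoRI variables are listed.

```shell
# Build `docker run -e` arguments only for variables that are actually
# set, so backends that don't use them (e.g. DeepEP) never receive
# empty values that trigger a ValueError at parse time.
collect_docker_env_args() {
    DOCKER_ENV_ARGS=()
    local var
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the
        # variable whose name is stored in $var
        if [ -n "${!var:-}" ]; then
            DOCKER_ENV_ARGS+=( -e "${var}=${!var}" )
        fi
    done
}

# Example: only VLLM_MORIIO_QP_PER_TRANSFER is set, so only it is forwarded.
VLLM_MORIIO_QP_PER_TRANSFER=4
collect_docker_env_args MORI_RDMA_DEVICES VLLM_MORIIO_QP_PER_TRANSFER
# docker run "${DOCKER_ENV_ARGS[@]}" ...
```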

Benchmark Results (DeepSeek-R1 FP8, ISL=1024, OSL=1024)

| Config | Backend | Output tok/s @ con=64 | Mean TPOT (ms) | Mean TTFT (ms) |
| ------ | ------- | --------------------- | -------------- | -------------- |
| 2P/2D  | MoRI EP | 1,175 | 49.83  | 3,336  |
| 4P/4D  | MoRI EP | 1,103 | 53.93  | 3,028  |
| 1P/1D  | MoRI EP | 1,011 | 57.26  | 4,928  |
| 1P/1D  | DeepEP  | 240   | 254.77 | 11,169 |
| 2P/2D  | DeepEP  | 189   | 292.94 | 38,612 |
| 4P/4D  | DeepEP  | OOM   | n/a    | n/a    |

MoRI EP delivers 4-6x higher throughput and 3-6x lower latency vs DeepEP.

Full report: ~/test_META_DS/results/DeepSeek-R1_xPyD_Benchmark_Report.md

Test Plan

  • MoRI EP 1P/1D benchmark (con=8,16,32,64)
  • MoRI EP 2P/2D benchmark
  • MoRI EP 4P/4D benchmark
  • DeepEP 1P/1D benchmark
  • DeepEP 2P/2D benchmark
  • DeepEP 4P/4D — OOM, not viable
  • Profiling trace capture (in progress)

Made with Cursor

Enable multi-node xP/yD disaggregated prefill/decode inference with
MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join
  via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276
  at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)
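The master-vs-child launch split described above (`--kv-transfer-config` only on masters, `--headless` for child nodes) can be sketched like this. The helper name and the rank-0 convention are illustrative; the real scripts derive each node's role from the Slurm topology.

```shell
# Pick the per-node vllm serve flag: only the master node of a group
# registers with the proxy via --kv-transfer-config; child nodes join
# the data-parallel group headless, so they never register spuriously.
node_role_flag() {  # usage: node_role_flag <node_rank_within_group>
    if [ "$1" -eq 0 ]; then
        printf '%s' "--kv-transfer-config"   # master: talks to the proxy
    else
        printf '%s' "--headless"             # child: joins the DP group only
    fi
}

# vllm serve "$MODEL" "$(node_role_flag "$NODE_RANK")" ...
```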

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located),
warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE
env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager.
This captures CUDA graphs for the autoregressive decode phase only,
reducing per-token dispatch overhead while preserving eager flexibility
for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.
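A sketch of how a launcher could choose between the two modes. The helper and its 0/1 decode flag are assumptions; `VLLM_CUDAGRAPH_MODE` and `--enforce-eager` are from the commit above.

```shell
# Decode nodes may opt into CUDA graphs via VLLM_CUDAGRAPH_MODE;
# prefill nodes, and any node without the env var, stay eager.
eager_flag() {  # usage: eager_flag <is_decode_node: 0|1>
    if [ "$1" -eq 1 ] && [ -n "${VLLM_CUDAGRAPH_MODE:-}" ]; then
        printf '%s' ""                 # let vLLM capture decode graphs
    else
        printf '%s' "--enforce-eager"  # default: eager everywhere
    fi
}
```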

Made-with: Cursor
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap.
Setting a sensible default prevents empty-string failures on OCI
clusters where eth0 is the management NIC.

Made-with: Cursor
- Fix set -e abort in apply_moriio_2pd_patches.sh: move python3
  fallback out of for-loop word expansion to prevent script abort
  when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1):
  patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0]
  instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to
  Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222,
  15000) to stay aligned with in-container scripts
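The PIPESTATUS fix above comes down to the following bash behavior; `false | true` stands in for `timeout ... | tee logfile`, where `$?` would report tee's 0 even when timeout exited 124.

```shell
# After `a | b`, $? is b's exit status; the first command's status is
# only available via the PIPESTATUS array.
false | true
rc_wrong=$?                # 0 -- exit status of `true`, hides the failure

false | true
rc_right=${PIPESTATUS[0]}  # 1 -- exit status of `false`, what we wanted
```

Note that PIPESTATUS is reset by every command, so it must be read immediately after the pipeline.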

Made-with: Cursor
Patch download and verification failures now exit non-zero so that
multi-node DP runs abort early instead of proceeding unpatched.

Made-with: Cursor
Fixes and enhancements to the disaggregated inference scripts discovered
during DeepSeek-R1 xPyD benchmark testing (MoRI EP and DeepEP):

- run_xPyD_models.slurm: Make MoRI-specific env vars (MORI_RDMA_DEVICES,
  VLLM_MORIIO_QP_PER_TRANSFER, etc.) conditional — only pass to Docker
  if set. Fixes ValueError crash in DeepEP jobs from empty values.
  Add RUN_PROFILE and PROFILE_PORT propagation into containers.

- benchmark_xPyD.sh: Add optional profiling phase (RUN_PROFILE=1) that
  sends start_profile/stop_profile to prefill and decode master nodes
  with a sample inference request for trace capture.

- vllm_disagg_mori_ep.sh: Add --profiler-config to all vllm serve
  instances when RUN_PROFILE=1. Export DECODE_MASTER_ADDR and SERVE_PORT
  for the benchmark subprocess.

- vllm_disagg_server_deepep.sh: Same profiler additions. Make
  gpu-memory-utilization configurable via GPU_MEMORY_UTILIZATION env var.
  Export DECODE_MASTER_IP and SERVER_PORT for the benchmark subprocess.
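The profiling phase can be sketched as below. `/start_profile` and `/stop_profile` are the vLLM server endpoints named in this PR; the URL helper and the host/port variable names are illustrative.

```shell
# Build the profiling endpoint URL for one node.
profile_url() {  # usage: profile_url <host> <port> <start|stop>
    printf 'http://%s:%s/%s_profile' "$1" "$2" "$3"
}

# RUN_PROFILE=1 phase: start traces on both masters, send one sample
# request, then stop traces (curl calls shown commented):
# curl -s -X POST "$(profile_url "$PREFILL_MASTER_ADDR" "$SERVE_PORT" start)"
# curl -s -X POST "$(profile_url "$DECODE_MASTER_ADDR"  "$SERVE_PORT" start)"
# ... one sample inference request against the proxy ...
# curl -s -X POST "$(profile_url "$PREFILL_MASTER_ADDR" "$SERVE_PORT" stop)"
# curl -s -X POST "$(profile_url "$DECODE_MASTER_ADDR"  "$SERVE_PORT" stop)"
```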

Made-with: Cursor

Copilot AI left a comment


Pull request overview

Updates the xPyD disaggregated inference Slurm + launcher scripts to improve backend compatibility (DeepEP vs MoRI EP) and add optional torch-profiler trace capture during benchmark runs.

Changes:

  • Add RUN_PROFILE=1 mode that enables --profiler-config on servers and triggers /start_profile + /stop_profile around a sample request.
  • Avoid DeepEP crashes by conditionally passing MoRI-specific env vars into Docker only when set; make GPU memory utilization configurable.
  • Improve multi-node MoRI EP support: configurable ports, configurable GPUs-per-node sizing, runtime application of vLLM PR#39276 patch, and better cache/ulimit defaults.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| scripts/vllm_dissag/vllm_disagg_server_deepep.sh | Adds profiler args, makes GPU memory utilization configurable, exports decode/server info to benchmark for profiling. |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Enables multi-node topology, adds runtime patching + profiling support, makes ports/GPUs-per-node/config knobs configurable. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Adds host-local cache mount + memlock ulimit, expands RDMA lib mounting, and conditionally forwards MoRI env vars. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Parameterizes warmup, adds per-step timeout scaling + stall logging, and adds a profiling phase that hits profiling endpoints. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New helper to download/apply/verify vLLM PR#39276 patch at container startup. |


Comment on lines +15 to +18
PR_NUM=39276
PATCH_URL="https://github.com/vllm-project/vllm/pull/${PR_NUM}.patch"
PATCH_FILE="/tmp/vllm_pr_${PR_NUM}.patch"


Copilot AI Apr 22, 2026


This script downloads and applies the patch from a live GitHub PR URL at runtime. Because PR patch content can change over time (force-push/new commits) and requires outbound network access, startup becomes non-reproducible and can fail in restricted clusters. Consider vendoring the patch into the repo or pinning to a specific commit/tag patch URL (and optionally verifying a checksum) so the applied changes are deterministic.
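One way to make the patch application deterministic, per this suggestion, is to verify a pinned checksum before applying. This is a sketch: the helper name is assumed, and the expected hash would be recorded once in the repo alongside the script.

```shell
# Verify a downloaded patch against a pinned sha256 before applying it,
# so a force-pushed PR can never silently change what gets applied.
verify_patch() {  # usage: verify_patch <patch_file> <expected_sha256>
    local actual
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# curl -fsSL "$PATCH_URL" -o "$PATCH_FILE"
# verify_patch "$PATCH_FILE" "$PINNED_SHA256" \
#     || { echo "patch checksum mismatch, refusing to apply" >&2; exit 1; }
```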

Comment on lines +44 to +50
_GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
PREFILL_DP_SIZE=$((xP * _GPUS_PER_NODE))
DECODE_DP_SIZE=$((yD * _GPUS_PER_NODE))
DP_PARALLEL_SIZE_LOCAL=${_GPUS_PER_NODE}
PREFILL_DP_START_RANK=$(( NODE_RANK * _GPUS_PER_NODE ))
PREFILL_MASTER_ADDR=$(echo "$IPADDRS" | awk -F',' '{print $1}')
DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * 8 ))
DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * _GPUS_PER_NODE ))

Copilot AI Apr 22, 2026


GPUS_PER_NODE is treated as an integer in arithmetic expansion (dp sizing and start-rank). If it’s set to a non-integer value, bash arithmetic will error or produce incorrect ranks with hard-to-debug downstream failures. Add a small validation (e.g., numeric regex check) and fail fast with a clear error message (or fall back to 8) before using it in $(( ... )) expressions.
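The suggested validation could look like this (sketch; the helper name is assumed, and the fallback of 8 matches the script's existing default):

```shell
# Accept only positive integers for GPUS_PER_NODE; empty/unset falls
# back to 8, and anything non-numeric is rejected with a clear error
# before it can reach a $(( ... )) arithmetic expression.
validate_gpus_per_node() {  # usage: validate_gpus_per_node [value]
    local v="${1:-8}"
    if [[ "$v" =~ ^[1-9][0-9]*$ ]]; then
        printf '%s' "$v"
    else
        echo "GPUS_PER_NODE must be a positive integer, got '$v'" >&2
        return 1
    fi
}

# _GPUS_PER_NODE=$(validate_gpus_per_node "${GPUS_PER_NODE:-}") || exit 1
```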
