fix: DeepEP env var crash + profiling support for xPyD (on top of PR#150) #151

raviguptaamd wants to merge 6 commits into ROCm:develop
Conversation
Enable multi-node xP/yD disaggregated prefill/decode inference with MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276 at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located), warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker

Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
Decode nodes can now optionally use CUDA graphs via the VLLM_CUDAGRAPH_MODE env var (e.g. FULL_DECODE_ONLY), while prefill nodes always run eager. This captures CUDA graphs for the autoregressive decode phase only, reducing per-token dispatch overhead while preserving eager flexibility for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.

Made-with: Cursor
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap. Setting a sensible default prevents empty-string failures on OCI clusters where eth0 is the management NIC.

Made-with: Cursor
- Fix set -e abort in apply_moriio_2pd_patches.sh: move the python3 fallback out of for-loop word expansion to prevent a script abort when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1): patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0] instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to the Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222, 15000) to stay aligned with in-container scripts

Made-with: Cursor
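The PIPESTATUS fix above can be illustrated with a minimal bash sketch (the `sleep`/`tee` pipeline is a stand-in for the benchmark invocation):

```shell
#!/usr/bin/env bash
# When `timeout` is piped into another command, `$?` reports the LAST pipe
# stage's status, so timeout's exit code (124) is silently lost.
# PIPESTATUS[0] preserves the first stage's exit code.
timeout 1 sleep 5 | tee /dev/null
rc=${PIPESTATUS[0]}   # 124 = killed by timeout; `$?` here would be tee's 0
echo "rc=${rc}"
```

This is why the benchmark script must read `PIPESTATUS[0]` immediately after the pipeline, before any other command overwrites it.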
Patch download and verification failures now exit non-zero so that multi-node DP runs abort early instead of proceeding unpatched. Made-with: Cursor
Fixes and enhancements to the disaggregated inference scripts discovered during DeepSeek-R1 xPyD benchmark testing (MoRI EP and DeepEP):

- run_xPyD_models.slurm: Make MoRI-specific env vars (MORI_RDMA_DEVICES, VLLM_MORIIO_QP_PER_TRANSFER, etc.) conditional: only pass them to Docker if set. Fixes a ValueError crash in DeepEP jobs from empty values. Add RUN_PROFILE and PROFILE_PORT propagation into containers.
- benchmark_xPyD.sh: Add an optional profiling phase (RUN_PROFILE=1) that sends start_profile/stop_profile to the prefill and decode master nodes with a sample inference request for trace capture.
- vllm_disagg_mori_ep.sh: Add --profiler-config to all vllm serve instances when RUN_PROFILE=1. Export DECODE_MASTER_ADDR and SERVE_PORT for the benchmark subprocess.
- vllm_disagg_server_deepep.sh: Same profiler additions. Make gpu-memory-utilization configurable via the GPU_MEMORY_UTILIZATION env var. Export DECODE_MASTER_IP and SERVER_PORT for the benchmark subprocess.

Made-with: Cursor
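The conditional env-var passthrough can be sketched as follows (the array name and the two-variable list are illustrative; the real script forwards a longer list):

```shell
#!/usr/bin/env bash
# Only forward MoRI env vars that are actually set, so DeepEP jobs never
# receive empty values that trip integer parsing inside the container.
DOCKER_ENV_ARGS=()
for var in MORI_RDMA_DEVICES VLLM_MORIIO_QP_PER_TRANSFER; do
    # ${!var} is bash indirect expansion: the value of the variable named $var
    if [ -n "${!var:-}" ]; then
        DOCKER_ENV_ARGS+=(--env "${var}=${!var}")
    fi
done
# docker run "${DOCKER_ENV_ARGS[@]}" ...   # unset vars contribute no flags
echo "forwarding ${#DOCKER_ENV_ARGS[@]} env flag tokens"
```

Because unset variables add nothing to the array, a DeepEP job launched without MoRI settings gets a clean `docker run` command line.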
Pull request overview
Updates the xPyD disaggregated inference Slurm + launcher scripts to improve backend compatibility (DeepEP vs MoRI EP) and add optional torch-profiler trace capture during benchmark runs.
Changes:
- Add `RUN_PROFILE=1` mode that enables `--profiler-config` on servers and triggers `/start_profile` + `/stop_profile` around a sample request.
- Avoid DeepEP crashes by conditionally passing MoRI-specific env vars into Docker only when set; make GPU memory utilization configurable.
- Improve multi-node MoRI EP support: configurable ports, configurable GPUs-per-node sizing, runtime application of the vLLM PR #39276 patch, and better cache/ulimit defaults.
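The profiling phase in the first bullet can be sketched as below. The host/port values and the placement of the sample request are placeholders; `/start_profile` and `/stop_profile` are vLLM's server profiling endpoints, enabled here by `--profiler-config`:

```shell
#!/usr/bin/env bash
# Start torch.profiler on both master servers, run one sample request,
# then stop profiling so each server flushes its trace to disk.
PREFILL_MASTER="${PREFILL_MASTER:-prefill-master:8000}"   # placeholder host:port
DECODE_MASTER="${DECODE_MASTER:-decode-master:8001}"      # placeholder host:port
for host in "${PREFILL_MASTER}" "${DECODE_MASTER}"; do
    curl -s -m 2 -X POST "http://${host}/start_profile" || true
done
# ... send one sample /v1/completions request through the proxy here ...
for host in "${PREFILL_MASTER}" "${DECODE_MASTER}"; do
    curl -s -m 2 -X POST "http://${host}/stop_profile" || true
done
echo "profiling phase done"
```

The `|| true` guards keep the benchmark script alive if a server is slow to answer the profiling endpoints.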
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/vllm_dissag/vllm_disagg_server_deepep.sh | Adds profiler args, makes GPU memory utilization configurable, exports decode/server info to benchmark for profiling. |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Enables multi-node topology, adds runtime patching + profiling support, makes ports/GPUs-per-node/config knobs configurable. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Adds host-local cache mount + memlock ulimit, expands RDMA lib mounting, and conditionally forwards MoRI env vars. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Parameterizes warmup, adds per-step timeout scaling + stall logging, and adds a profiling phase that hits profiling endpoints. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New helper to download/apply/verify vLLM PR#39276 patch at container startup. |
```shell
PR_NUM=39276
PATCH_URL="https://github.com/vllm-project/vllm/pull/${PR_NUM}.patch"
PATCH_FILE="/tmp/vllm_pr_${PR_NUM}.patch"
```
This script downloads and applies the patch from a live GitHub PR URL at runtime. Because PR patch content can change over time (force-push/new commits) and requires outbound network access, startup becomes non-reproducible and can fail in restricted clusters. Consider vendoring the patch into the repo or pinning to a specific commit/tag patch URL (and optionally verifying a checksum) so the applied changes are deterministic.
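One way to realize the reviewer's suggestion is sketched below. This is a hedged sketch, not the script's current behavior: the stand-in patch body and the checksum-recording step stand in for vendoring the real PR #39276 diff into the repo:

```shell
#!/usr/bin/env bash
# Vendor the patch file (or pin an immutable commit URL) and verify a
# checksum recorded at vendor time, so container startup is deterministic
# and needs no outbound network access.
set -euo pipefail
PATCH_FILE="/tmp/vllm_pr_39276.patch"
printf 'stand-in patch body\n' > "${PATCH_FILE}"   # stand-in for the vendored file
# Recorded once when the patch is vendored, then hardcoded in the script:
EXPECTED_SHA256=$(sha256sum "${PATCH_FILE}" | awk '{print $1}')
# At container startup, fail fast if the file drifts from the recorded checksum:
echo "${EXPECTED_SHA256}  ${PATCH_FILE}" | sha256sum -c - \
    || { echo "patch checksum mismatch" >&2; exit 1; }
```

With the checksum hardcoded, any force-push to the upstream PR branch surfaces as an immediate, explicit failure instead of a silently different patch.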
```diff
 _GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
 PREFILL_DP_SIZE=$((xP * _GPUS_PER_NODE))
 DECODE_DP_SIZE=$((yD * _GPUS_PER_NODE))
 DP_PARALLEL_SIZE_LOCAL=${_GPUS_PER_NODE}
 PREFILL_DP_START_RANK=$(( NODE_RANK * _GPUS_PER_NODE ))
 PREFILL_MASTER_ADDR=$(echo "$IPADDRS" | awk -F',' '{print $1}')
-DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * 8 ))
+DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * _GPUS_PER_NODE ))
```
GPUS_PER_NODE is treated as an integer in arithmetic expansion (dp sizing and start-rank). If it’s set to a non-integer value, bash arithmetic will error or produce incorrect ranks with hard-to-debug downstream failures. Add a small validation (e.g., numeric regex check) and fail fast with a clear error message (or fall back to 8) before using it in $(( ... )) expressions.
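A guard along the lines suggested could look like this (variable names mirror the script; the `xP` value is illustrative):

```shell
#!/usr/bin/env bash
# Validate GPUS_PER_NODE before it reaches any $(( ... )) arithmetic,
# warning and falling back to the default of 8 on non-numeric input.
xP="${xP:-2}"                          # illustrative prefill node count
_GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
if ! [[ "${_GPUS_PER_NODE}" =~ ^[1-9][0-9]*$ ]]; then
    echo "WARN: GPUS_PER_NODE='${GPUS_PER_NODE:-}' is not a positive integer; using 8" >&2
    _GPUS_PER_NODE=8
fi
PREFILL_DP_SIZE=$((xP * _GPUS_PER_NODE))
echo "PREFILL_DP_SIZE=${PREFILL_DP_SIZE}"
```

Failing fast here keeps a typo in the Slurm environment from surfacing later as a confusing rank mismatch across nodes.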
Summary
Fixes and enhancements to the xPyD disaggregated inference scripts (PR #150) discovered during DeepSeek-R1 benchmark testing across MoRI EP and DeepEP backends on MI300X:
- MoRI-specific env vars (`MORI_RDMA_DEVICES`, `VLLM_MORIIO_QP_PER_TRANSFER`, etc.) were unconditionally passed to Docker, causing a `ValueError` in DeepEP jobs when empty. They are now passed only if set.
- Add `RUN_PROFILE=1` mode that captures `torch.profiler` traces from all nodes (prefill + decode masters) during a sample inference request.
- Make `gpu-memory-utilization` configurable for DeepEP (was hardcoded to 0.8).

Benchmark Results (DeepSeek-R1 FP8, ISL=1024, OSL=1024)
MoRI EP delivers 4-6x higher throughput and 3-6x lower latency vs DeepEP.
Full report: `~/test_META_DS/results/DeepSeek-R1_xPyD_Benchmark_Report.md`

Test Plan
Made with Cursor