fix: DeepEP env var crash + profiling support for xPyD (on top of PR#150)#151

Open
raviguptaamd wants to merge 6 commits into ROCm:develop from raviguptaamd:ravgupta/deepseek-r1-xpyd-fixes

Conversation

@raviguptaamd
Contributor

Summary

Fixes and enhancements to the xPyD disaggregated inference scripts (PR #150) discovered during DeepSeek-R1 benchmark testing across MoRI EP and DeepEP backends on MI300X:

  • Fix DeepEP crash: MoRI-specific env vars (VLLM_MORIIO_QP_PER_TRANSFER, etc.) were unconditionally passed to Docker, causing ValueError in DeepEP jobs when empty. Now conditionally passed only if set.
  • Add profiling support: New RUN_PROFILE=1 mode that captures torch.profiler traces from all nodes (prefill + decode masters) during a sample inference request.
  • Make gpu-memory-utilization configurable for DeepEP (was hardcoded to 0.8).
  • Export decode master addresses to benchmark subprocess for correct multi-node profiling.
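The conditional env-var forwarding can be sketched as follows. This is a minimal sketch, not the scripts' exact code: the helper name `collect_docker_env_args` is illustrative, and only two of the MoRI variables are listed.

```shell
# Build `docker run -e` arguments only for variables that are actually
# set, so backends that don't use them (e.g. DeepEP) never receive
# empty values that trigger a ValueError at parse time.
collect_docker_env_args() {
    DOCKER_ENV_ARGS=()
    local var
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the
        # variable whose name is stored in $var
        if [ -n "${!var:-}" ]; then
            DOCKER_ENV_ARGS+=( -e "${var}=${!var}" )
        fi
    done
}

# Example: only VLLM_MORIIO_QP_PER_TRANSFER is set, so only it is forwarded.
VLLM_MORIIO_QP_PER_TRANSFER=4
collect_docker_env_args MORI_RDMA_DEVICES VLLM_MORIIO_QP_PER_TRANSFER
# docker run "${DOCKER_ENV_ARGS[@]}" ...
```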

Benchmark Results (DeepSeek-R1 FP8, ISL=1024, OSL=1024)

| Config | Backend | Output tok/s @ con=64 | Mean TPOT (ms) | Mean TTFT (ms) |
| ------ | ------- | --------------------- | -------------- | -------------- |
| 2P/2D  | MoRI EP | 1,175 | 49.83  | 3,336  |
| 4P/4D  | MoRI EP | 1,103 | 53.93  | 3,028  |
| 1P/1D  | MoRI EP | 1,011 | 57.26  | 4,928  |
| 1P/1D  | DeepEP  | 240   | 254.77 | 11,169 |
| 2P/2D  | DeepEP  | 189   | 292.94 | 38,612 |
| 4P/4D  | DeepEP  | OOM   | n/a    | n/a    |

MoRI EP delivers 4-6x higher throughput and 3-6x lower latency vs DeepEP.

Full report: ~/test_META_DS/results/DeepSeek-R1_xPyD_Benchmark_Report.md

Test Plan

  • MoRI EP 1P/1D benchmark (con=8,16,32,64)
  • MoRI EP 2P/2D benchmark
  • MoRI EP 4P/4D benchmark
  • DeepEP 1P/1D benchmark
  • DeepEP 2P/2D benchmark
  • DeepEP 4P/4D — OOM, not viable
  • Profiling trace capture (in progress)

Made with Cursor

Enable multi-node xP/yD disaggregated prefill/decode inference with
MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join
  via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276
  at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)
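The master-vs-child launch split described above (`--kv-transfer-config` only on masters, `--headless` for child nodes) can be sketched like this. The helper name and the rank-0 convention are illustrative; the real scripts derive each node's role from the Slurm topology.

```shell
# Pick the per-node vllm serve flag: only the master node of a group
# registers with the proxy via --kv-transfer-config; child nodes join
# the data-parallel group headless, so they never register spuriously.
node_role_flag() {  # usage: node_role_flag <node_rank_within_group>
    if [ "$1" -eq 0 ]; then
        printf '%s' "--kv-transfer-config"   # master: talks to the proxy
    else
        printf '%s' "--headless"             # child: joins the DP group only
    fi
}

# vllm serve "$MODEL" "$(node_role_flag "$NODE_RANK")" ...
```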

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located),
warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE
env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager.
This captures CUDA graphs for the autoregressive decode phase only,
reducing per-token dispatch overhead while preserving eager flexibility
for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.
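A sketch of how a launcher could choose between the two modes. The helper and its 0/1 decode flag are assumptions; `VLLM_CUDAGRAPH_MODE` and `--enforce-eager` are from the commit above.

```shell
# Decode nodes may opt into CUDA graphs via VLLM_CUDAGRAPH_MODE;
# prefill nodes, and any node without the env var, stay eager.
eager_flag() {  # usage: eager_flag <is_decode_node: 0|1>
    if [ "$1" -eq 1 ] && [ -n "${VLLM_CUDAGRAPH_MODE:-}" ]; then
        printf '%s' ""                 # let vLLM capture decode graphs
    else
        printf '%s' "--enforce-eager"  # default: eager everywhere
    fi
}
```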

Made-with: Cursor
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap.
Setting a sensible default prevents empty-string failures on OCI
clusters where eth0 is the management NIC.

Made-with: Cursor
- Fix set -e abort in apply_moriio_2pd_patches.sh: move python3
  fallback out of for-loop word expansion to prevent script abort
  when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1):
  patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0]
  instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to
  Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222,
  15000) to stay aligned with in-container scripts
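The PIPESTATUS fix above comes down to the following bash behavior; `false | true` stands in for `timeout ... | tee logfile`, where `$?` would report tee's 0 even when timeout exited 124.

```shell
# After `a | b`, $? is b's exit status; the first command's status is
# only available via the PIPESTATUS array.
false | true
rc_wrong=$?                # 0 -- exit status of `true`, hides the failure

false | true
rc_right=${PIPESTATUS[0]}  # 1 -- exit status of `false`, what we wanted
```

Note that PIPESTATUS is reset by every command, so it must be read immediately after the pipeline.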

Made-with: Cursor
Patch download and verification failures now exit non-zero so that
multi-node DP runs abort early instead of proceeding unpatched.

Made-with: Cursor
Fixes and enhancements to the disaggregated inference scripts discovered
during DeepSeek-R1 xPyD benchmark testing (MoRI EP and DeepEP):

- run_xPyD_models.slurm: Make MoRI-specific env vars (MORI_RDMA_DEVICES,
  VLLM_MORIIO_QP_PER_TRANSFER, etc.) conditional — only pass to Docker
  if set. Fixes ValueError crash in DeepEP jobs from empty values.
  Add RUN_PROFILE and PROFILE_PORT propagation into containers.

- benchmark_xPyD.sh: Add optional profiling phase (RUN_PROFILE=1) that
  sends start_profile/stop_profile to prefill and decode master nodes
  with a sample inference request for trace capture.

- vllm_disagg_mori_ep.sh: Add --profiler-config to all vllm serve
  instances when RUN_PROFILE=1. Export DECODE_MASTER_ADDR and SERVE_PORT
  for the benchmark subprocess.

- vllm_disagg_server_deepep.sh: Same profiler additions. Make
  gpu-memory-utilization configurable via GPU_MEMORY_UTILIZATION env var.
  Export DECODE_MASTER_IP and SERVER_PORT for the benchmark subprocess.
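The profiling phase can be sketched as below. `/start_profile` and `/stop_profile` are the vLLM server endpoints named in this PR; the URL helper and the host/port variable names are illustrative.

```shell
# Build the profiling endpoint URL for one node.
profile_url() {  # usage: profile_url <host> <port> <start|stop>
    printf 'http://%s:%s/%s_profile' "$1" "$2" "$3"
}

# RUN_PROFILE=1 phase: start traces on both masters, send one sample
# request, then stop traces (curl calls shown commented):
# curl -s -X POST "$(profile_url "$PREFILL_MASTER_ADDR" "$SERVE_PORT" start)"
# curl -s -X POST "$(profile_url "$DECODE_MASTER_ADDR"  "$SERVE_PORT" start)"
# ... one sample inference request against the proxy ...
# curl -s -X POST "$(profile_url "$PREFILL_MASTER_ADDR" "$SERVE_PORT" stop)"
# curl -s -X POST "$(profile_url "$DECODE_MASTER_ADDR"  "$SERVE_PORT" stop)"
```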

Made-with: Cursor

Copilot AI left a comment


Pull request overview

Updates the xPyD disaggregated inference Slurm + launcher scripts to improve backend compatibility (DeepEP vs MoRI EP) and add optional torch-profiler trace capture during benchmark runs.

Changes:

  • Add RUN_PROFILE=1 mode that enables --profiler-config on servers and triggers /start_profile + /stop_profile around a sample request.
  • Avoid DeepEP crashes by conditionally passing MoRI-specific env vars into Docker only when set; make GPU memory utilization configurable.
  • Improve multi-node MoRI EP support: configurable ports, configurable GPUs-per-node sizing, runtime application of vLLM PR#39276 patch, and better cache/ulimit defaults.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| scripts/vllm_dissag/vllm_disagg_server_deepep.sh | Adds profiler args, makes GPU memory utilization configurable, exports decode/server info to benchmark for profiling. |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Enables multi-node topology, adds runtime patching + profiling support, makes ports/GPUs-per-node/config knobs configurable. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Adds host-local cache mount + memlock ulimit, expands RDMA lib mounting, and conditionally forwards MoRI env vars. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Parameterizes warmup, adds per-step timeout scaling + stall logging, and adds a profiling phase that hits profiling endpoints. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New helper to download/apply/verify vLLM PR#39276 patch at container startup. |


Comment on lines +15 to +18
PR_NUM=39276
PATCH_URL="https://github.com/vllm-project/vllm/pull/${PR_NUM}.patch"
PATCH_FILE="/tmp/vllm_pr_${PR_NUM}.patch"


Copilot AI Apr 22, 2026


This script downloads and applies the patch from a live GitHub PR URL at runtime. Because PR patch content can change over time (force-push/new commits) and requires outbound network access, startup becomes non-reproducible and can fail in restricted clusters. Consider vendoring the patch into the repo or pinning to a specific commit/tag patch URL (and optionally verifying a checksum) so the applied changes are deterministic.
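One way to make the patch application deterministic, per this suggestion, is to verify a pinned checksum before applying. This is a sketch: the helper name is assumed, and the expected hash would be recorded once in the repo alongside the script.

```shell
# Verify a downloaded patch against a pinned sha256 before applying it,
# so a force-pushed PR can never silently change what gets applied.
verify_patch() {  # usage: verify_patch <patch_file> <expected_sha256>
    local actual
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# curl -fsSL "$PATCH_URL" -o "$PATCH_FILE"
# verify_patch "$PATCH_FILE" "$PINNED_SHA256" \
#     || { echo "patch checksum mismatch, refusing to apply" >&2; exit 1; }
```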

Comment on lines +44 to +50
_GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
PREFILL_DP_SIZE=$((xP * _GPUS_PER_NODE))
DECODE_DP_SIZE=$((yD * _GPUS_PER_NODE))
DP_PARALLEL_SIZE_LOCAL=${_GPUS_PER_NODE}
PREFILL_DP_START_RANK=$(( NODE_RANK * _GPUS_PER_NODE ))
PREFILL_MASTER_ADDR=$(echo "$IPADDRS" | awk -F',' '{print $1}')
DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * 8 ))
DECODE_DP_START_RANK=$(( (NODE_RANK - xP) * _GPUS_PER_NODE ))

Copilot AI Apr 22, 2026


GPUS_PER_NODE is treated as an integer in arithmetic expansion (dp sizing and start-rank). If it’s set to a non-integer value, bash arithmetic will error or produce incorrect ranks with hard-to-debug downstream failures. Add a small validation (e.g., numeric regex check) and fail fast with a clear error message (or fall back to 8) before using it in $(( ... )) expressions.
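The suggested validation could look like this (sketch; the helper name is assumed, and the fallback of 8 matches the script's existing default):

```shell
# Accept only positive integers for GPUS_PER_NODE; empty/unset falls
# back to 8, and anything non-numeric is rejected with a clear error
# before it can reach a $(( ... )) arithmetic expression.
validate_gpus_per_node() {  # usage: validate_gpus_per_node [value]
    local v="${1:-8}"
    if [[ "$v" =~ ^[1-9][0-9]*$ ]]; then
        printf '%s' "$v"
    else
        echo "GPUS_PER_NODE must be a positive integer, got '$v'" >&2
        return 1
    fi
}

# _GPUS_PER_NODE=$(validate_gpus_per_node "${GPUS_PER_NODE:-}") || exit 1
```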
