Add profiler record_function annotation to HybridW4A16MoEExperts decode #896
roberteg16 wants to merge 3 commits into matthias.moe-perf-opt
Conversation
Adds `fused_moe_wvSplitK_int4_gemm`, which dispatches expert blocks via `blockIdx.y` on-device, eliminating host-side loops and GPU-CPU sync. Weights are in skinny layout `[E, N, K//8]` int32 (ExLlama shuffle). Key optimizations for RDNA 3.5 decode (batch=1):

- Use all CUs per expert block for maximum bandwidth
- YTILE=2 for N=1 decode (better occupancy than YTILE=1 or 4)
- Reduced LDS allocation (16KB vs 64KB) for higher occupancy
- Non-temporal weight loads to avoid L1 pollution
- Scattered mode with `sorted_token_ids` for decode without pre-permutation

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
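The skinny `[E, N, K//8]` int32 layout above packs eight int4 values per int32 along the K dimension. A minimal NumPy sketch of that packing (simplified: nibbles are packed low-end first, and the ExLlama shuffle permutation the kernel additionally applies is omitted; `pack_int4_skinny` is a hypothetical name, not the PR's code):

```python
import numpy as np

def pack_int4_skinny(w: np.ndarray) -> np.ndarray:
    """Pack unsigned int4 weights [E, N, K] into skinny [E, N, K//8] int32.

    Sketch only: packs 8 consecutive nibbles per int32, lowest nibble
    first. The real path also applies the ExLlama shuffle permutation,
    which is omitted here for clarity.
    """
    E, N, K = w.shape
    assert K % 8 == 0
    w = w.astype(np.uint32) & 0xF            # keep the low nibble of each value
    w = w.reshape(E, N, K // 8, 8)           # group 8 nibbles per output word
    shifts = np.arange(8, dtype=np.uint32) * 4
    packed = (w << shifts).sum(axis=-1, dtype=np.uint32)
    return packed.view(np.int32)

# Example: 1 expert, 2 output rows, K=8 -> one int32 per row
w = np.arange(16, dtype=np.uint8).reshape(1, 2, 8)
packed = pack_int4_skinny(w)
```

Unpacking is the mirror image: shift right by `4 * i` and mask with `0xF`, which is the shift+mask nibble extraction the Triton path performs on-device.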
Dispatch MoE INT4 GEMM based on batch size: Triton for prefill (M>5), HIP wvSplitK for decode (M<=5). Both read from the same shuffle-packed `[E, N, K//8]` int32 weights; no weight duplication.

The Triton path adds `use_shuffle_w4a16` to `fused_moe_kernel_gptq_awq`, which unpacks ExLlama-shuffled int32 via `tl.interleave`, then extracts nibbles with shift+mask. Scales are `[E, N, K//G]`, symmetric only.

Weight processing converts GPTQ `[E, K/8, N]` to skinny `[E, N, K//8]` with ExLlama shuffle packing at load time. Enabled by default on ROCm via `VLLM_MOE_HYBRID_W4A16=true`.

Qwen3-Omni-30B-A3B AWQ on Strix Halo (vs. exllama baseline):

- TPOT: 14.51ms → 13.73ms (-5.4%)
- TTFT: 996ms → 841ms (-15.6%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
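The batch-size dispatch described above can be sketched as a small selector (a hypothetical free function for illustration; in the actual PR the branch lives inside the fused MoE layer, and both callables consume the same shuffle-packed weights):

```python
from typing import Callable

# Threshold from the commit message: M <= 5 is treated as decode.
M_DECODE_MAX = 5

def select_moe_gemm(m: int,
                    triton_path: Callable,
                    hip_wvsplitk_path: Callable) -> Callable:
    """Pick the MoE INT4 GEMM backend by batch size M.

    Hypothetical helper: Triton handles prefill (M > 5), the HIP
    wvSplitK kernel handles decode (M <= 5). No weight repacking is
    needed when switching, since both backends read the same skinny
    [E, N, K//8] int32 layout.
    """
    return hip_wvsplitk_path if m <= M_DECODE_MAX else triton_path
```

The key design point is that the threshold only selects a kernel, never a weight format, so switching between prefill and decode is free.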
Wrap the two `fused_moe_wvSplitK_int4_gemm` calls on the HIP decode path with `record_function_or_nullcontext` so profiler traces attribute time to named events of the form `fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>`, matching the convention used by the GPTQ/AWQ and conch MoE paths. Made-with: Cursor
Force-pushed from d4ed7ea to 39132b5
Force-pushed from 39132b5 to f898e10
Summary
Wrap the two `fused_moe_wvSplitK_int4_gemm` calls on the HIP decode path of `HybridW4A16MoEExperts` with `record_function_or_nullcontext` so profiler traces attribute time to named events of the form:

`fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>`
This matches the convention already used by the GPTQ/AWQ and conch MoE paths (e.g. `fused_moe_gptq_awq MxNxK E=... top_k=...`), and lets the rocm-scripts profile-bandwidth summary (and other downstream regex-based tooling) pick up per-shape stats for the int4 wvSplitK MoE decode path.
No functional change; only profiler annotations are added.
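The annotation pattern can be sketched as follows. This assumes `record_function_or_nullcontext` takes a single name string and returns either a `torch.profiler.record_function` context or a no-op; the `enabled` parameter and the gating logic here are illustrative, not vLLM's actual signature:

```python
from contextlib import nullcontext

from torch.profiler import record_function


def record_function_or_nullcontext(name: str, enabled: bool = True):
    """Sketch of the helper named in this PR (signature assumed).

    Returns a profiler record_function context when enabled, so the
    event shows up under `name` in profiler tables, and a no-op
    context otherwise.
    """
    return record_function(name) if enabled else nullcontext()


# Usage on the decode path (shapes illustrative):
M, N, K, E, top_k = 1, 4096, 4096, 128, 8
with record_function_or_nullcontext(
        f"fused_moe_wvsplitk_int4 {M}x{N}x{K} E={E} top_k={top_k}"):
    pass  # fused_moe_wvSplitK_int4_gemm(...) would run here
```

Because the event name embeds M, N, K, E, and top_k, regex-based tooling can aggregate per-shape timings directly from the profiler table without touching the kernel code.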
Test plan
Ran with `--profile` / `torch_profiler_record_shapes=True`; confirmed the profiler table now shows `fused_moe_wvsplitk_int4 MxNxK E=... top_k=...` entries for GEMM1 and GEMM2 of the HIP decode path.
Made with Cursor