Add profiler record_function annotation to HybridW4A16MoEExperts decode #896
roberteg16 wants to merge 3 commits into matthias.moe-perf-opt
Conversation
Adds `fused_moe_wvSplitK_int4_gemm`, which dispatches expert blocks via `blockIdx.y` on-device, eliminating host-side loops and GPU-CPU sync. Weights are in skinny layout `[E, N, K//8]` int32 (ExLlama shuffle). Key optimizations for RDNA 3.5 decode (batch=1):

- Use all CUs per expert block for maximum bandwidth
- YTILE=2 for N=1 decode (better occupancy than YTILE=1 or 4)
- Reduced LDS allocation (16KB vs 64KB) for higher occupancy
- Non-temporal weight loads to avoid L1 pollution
- Scattered mode with `sorted_token_ids` for decode without pre-permutation

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
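The skinny `[E, N, K//8]` int32 layout above packs eight int4 values per int32 along the K dimension. A minimal NumPy sketch of that packing (simplified: nibbles are packed low-end first, and the ExLlama shuffle permutation the kernel additionally applies is omitted; `pack_int4_skinny` is a hypothetical name, not the PR's code):

```python
import numpy as np

def pack_int4_skinny(w: np.ndarray) -> np.ndarray:
    """Pack unsigned int4 weights [E, N, K] into skinny [E, N, K//8] int32.

    Sketch only: packs 8 consecutive nibbles per int32, lowest nibble
    first. The real path also applies the ExLlama shuffle permutation,
    which is omitted here for clarity.
    """
    E, N, K = w.shape
    assert K % 8 == 0
    w = w.astype(np.uint32) & 0xF            # keep the low nibble of each value
    w = w.reshape(E, N, K // 8, 8)           # group 8 nibbles per output word
    shifts = np.arange(8, dtype=np.uint32) * 4
    packed = (w << shifts).sum(axis=-1, dtype=np.uint32)
    return packed.view(np.int32)

# Example: 1 expert, 2 output rows, K=8 -> one int32 per row
w = np.arange(16, dtype=np.uint8).reshape(1, 2, 8)
packed = pack_int4_skinny(w)
```

Unpacking is the mirror image: shift right by `4 * i` and mask with `0xF`, which is the shift+mask nibble extraction the Triton path performs on-device.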
Dispatch MoE INT4 GEMM based on batch size: Triton for prefill (M>5), HIP wvSplitK for decode (M<=5). Both read from the same shuffle-packed `[E, N, K//8]` int32 weights; no weight duplication.

The Triton path adds `use_shuffle_w4a16` to `fused_moe_kernel_gptq_awq`, which unpacks ExLlama-shuffled int32 via `tl.interleave`, then extracts nibbles with shift+mask. Scales are `[E, N, K//G]`, symmetric only.

Weight processing converts GPTQ `[E, K/8, N]` to skinny `[E, N, K//8]` with ExLlama shuffle packing at load time. Enabled by default on ROCm via `VLLM_MOE_HYBRID_W4A16=true`.

Qwen3-Omni-30B-A3B AWQ on Strix Halo (vs. exllama baseline):

- TPOT: 14.51ms → 13.73ms (-5.4%)
- TTFT: 996ms → 841ms (-15.6%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
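The batch-size dispatch described above can be sketched as a small selector (a hypothetical free function for illustration; in the actual PR the branch lives inside the fused MoE layer, and both callables consume the same shuffle-packed weights):

```python
from typing import Callable

# Threshold from the commit message: M <= 5 is treated as decode.
M_DECODE_MAX = 5

def select_moe_gemm(m: int,
                    triton_path: Callable,
                    hip_wvsplitk_path: Callable) -> Callable:
    """Pick the MoE INT4 GEMM backend by batch size M.

    Hypothetical helper: Triton handles prefill (M > 5), the HIP
    wvSplitK kernel handles decode (M <= 5). No weight repacking is
    needed when switching, since both backends read the same skinny
    [E, N, K//8] int32 layout.
    """
    return hip_wvsplitk_path if m <= M_DECODE_MAX else triton_path
```

The key design point is that the threshold only selects a kernel, never a weight format, so switching between prefill and decode is free.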
Wrap the two `fused_moe_wvSplitK_int4_gemm` calls on the HIP decode path with `record_function_or_nullcontext` so profiler traces attribute time to named events of the form `fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>`, matching the convention used by the GPTQ/AWQ and conch MoE paths. Made-with: Cursor
Force-pushed from d4ed7ea to 39132b5
Force-pushed from 39132b5 to f898e10
Summary
Wrap the two `fused_moe_wvSplitK_int4_gemm` calls on the HIP decode path of `HybridW4A16MoEExperts` with `record_function_or_nullcontext` so profiler traces attribute time to named events of the form:

`fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>`
This matches the convention already used by the GPTQ/AWQ and conch MoE paths (e.g. `fused_moe_gptq_awq MxNxK E=... top_k=...`), and lets the rocm-scripts profile-bandwidth summary (and other downstream regex-based tooling) pick up per-shape stats for the int4 wvSplitK MoE decode path.
No functional change; only profiler annotations are added.
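The annotation pattern can be sketched as follows. This assumes `record_function_or_nullcontext` takes a single name string and returns either a `torch.profiler.record_function` context or a no-op; the `enabled` parameter and the gating logic here are illustrative, not vLLM's actual signature:

```python
from contextlib import nullcontext

from torch.profiler import record_function


def record_function_or_nullcontext(name: str, enabled: bool = True):
    """Sketch of the helper named in this PR (signature assumed).

    Returns a profiler record_function context when enabled, so the
    event shows up under `name` in profiler tables, and a no-op
    context otherwise.
    """
    return record_function(name) if enabled else nullcontext()


# Usage on the decode path (shapes illustrative):
M, N, K, E, top_k = 1, 4096, 4096, 128, 8
with record_function_or_nullcontext(
        f"fused_moe_wvsplitk_int4 {M}x{N}x{K} E={E} top_k={top_k}"):
    pass  # fused_moe_wvSplitK_int4_gemm(...) would run here
```

Because the event name embeds M, N, K, E, and top_k, regex-based tooling can aggregate per-shape timings directly from the profiler table without touching the kernel code.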
Test plan
Ran with `--profile` / `torch_profiler_record_shapes=True`; confirmed the profiler table now shows `fused_moe_wvsplitk_int4 MxNxK E=... top_k=...` entries for GEMM1 and GEMM2 of the HIP decode path.
Made with Cursor