
Add profiler record_function annotation to HybridW4A16MoEExperts decode#896

Open
roberteg16 wants to merge 3 commits into matthias.moe-perf-opt from
rogarcia.hybrid_w4a16_moe_record_function

Conversation

@roberteg16 roberteg16 commented Apr 24, 2026

Summary

Wrap the two fused_moe_wvSplitK_int4_gemm calls on the HIP decode path of
HybridW4A16MoEExperts with record_function_or_nullcontext so profiler
traces attribute time to named events of the form:

fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>

This matches the convention already used by the GPTQ/AWQ and conch MoE
paths (e.g. fused_moe_gptq_awq MxNxK E=... top_k=...), and lets the
rocm-scripts profile-bandwidth summary (and other downstream regex-based
tooling) pick up per-shape stats for the int4 wvSplitK MoE decode path.

No functional change; only profiler annotations are added.
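The pattern described above can be sketched in plain Python. This is a hypothetical reconstruction, not vLLM's actual `record_function_or_nullcontext` helper: it shows the general idea of returning a named profiler region when profiling is on and a free no-op context otherwise, plus an event-name builder matching the `<kernel> MxNxK E=... top_k=...` convention. The `record_function_cls` parameter (e.g. `torch.profiler.record_function`) is an assumption for illustration.

```python
from contextlib import nullcontext

def record_function_or_nullcontext(name, profiling_enabled, record_function_cls=None):
    # Hypothetical sketch: when profiling, open a named region via the
    # injected profiler context class (e.g. torch.profiler.record_function);
    # otherwise return a no-op context so the hot path pays nothing.
    if profiling_enabled and record_function_cls is not None:
        return record_function_cls(name)
    return nullcontext()

def moe_event_name(kernel, m, n, k, num_experts, top_k):
    # Build the event name in the MxNxK E=<num_experts> top_k=<k> convention
    # so downstream regex-based tooling can extract per-shape stats.
    return f"{kernel} {m}x{n}x{k} E={num_experts} top_k={top_k}"
```

With names formatted this way, a single regex can pull shape, expert count, and top-k out of every MoE GEMM event in a trace.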

Test plan

  • Ran vLLM with --profile/torch_profiler_record_shapes=True;
    confirmed the profiler table now shows
    fused_moe_wvsplitk_int4 MxNxK E=... top_k=... entries for GEMM1
    and GEMM2 of the HIP decode path.

Made with Cursor

mgehre-amd and others added 3 commits April 15, 2026 04:44
Adds fused_moe_wvSplitK_int4_gemm that dispatches expert blocks via
blockIdx.y on-device, eliminating host-side loops and GPU-CPU sync.
Weights are in skinny layout [E, N, K//8] int32 (ExLlama shuffle).

Key optimizations for RDNA 3.5 decode (batch=1):
- Use all CUs per expert block for maximum bandwidth
- YTILE=2 for N=1 decode (better occupancy than YTILE=1 or 4)
- Reduced LDS allocation (16KB vs 64KB) for higher occupancy
- Non-temporal weight loads to avoid L1 pollution
- Scattered mode with sorted_token_ids for decode without
  pre-permutation

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Dispatch MoE INT4 GEMM based on batch size: Triton for prefill (M>5),
HIP wvSplitK for decode (M<=5). Both read from the same shuffle-packed
[E, N, K//8] int32 weights — no duplication.
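The batch-size dispatch described above can be sketched as a small selector. This is a hedged illustration, not the real vLLM code path; the function name and string return values are invented, but the M<=5 / M>5 split mirrors the commit description.

```python
def select_moe_int4_gemm(num_tokens: int, decode_threshold: int = 5) -> str:
    # Hypothetical dispatcher: small batches (decode) go to the HIP
    # wvSplitK kernel, larger batches (prefill) to the Triton kernel.
    # Both backends read the same shuffle-packed [E, N, K//8] weights.
    if num_tokens <= decode_threshold:
        return "hip_wvsplitk"
    return "triton"
```

Keeping both kernels on one weight layout is what makes the dispatch free: no repacking happens at the prefill/decode boundary.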

The Triton path adds use_shuffle_w4a16 to fused_moe_kernel_gptq_awq
which unpacks ExLlama-shuffled int32 via tl.interleave, then extracts
nibbles with shift+mask. Scales are [E, N, K//G], symmetric only.
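The shift+mask nibble extraction can be illustrated in scalar Python. This sketch only shows the final unpack step; the real Triton kernel first applies `tl.interleave` to the ExLlama-shuffled words, which is omitted here.

```python
def unpack_int4_nibbles(packed_word: int):
    # Extract eight unsigned 4-bit values from one 32-bit word,
    # least-significant nibble first, via shift and mask.
    return [(packed_word >> (4 * i)) & 0xF for i in range(8)]
```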

Weight processing converts GPTQ [E, K/8, N] to skinny [E, N, K//8]
with ExLlama shuffle packing at load time. Enabled by default on ROCm
via VLLM_MOE_HYBRID_W4A16=true.
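The layout conversion above amounts to transposing the last two dimensions of the packed weight tensor. A minimal pure-Python sketch, using nested lists instead of tensors; the real conversion also applies the ExLlama nibble shuffle, which is omitted here:

```python
def repack_gptq_to_skinny(qweight):
    # qweight: nested lists shaped [E][K//8][N] of packed int32 words
    # (GPTQ packs along K). Skinny layout transposes the last two dims
    # to [E][N][K//8] so each output row reads contiguously along K.
    return [
        [[expert[kk][n] for kk in range(len(expert))]
         for n in range(len(expert[0]))]
        for expert in qweight
    ]
```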

Qwen3-Omni-30B-A3B AWQ on Strix Halo (vs exllama baseline):
  TPOT: 14.51ms → 13.73ms (-5.4%)
  TTFT: 996ms → 841ms (-15.6%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Wrap the two fused_moe_wvSplitK_int4_gemm calls on the HIP decode path
with record_function_or_nullcontext so profiler traces attribute time to
named events of the form
  fused_moe_wvsplitk_int4 MxNxK E=<num_experts> top_k=<k>
matching the convention used by the GPTQ/AWQ and conch MoE paths.

Made-with: Cursor
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger additional CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mgehre-amd mgehre-amd force-pushed the matthias.moe-perf-opt branch from d4ed7ea to 39132b5 Compare April 27, 2026 10:23
@mgehre-amd mgehre-amd requested a review from gshtras as a code owner April 27, 2026 10:23
@mgehre-amd mgehre-amd force-pushed the matthias.moe-perf-opt branch from 39132b5 to f898e10 Compare April 27, 2026 12:34
