
Fix Triton WNA16 MoE fallback for CompressedTensorsWNA16MoEMethod #875

Draft

mgehre-amd wants to merge 2 commits into gfx11 from matthias.ttft-optimization

Conversation


@mgehre-amd mgehre-amd commented Apr 15, 2026

The Triton WNA16 MoE path (triggered by VLLM_MOE_GPTQ_EXLLAMA=false) crashed with AttributeError because moe_mk and moe_quant_config were never initialized in __init__. These attributes are only set in the exllama and AWQ GEMV code paths, but the apply() method checks self.moe_mk and falls through to the Triton fused_experts() path, which requires self.moe_quant_config.

I probably broke this when integrating the new exllama kernel; this change restores the previous behavior.
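To make the failure mode concrete, here is a self-contained toy (not the actual vLLM classes) that mirrors the attribute names from the description and shows why the Triton fallback crashed, and what initializing the attributes in __init__ changes:

```python
# Toy reproduction only -- not vLLM code. It mirrors the attribute names from
# the PR description (moe_mk, moe_quant_config) to show the AttributeError.

class WNA16MoEMethodBefore:
    def __init__(self):
        pass  # moe_mk / moe_quant_config were only set on the exllama/AWQ paths

    def apply(self, x):
        if self.moe_mk is not None:  # AttributeError on the Triton fallback
            return "exllama path"
        return f"triton fused_experts(quant_config={self.moe_quant_config})"


class WNA16MoEMethodAfter:
    def __init__(self):
        # The fix: always create the attributes so apply() can fall through
        # to the Triton path when no exllama/AWQ kernel was selected.
        self.moe_mk = None
        self.moe_quant_config = None

    def apply(self, x):
        if self.moe_mk is not None:
            return "exllama path"
        return f"triton fused_experts(quant_config={self.moe_quant_config})"


try:
    WNA16MoEMethodBefore().apply(x=None)
except AttributeError as err:
    print("before the fix:", err)
print("after the fix:", WNA16MoEMethodAfter().apply(x=None))
```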

@mgehre-amd mgehre-amd requested a review from eble-amd April 15, 2026 08:21
The Triton WNA16 MoE path (triggered by VLLM_MOE_GPTQ_EXLLAMA=false)
crashed with AttributeError because moe_mk and moe_quant_config were
never initialized in __init__. These attributes are only set in the
exllama and AWQ GEMV code paths, but the apply() method checks
self.moe_mk and falls through to the Triton fused_experts() path
which requires self.moe_quant_config.

On Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit (128 experts, top-8, w4a16
group_size=32), switching from ExllamaExperts to Triton WNA16 MoE
reduces TTFT from 1345ms to 929ms (31% improvement) because:
- Triton fused MoE handles gather/scatter + GEMM in a single kernel
- Eliminates atomicAdd K-tiling overhead from the exllama contiguous kernel
- Router gate GEMM drops from 8.2ms to 56us per layer

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
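For reference, one way to exercise the Triton WNA16 fallback from Python. This is a sketch: the lazy env-var read and the model path are assumptions, so adjust for your setup.

```python
import os

# Assumption: VLLM_MOE_GPTQ_EXLLAMA is read lazily through vllm.envs, so
# setting it before engine construction is enough to select the fallback.
os.environ["VLLM_MOE_GPTQ_EXLLAMA"] = "false"

from vllm import LLM

# Hypothetical model path; the numbers above were measured on a w4a16
# (group_size=32) AWQ 4-bit Qwen3-Omni-30B-A3B-Instruct checkpoint.
llm = LLM(model="path/to/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit")
print(llm.generate("Hello")[0].outputs[0].text)
```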
@mgehre-amd mgehre-amd force-pushed the matthias.ttft-optimization branch from 075bf94 to 0e3cf4c on April 15, 2026 at 08:27
These attributes are initialized to None in __init__ but later assigned
FusedMoEKernel / FusedMoEQuantConfig values. Without explicit type
annotations, mypy inferred the type as None and flagged the assignments
as incompatible.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
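A minimal sketch of the annotation pattern this commit describes; the class and attribute names come from the commit message, while the quoted forward references stand in for the real imports:

```python
from typing import Optional


class CompressedTensorsWNA16MoEMethod:
    def __init__(self) -> None:
        # Without explicit Optional[...] annotations, mypy infers the type of
        # both attributes as None and rejects the later assignments of
        # FusedMoEKernel / FusedMoEQuantConfig instances.
        self.moe_mk: Optional["FusedMoEKernel"] = None
        self.moe_quant_config: Optional["FusedMoEQuantConfig"] = None
```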
Comment thread vllm/envs.py
logger = logging.getLogger(__name__)


def _is_rdna_for_moe_default() -> bool:
mgehre-amd (Author)
Remove this and the env variable and do proper is_compatible checks in compressed_tensor_moe.py
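One possible shape for such a check, purely as a sketch: the helper below does not exist in vLLM, and the RDNA detection via gcnArchName is an assumption about ROCm builds of PyTorch.

```python
import torch
from vllm.platforms import current_platform


def exllama_moe_is_compatible() -> bool:
    """Hypothetical helper: prefer the Triton WNA16 fallback on RDNA GPUs
    instead of steering the default through an environment variable."""
    if not current_platform.is_rocm():
        return True
    # On ROCm builds of PyTorch, gcnArchName reports e.g. "gfx1100" (RDNA3).
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return not arch.startswith("gfx11")
```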

@pytest.mark.parametrize("e,topk", [(8, 2), (16, 4)])
def test_triton_wna16_moe(m: int, n: int, k: int, e: int, topk: int):
"""Test the Triton WNA16 MoE fallback path (VLLM_MOE_GPTQ_EXLLAMA=false).

mgehre-amd (Author)
Also test CompressedTensorsWNA16MoEMethod.apply()

