
Fix Triton WNA16 MoE fallback for CompressedTensorsWNA16MoEMethod #875

Draft

mgehre-amd wants to merge 2 commits into gfx11 from matthias.ttft-optimization

Conversation


@mgehre-amd mgehre-amd commented Apr 15, 2026

The Triton WNA16 MoE path (triggered by VLLM_MOE_GPTQ_EXLLAMA=false) crashed with AttributeError because moe_mk and moe_quant_config were never initialized in __init__. These attributes are only set in the exllama and AWQ GEMV code paths, but the apply() method checks self.moe_mk and falls through to the Triton fused_experts() path, which requires self.moe_quant_config.

I probably broke this when integrating the new exllama kernel; this change restores the previous behavior.
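To make the failure mode concrete, here is a self-contained toy (not the actual vLLM classes) that mirrors the attribute names from the description and shows why the Triton fallback crashed, and what initializing the attributes in __init__ changes:

```python
# Toy reproduction only -- not vLLM code. It mirrors the attribute names from
# the PR description (moe_mk, moe_quant_config) to show the AttributeError.

class WNA16MoEMethodBefore:
    def __init__(self):
        pass  # moe_mk / moe_quant_config were only set on the exllama/AWQ paths

    def apply(self, x):
        if self.moe_mk is not None:  # AttributeError on the Triton fallback
            return "exllama path"
        return f"triton fused_experts(quant_config={self.moe_quant_config})"


class WNA16MoEMethodAfter:
    def __init__(self):
        # The fix: always create the attributes so apply() can fall through
        # to the Triton path when no exllama/AWQ kernel was selected.
        self.moe_mk = None
        self.moe_quant_config = None

    def apply(self, x):
        if self.moe_mk is not None:
            return "exllama path"
        return f"triton fused_experts(quant_config={self.moe_quant_config})"


try:
    WNA16MoEMethodBefore().apply(x=None)
except AttributeError as err:
    print("before the fix:", err)
print("after the fix:", WNA16MoEMethodAfter().apply(x=None))
```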

@mgehre-amd mgehre-amd requested a review from eble-amd April 15, 2026 08:21
The Triton WNA16 MoE path (triggered by VLLM_MOE_GPTQ_EXLLAMA=false)
crashed with AttributeError because moe_mk and moe_quant_config were
never initialized in __init__. These attributes are only set in the
exllama and AWQ GEMV code paths, but the apply() method checks
self.moe_mk and falls through to the Triton fused_experts() path
which requires self.moe_quant_config.

On Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit (128 experts, top-8, w4a16
group_size=32), switching from ExllamaExperts to Triton WNA16 MoE
reduces TTFT from 1345ms to 929ms (31% improvement) because:
- Triton fused MoE handles gather/scatter + GEMM in a single kernel
- Eliminates atomicAdd K-tiling overhead from the exllama contiguous kernel
- Router gate GEMM drops from 8.2ms to 56us per layer

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
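For reference, one way to exercise the Triton WNA16 fallback from Python. This is a sketch: the lazy env-var read and the model path are assumptions, so adjust for your setup.

```python
import os

# Assumption: VLLM_MOE_GPTQ_EXLLAMA is read lazily through vllm.envs, so
# setting it before engine construction is enough to select the fallback.
os.environ["VLLM_MOE_GPTQ_EXLLAMA"] = "false"

from vllm import LLM

# Hypothetical model path; the numbers above were measured on a w4a16
# (group_size=32) AWQ 4-bit Qwen3-Omni-30B-A3B-Instruct checkpoint.
llm = LLM(model="path/to/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit")
print(llm.generate("Hello")[0].outputs[0].text)
```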
@mgehre-amd mgehre-amd force-pushed the matthias.ttft-optimization branch from 075bf94 to 0e3cf4c on April 15, 2026 at 08:27
These attributes are initialized to None in __init__ but later assigned
FusedMoEKernel / FusedMoEQuantConfig values. Without explicit type
annotations, mypy inferred the type as None and flagged the assignments
as incompatible.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
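A minimal sketch of the annotation pattern this commit describes; the class and attribute names come from the commit message, while the quoted forward references stand in for the real imports:

```python
from typing import Optional


class CompressedTensorsWNA16MoEMethod:
    def __init__(self) -> None:
        # Without explicit Optional[...] annotations, mypy infers the type of
        # both attributes as None and rejects the later assignments of
        # FusedMoEKernel / FusedMoEQuantConfig instances.
        self.moe_mk: Optional["FusedMoEKernel"] = None
        self.moe_quant_config: Optional["FusedMoEQuantConfig"] = None
```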
Comment thread vllm/envs.py
logger = logging.getLogger(__name__)


def _is_rdna_for_moe_default() -> bool:
mgehre-amd (Author)
Remove this and the env variable and do proper is_compatible checks in compressed_tensor_moe.py
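One possible shape for such a check, purely as a sketch: the helper below does not exist in vLLM, and the RDNA detection via gcnArchName is an assumption about ROCm builds of PyTorch.

```python
import torch
from vllm.platforms import current_platform


def exllama_moe_is_compatible() -> bool:
    """Hypothetical helper: prefer the Triton WNA16 fallback on RDNA GPUs
    instead of steering the default through an environment variable."""
    if not current_platform.is_rocm():
        return True
    # On ROCm builds of PyTorch, gcnArchName reports e.g. "gfx1100" (RDNA3).
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return not arch.startswith("gfx11")
```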

@pytest.mark.parametrize("e,topk", [(8, 2), (16, 4)])
def test_triton_wna16_moe(m: int, n: int, k: int, e: int, topk: int):
"""Test the Triton WNA16 MoE fallback path (VLLM_MOE_GPTQ_EXLLAMA=false).

mgehre-amd (Author)
Also test CompressedTensorsWNA16MoEMethod.apply()

