Support group_size=64 in HybridW4A16 and wvSplitK_int4_g #905
Open
mgehre-amd wants to merge 3 commits into gfx11
Conversation
The HIP wvSplitK_int4_g C++ kernel only supported group_size 32 and 128, but HybridW4A16LinearKernel accepted 32, 64, 128, and 256. When a model using group_size=64 (e.g. RedHatAI/Qwen3-1.7B-quantized.w4a16) hit the decode path, the C++ kernel rejected it at runtime.

The kernel template already handles arbitrary group sizes that are multiples of A_CHUNK (16), so the fix extends the TORCH_CHECK and the WVSPLIT_INT4G_GS dispatch macro to include 64. SUPPORTED_GROUP_SIZES is narrowed to [32, 64, 128] so there is no mismatch between what can_implement accepts and what the C++ kernel supports.

Build time impact: skinny_gemms_int4.hip.o compile time increases from 158s to 233s (+47%) due to the additional template instantiations for group_size=64.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
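For reference, a minimal sketch of what the dispatch-side change might look like. TORCH_CHECK and the macro name WVSPLIT_INT4G_GS are taken from the PR description; the macro body, the launcher, and the function signatures below are illustrative placeholders, not the actual vLLM source.

```cpp
// Sketch only: macro body, launcher, and signatures are placeholders.
#include <torch/all.h>

// Placeholder launcher; the group size must be a compile-time constant
// because the HIP kernel is templated on it.
template <int GROUP_SIZE>
void launch_wvSplitK_int4_g(const at::Tensor& in_a, const at::Tensor& in_b,
                            at::Tensor& out_c) {
  // ... configure grid/block and launch the templated kernel here ...
}

// Map the runtime group_size onto a compile-time template argument.
// The PR adds the `case 64` branch next to the existing 32 and 128.
#define WVSPLIT_INT4G_GS(GS, ...)                                 \
  switch (GS) {                                                   \
    case 32:  launch_wvSplitK_int4_g<32>(__VA_ARGS__);  break;    \
    case 64:  launch_wvSplitK_int4_g<64>(__VA_ARGS__);  break;    \
    case 128: launch_wvSplitK_int4_g<128>(__VA_ARGS__); break;    \
  }

void wvSplitK_int4_g(const at::Tensor& in_a, const at::Tensor& in_b,
                     at::Tensor& out_c, int64_t group_size) {
  // Runtime guard widened from {32, 128} to {32, 64, 128}.
  TORCH_CHECK(group_size == 32 || group_size == 64 || group_size == 128,
              "wvSplitK_int4_g: unsupported group_size ", group_size);
  WVSPLIT_INT4G_GS(group_size, in_a, in_b, out_c);
}
```

Each new `case` forces another full set of kernel template instantiations, which is consistent with the reported growth in skinny_gemms_int4.hip.o compile time.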
eble-amd reviewed on Apr 27, 2026
The main wvSplitK_int4_g function was updated for group_size=64, but the two VLLM_SKINNY_GEMM_SWEEP sweep variants (wvSplitK_int4g_sweep and wvSplitK_int4g_hf_sweep) still had hard-coded 32/128 checks and dispatch; this commit extends them the same way. It also updates the docstring on wvSplitK_int4_g.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
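Applied to the sweep variants, the change might look like the following sketch. The function names come from the commit message; the signatures are placeholders, and WVSPLIT_INT4G_GS refers to the placeholder macro in the earlier sketch.

```cpp
// Sketch only: both sweep entry points (compiled only when
// VLLM_SKINNY_GEMM_SWEEP is defined) get the same widened guard and
// group-size dispatch as the main function. Signatures are illustrative.
#ifdef VLLM_SKINNY_GEMM_SWEEP
void wvSplitK_int4g_sweep(const at::Tensor& in_a, const at::Tensor& in_b,
                          at::Tensor& out_c, int64_t group_size) {
  // Was: group_size == 32 || group_size == 128
  TORCH_CHECK(group_size == 32 || group_size == 64 || group_size == 128,
              "wvSplitK_int4g_sweep: unsupported group_size ", group_size);
  WVSPLIT_INT4G_GS(group_size, in_a, in_b, out_c);
}

void wvSplitK_int4g_hf_sweep(const at::Tensor& in_a, const at::Tensor& in_b,
                             at::Tensor& out_c, int64_t group_size) {
  TORCH_CHECK(group_size == 32 || group_size == 64 || group_size == 128,
              "wvSplitK_int4g_hf_sweep: unsupported group_size ", group_size);
  WVSPLIT_INT4G_GS(group_size, in_a, in_b, out_c);
}
#endif  // VLLM_SKINNY_GEMM_SWEEP
```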
Resolve conflict in skinny_gemms_int4.cu: gfx11 moved the dispatch macros to file scope (shared with MoE); apply the group_size=64 and N=1 tuning changes to both the regular and MoE macro sets.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
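To illustrate why the resolution touches two places, here is a sketch of the file-scope layout the commit describes. Only WVSPLIT_INT4G_GS is named in this PR; MOE_WVSPLIT_INT4G_GS and its launcher are hypothetical names standing in for the MoE macro set.

```cpp
// Sketch only: on gfx11 the dispatch macros live at file scope, and a
// parallel set serves the MoE path, so the new `case 64` has to be added
// to both sets. The MoE names below are hypothetical.
template <int GROUP_SIZE>
void launch_moe_wvSplitK_int4_g(const at::Tensor& in_a,
                                const at::Tensor& in_b, at::Tensor& out_c);

// Without the matching `case 64` here, MoE calls with group_size=64 would
// fall through the switch and launch nothing.
#define MOE_WVSPLIT_INT4G_GS(GS, ...)                                 \
  switch (GS) {                                                       \
    case 32:  launch_moe_wvSplitK_int4_g<32>(__VA_ARGS__);  break;    \
    case 64:  launch_moe_wvSplitK_int4_g<64>(__VA_ARGS__);  break;    \
    case 128: launch_moe_wvSplitK_int4_g<128>(__VA_ARGS__); break;    \
  }
```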
eble-amd approved these changes on Apr 28, 2026