Support group_size=64 in HybridW4A16 and wvSplitK_int4_g #905

Open
mgehre-amd wants to merge 3 commits into gfx11 from matthias.fix-group-size-64

@mgehre-amd mgehre-amd commented Apr 27, 2026

The HIP wvSplitK_int4_g C++ kernel only supported group_size 32 and 128, but HybridW4A16LinearKernel accepted 32, 64, 128, and 256. When a model using group_size=64 (e.g. RedHatAI/Qwen3-1.7B-quantized.w4a16) hit the decode path, the C++ kernel rejected it at runtime.

The kernel template already handles arbitrary group sizes that are multiples of A_CHUNK (16), so the fix extends the TORCH_CHECK and the WVSPLIT_INT4G_GS dispatch macro to include 64. SUPPORTED_GROUP_SIZES is narrowed to [32, 64, 128] so there is no mismatch between what can_implement accepts and what the C++ kernel supports.
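
For illustration, the mismatch between the two gates can be modeled in a few lines of Python. This is only a sketch under assumptions: `KERNEL_GROUP_SIZES`, `kernel_accepts`, and `mismatches` are hypothetical names standing in for the TORCH_CHECK and the WVSPLIT_INT4G_GS dispatch, and the real `can_implement` takes a different signature; only `SUPPORTED_GROUP_SIZES` and `A_CHUNK` are names from this PR.

```python
# Hypothetical model of the two group-size gates. Only SUPPORTED_GROUP_SIZES
# and A_CHUNK are names from the PR; everything else is illustrative.
A_CHUNK = 16  # chunk width the kernel template works in

# Python side after this PR: what can_implement accepts.
SUPPORTED_GROUP_SIZES = [32, 64, 128]

# Assumed stand-in for the sizes the C++ dispatch instantiates after this PR.
KERNEL_GROUP_SIZES = (32, 64, 128)


def kernel_accepts(group_size):
    """Model of the C++ check: only instantiated group sizes pass."""
    return group_size in KERNEL_GROUP_SIZES


def can_implement(group_size):
    """Simplified model of the Python-side gate: (ok, reason)."""
    if group_size not in SUPPORTED_GROUP_SIZES:
        return False, f"group_size={group_size} not in {SUPPORTED_GROUP_SIZES}"
    return True, None


def mismatches(python_sizes, kernel_sizes):
    """Sizes the Python side accepts but the kernel would reject at runtime."""
    return sorted(set(python_sizes) - set(kernel_sizes))


# Before this PR: Python accepted 32/64/128/256, the kernel only 32/128.
before = mismatches([32, 64, 128, 256], [32, 128])
# After this PR: both gates agree.
after = mismatches(SUPPORTED_GROUP_SIZES, KERNEL_GROUP_SIZES)
```

In this model `before` lists the sizes that would pass the Python gate and then fail in the kernel, while `after` is empty. A check of this shape could serve as a unit test to keep the two lists from drifting apart again.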

Build time impact: skinny_gemms_int4.hip.o compile time increases from 158s to 233s (+47%) due to the additional template instantiations for group_size=64.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd requested a review from gshtras as a code owner April 27, 2026 12:33
@mgehre-amd mgehre-amd removed the request for review from gshtras April 27, 2026 12:33

@roberteg16 roberteg16 left a comment

LGTM

Comment threads on csrc/rocm/skinny_gemms_int4.cu: 5 (3 outdated)

The main wvSplitK_int4_g function was updated for group_size=64 but the
two VLLM_SKINNY_GEMM_SWEEP sweep variants (wvSplitK_int4g_sweep and
wvSplitK_int4g_hf_sweep) still had hard-coded 32/128 checks and dispatch.
Also updates the docstring on wvSplitK_int4_g.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

Resolve conflict in skinny_gemms_int4.cu: gfx11 moved dispatch macros
to file scope (shared with MoE); apply group_size=64 and N=1 tuning
changes to both the regular and MoE macro sets.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>