Use fp16 fdot2 for bf16 int4 GEMV dequant on RDNA 3.5 by mgehre-amd · Pull Request #877 · ROCm/vllm

mgehre-amd · 2026-04-15T10:45:26Z

Replace the scalar bf16→f32 dequant path in wvSplitK_int4 with an optimized path that dequantizes INT4 weights to fp16 using the bit-trick (same as the fp16 path), converts bf16 activations to fp16, then uses the native __builtin_amdgcn_fdot2 instruction for the dot product.

The new path needs ~2.5x fewer ALU ops than the scalar bf16 conversion path.
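The three steps described above (bit-trick dequant, bf16→fp16 activation conversion, 2-element dot product) can be sketched on the host to show the bit-level arithmetic. This is a minimal Python emulation under stated assumptions, not the kernel code from this PR. The magic constant `0x6400` is the widely used fp16 bit-trick and is assumed to be what "the bit-trick" refers to. The zero point of 8, the low-nibble-first packing, the single per-row `scale`, and all function names are hypothetical. `fdot2` here only mimics the multiply-add shape of `__builtin_amdgcn_fdot2` in Python double precision and ignores f32 rounding and the builtin's clamp argument.

```python
import struct

FP16_MAGIC = 0x6400  # fp16 bit pattern of 1024.0; its low mantissa bits are zero


def fp16_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (raises on overflow)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def bf16_from_bits(bits: int) -> float:
    """bf16 is the upper 16 bits of an f32, so widening is a 16-bit shift."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def dequant_int4_to_fp16(nibble: int, zero_point: int = 8) -> float:
    """OR the nibble into the mantissa of 1024.0, then subtract the offset.

    0x6400 | v decodes to exactly 1024 + v for v in 0..15, so one OR and one
    fp16 subtract stand in for an integer-to-float conversion.
    zero_point=8 is an assumption about the quantization scheme.
    """
    biased = fp16_from_bits(FP16_MAGIC | (nibble & 0xF))
    return round_to_fp16(biased - (1024.0 + zero_point))


def fdot2(a, b, acc: float) -> float:
    """Emulate a 2-element dot product accumulated into acc."""
    return acc + a[0] * b[0] + a[1] * b[1]


def dot_row(packed: bytes, act_bf16_bits, scale: float) -> float:
    """Dot one packed INT4 weight row with bf16 activations.

    Assumes two weights per byte, low nibble first (hypothetical layout).
    """
    acc = 0.0
    for i, byte in enumerate(packed):
        w = (dequant_int4_to_fp16(byte & 0xF), dequant_int4_to_fp16(byte >> 4))
        x = (
            round_to_fp16(bf16_from_bits(act_bf16_bits[2 * i])),
            round_to_fp16(bf16_from_bits(act_bf16_bits[2 * i + 1])),
        )
        acc = fdot2(w, x, acc)
    return acc * scale
```

For example, `dot_row(bytes([0x9A]), [0x3F80, 0x4000], 0.5)` dequantizes the nibbles 10 and 9 to 2 and 1, pairs them with the activations 1.0 and 2.0, and scales the sum. Note that fp16 has a narrower exponent range than bf16 (maximum finite value 65504), so the activation conversion is only exact in range for moderately sized values; how the kernel treats out-of-range activations is not stated in the description.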

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

mgehre-amd wants to merge 1 commit into gfx11 from matthias.bf16-fdot2-dequant
