Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151 #909

Open

mgehre-amd wants to merge 1 commit into gfx11 from matthias.bf16-bias-correction

Conversation


@mgehre-amd mgehre-amd commented Apr 29, 2026

Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151

Summary

Two changes to close the bf16/fp16 decode throughput gap for int4 quantized models on RDNA 3.5 (gfx1151):

  1. v_dot2_f32_bf16 intrinsic in DOT2C: Replaces the scalar bf16->fp32->mul->add chain with the native __builtin_amdgcn_fdot2_f32_bf16 hardware dot instruction (a hedged sketch follows below this list).

  2. Accumulator bias correction: Instead of per-element bf16 bias subtraction during dequant (which requires an fp32 round-trip on gfx1151 since there is no native v_pk_sub_bf16), use the magic values (nibble | 0x4300) directly in DOT2C and correct the accumulator afterward with bias * sum(activations). This replaces ~400 dequant instructions with ~24 dot2+fma instructions per inner loop iteration.

Both dense (wvSplitK_int4_compute_sml_, wvSplitK_int4_compute_) and MoE kernel paths are updated.
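
For orientation, here is a minimal sketch of how a DOT2C-style helper can wrap the builtin from change (1). It is illustrative only, not the exact kernel code: the packed operand is assumed to be a two-element __bf16 ext-vector (named bf16x2 here), while older compilers expose the same builtin with a short2-style signature, so check the ROCm/clang headers in use.

```cpp
// Hedged sketch: packed bf16 pair feeding v_dot2_f32_bf16 on gfx11.
// bf16x2 is an illustrative alias; the kernel may use its own packed type or
// reinterpret a 32-bit word that holds two bf16 values.
typedef __bf16 bf16x2 __attribute__((ext_vector_type(2)));

// acc += a[0] * b[0] + a[1] * b[1], issued as a single hardware dot instruction.
__device__ __forceinline__ float dot2c_bf16(float acc, bf16x2 a, bf16x2 b) {
  return __builtin_amdgcn_fdot2_f32_bf16(a, b, acc, /*clamp=*/false);
}
```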

Benchmark Results (Strix Halo, gfx1151)

| Model | Type | Before | After | Speedup |
|---|---|---|---|---|
| RedHatAI/Qwen3-4B-quantized.w4a16 | Dense (bf16, sym, gs=128) | 66.9 tok/s | 75.4 tok/s | +12.7% |
| cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit | MoE (bf16, sym, gs=32) | 67.4 tok/s | 75.0 tok/s | +11.3% |

Benchmark config: --input-len 1 --output-len 128 --num-prompts 3

Validation

  • Exhaustive dequant test: all 32 nibble x bias combos bit-exact
  • Dense kernel test: 0.00 relative error vs fp32 reference (8 shapes)
  • GSM8K --limit 200: strict-match 86.5% before and after (identical)

How it works

On gfx1151, bf16 int4 dequant previously used a scalar path that extracted each nibble, cast it to int, subtracted the bias, and cast it back to bf16 -- generating ~227 v_bfe_u32 + v_cvt instructions per inner loop iteration.

The magic-number trick packs each 4-bit nibble into a bf16 value representing 128 + nibble using a single OR with 0x4300 (the bf16 encoding of 128.0). These magic values are fed directly into v_dot2_f32_bf16 dot products.
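
The trick can be checked in isolation with plain host-side C++ (a standalone sketch; bf16_to_float is a helper written here for illustration, not kernel code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Decode a bf16 bit pattern by placing it in the high 16 bits of an fp32.
static float bf16_to_float(uint16_t bits) {
  uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

int main() {
  // 0x4300 encodes 128.0 in bf16. OR-ing a 4-bit nibble into the low mantissa
  // bits yields exactly 128 + nibble: the nibble occupies only the low 4 of
  // the 7 mantissa bits, so no rounding occurs.
  for (unsigned nibble = 0; nibble < 16; ++nibble) {
    unsigned magic = 0x4300u | nibble;
    std::printf("nibble=%2u  magic=0x%04X  value=%.1f\n", nibble, magic,
                bf16_to_float(static_cast<uint16_t>(magic)));  // 128.0 .. 143.0
  }
}
```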

Since the dot product now computes sum(a_i * (128 + w_i)) instead of sum(a_i * (w_i - 8)), we correct the accumulator afterward:

partial -= 136.0 * sum(activations)      // symmetric: bias = 128 + 8
partial -= (128.0 + zp) * sum(activations)  // asymmetric with zero_point

The sum(activations) is computed with the same DOT2C instruction using a vector of bf16 1.0 values (0x3F803F80), adding only ~24 instructions vs the ~400 eliminated.
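
A scalar reference of the correction, with made-up values and plain loops standing in for the packed dot products (a toy host-side sketch, not kernel code):

```cpp
#include <cstdio>

int main() {
  // Toy tile: 8 bf16-representable activations and 8 unpacked int4 weights.
  float a[8] = {0.5f, -1.25f, 2.0f, 0.75f, -0.5f, 1.5f, -2.0f, 0.25f};
  int   w[8] = {0, 3, 7, 8, 9, 12, 15, 5};

  float ref = 0.0f, dot_magic = 0.0f, sum_act = 0.0f;
  for (int i = 0; i < 8; ++i) {
    ref       += a[i] * float(w[i] - 8);    // what the kernel must ultimately produce
    dot_magic += a[i] * float(128 + w[i]);  // what the dot over magic values accumulates
    sum_act   += a[i];                      // extra dot against packed bf16 1.0 (0x3F803F80)
  }
  // One correction per accumulator: partial -= bias * sum(activations),
  // where bias = 128 + 8 = 136 in the symmetric case (128 + zp when asymmetric).
  float corrected = dot_magic - 136.0f * sum_act;
  std::printf("reference=%f  corrected=%f\n", ref, corrected);  // both print -9.000000
}
```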

This optimization does not apply to fp16, which already has native v_pk_sub_f16 (4-byte VOP2, dual-issue capable).

Commit message

Two changes to close the bf16/fp16 decode throughput gap on RDNA 3.5:

1. Use v_dot2_f32_bf16 intrinsic in DOT2C instead of scalar
   bf16->fp32->mul->add chain.

2. Accumulator bias correction: instead of per-element bf16 bias
   subtraction (requires fp32 round-trip on gfx1151, ~400 instructions),
   use magic values (nibble|0x4300) directly in DOT2C and correct the
   accumulator afterward with bias * sum(activations) (~24 instructions).

Both dense and MoE int4 kernels are updated (wvSplitK_int4_compute_sml_
and wvSplitK_int4_compute_ share the same optimization).

bf16 MoE decode: 65.7 → 76.5 tok/s (gap vs fp16 from 17.4% to 0.8%).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd requested a review from roberteg16 April 29, 2026 11:57
@mgehre-amd mgehre-amd marked this pull request as ready for review April 29, 2026 11:57
@mgehre-amd mgehre-amd requested a review from gshtras as a code owner April 29, 2026 11:57