Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151 #909

Open

mgehre-amd wants to merge 1 commit into gfx11 from matthias.bf16-bias-correction

Conversation


@mgehre-amd mgehre-amd commented Apr 29, 2026

Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151

Summary

Two changes to close the bf16/fp16 decode throughput gap for int4 quantized models on RDNA 3.5 (gfx1151):

  1. v_dot2_f32_bf16 intrinsic in DOT2C: Replaces the scalar bf16->fp32->mul->add chain with the native __builtin_amdgcn_fdot2_f32_bf16 hardware dot instruction (a hedged sketch follows below this list).

  2. Accumulator bias correction: Instead of per-element bf16 bias subtraction during dequant (which requires an fp32 round-trip on gfx1151 since there is no native v_pk_sub_bf16), use the magic values (nibble | 0x4300) directly in DOT2C and correct the accumulator afterward with bias * sum(activations). This replaces ~400 dequant instructions with ~24 dot2+fma instructions per inner loop iteration.

Both dense (wvSplitK_int4_compute_sml_, wvSplitK_int4_compute_) and MoE kernel paths are updated.
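
For orientation, here is a minimal sketch of how a DOT2C-style helper can wrap the builtin from change (1). It is illustrative only, not the exact kernel code: the packed operand is assumed to be a two-element __bf16 ext-vector (named bf16x2 here), while older compilers expose the same builtin with a short2-style signature, so check the ROCm/clang headers in use.

```cpp
// Hedged sketch: packed bf16 pair feeding v_dot2_f32_bf16 on gfx11.
// bf16x2 is an illustrative alias; the kernel may use its own packed type or
// reinterpret a 32-bit word that holds two bf16 values.
typedef __bf16 bf16x2 __attribute__((ext_vector_type(2)));

// acc += a[0] * b[0] + a[1] * b[1], issued as a single hardware dot instruction.
__device__ __forceinline__ float dot2c_bf16(float acc, bf16x2 a, bf16x2 b) {
  return __builtin_amdgcn_fdot2_f32_bf16(a, b, acc, /*clamp=*/false);
}
```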

Benchmark Results (Strix Halo, gfx1151)

| Model | Type | Before | After | Speedup |
|---|---|---|---|---|
| RedHatAI/Qwen3-4B-quantized.w4a16 | Dense (bf16, sym, gs=128) | 66.9 tok/s | 75.4 tok/s | +12.7% |
| cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit | MoE (bf16, sym, gs=32) | 67.4 tok/s | 75.0 tok/s | +11.3% |

Benchmark config: --input-len 1 --output-len 128 --num-prompts 3

Validation

  • Exhaustive dequant test: all 32 nibble x bias combos bit-exact
  • Dense kernel test: 0.00 relative error vs fp32 reference (8 shapes)
  • GSM8K --limit 200: strict-match 86.5% before and after (identical)

How it works

On gfx1151, bf16 int4 dequant previously used a scalar path that extracted each nibble, cast it to int, subtracted the bias, and cast it back to bf16 -- generating ~227 v_bfe_u32 + v_cvt instructions per inner loop iteration.

The magic-number trick packs each 4-bit nibble into a bf16 value representing 128 + nibble using a single OR with 0x4300 (the bf16 encoding of 128.0). These magic values are fed directly into v_dot2_f32_bf16 dot products.
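
The trick can be checked in isolation with plain host-side C++ (a standalone sketch; bf16_to_float is a helper written here for illustration, not kernel code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Decode a bf16 bit pattern by placing it in the high 16 bits of an fp32.
static float bf16_to_float(uint16_t bits) {
  uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

int main() {
  // 0x4300 encodes 128.0 in bf16. OR-ing a 4-bit nibble into the low mantissa
  // bits yields exactly 128 + nibble: the nibble occupies only the low 4 of
  // the 7 mantissa bits, so no rounding occurs.
  for (unsigned nibble = 0; nibble < 16; ++nibble) {
    unsigned magic = 0x4300u | nibble;
    std::printf("nibble=%2u  magic=0x%04X  value=%.1f\n", nibble, magic,
                bf16_to_float(static_cast<uint16_t>(magic)));  // 128.0 .. 143.0
  }
}
```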

Since the dot product now computes sum(a_i * (128 + w_i)) instead of sum(a_i * (w_i - 8)), we correct the accumulator afterward:

partial -= 136.0 * sum(activations)      // symmetric: bias = 128 + 8
partial -= (128.0 + zp) * sum(activations)  // asymmetric with zero_point

The sum(activations) is computed with the same DOT2C instruction using a vector of bf16 1.0 values (0x3F803F80), adding only ~24 instructions vs the ~400 eliminated.
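
A scalar reference of the correction, with made-up values and plain loops standing in for the packed dot products (a toy host-side sketch, not kernel code):

```cpp
#include <cstdio>

int main() {
  // Toy tile: 8 bf16-representable activations and 8 unpacked int4 weights.
  float a[8] = {0.5f, -1.25f, 2.0f, 0.75f, -0.5f, 1.5f, -2.0f, 0.25f};
  int   w[8] = {0, 3, 7, 8, 9, 12, 15, 5};

  float ref = 0.0f, dot_magic = 0.0f, sum_act = 0.0f;
  for (int i = 0; i < 8; ++i) {
    ref       += a[i] * float(w[i] - 8);    // what the kernel must ultimately produce
    dot_magic += a[i] * float(128 + w[i]);  // what the dot over magic values accumulates
    sum_act   += a[i];                      // extra dot against packed bf16 1.0 (0x3F803F80)
  }
  // One correction per accumulator: partial -= bias * sum(activations),
  // where bias = 128 + 8 = 136 in the symmetric case (128 + zp when asymmetric).
  float corrected = dot_magic - 136.0f * sum_act;
  std::printf("reference=%f  corrected=%f\n", ref, corrected);  // both print -9.000000
}
```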

This optimization does not apply to fp16, which already has native v_pk_sub_f16 (4-byte VOP2, dual-issue capable).

Commit message

Two changes to close the bf16/fp16 decode throughput gap on RDNA 3.5:

1. Use v_dot2_f32_bf16 intrinsic in DOT2C instead of scalar
   bf16->fp32->mul->add chain.

2. Accumulator bias correction: instead of per-element bf16 bias
   subtraction (requires fp32 round-trip on gfx1151, ~400 instructions),
   use magic values (nibble|0x4300) directly in DOT2C and correct the
   accumulator afterward with bias * sum(activations) (~24 instructions).

Both dense and MoE int4 kernels are updated (wvSplitK_int4_compute_sml_
and wvSplitK_int4_compute_ share the same optimization).

bf16 MoE decode: 65.7 → 76.5 tok/s (gap vs fp16 from 17.4% to 0.8%).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd requested a review from roberteg16 April 29, 2026 11:57
@mgehre-amd mgehre-amd marked this pull request as ready for review April 29, 2026 11:57
@mgehre-amd mgehre-amd requested a review from gshtras as a code owner April 29, 2026 11:57