Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151#909
Open
mgehre-amd wants to merge 1 commit into gfx11 from
Conversation
Two changes to close the bf16/fp16 decode throughput gap on RDNA 3.5:

1. Use the v_dot2_f32_bf16 intrinsic in DOT2C instead of the scalar bf16->fp32->mul->add chain.
2. Accumulator bias correction: instead of per-element bf16 bias subtraction (which requires an fp32 round-trip on gfx1151, ~400 instructions), use magic values (nibble | 0x4300) directly in DOT2C and correct the accumulator afterward with bias * sum(activations) (~24 instructions).

Both the dense and MoE int4 kernels are updated (wvSplitK_int4_compute_sml_ and wvSplitK_int4_compute_ share the same optimization).

bf16 MoE decode: 65.7 → 76.5 tok/s (gap vs fp16 reduced from 17.4% to 0.8%).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Optimize bf16 wvSplitK_int4 dequant and DOT2C for gfx1151
Summary
Two changes to close the bf16/fp16 decode throughput gap for int4 quantized models on RDNA 3.5 (gfx1151):

1. v_dot2_f32_bf16 intrinsic in DOT2C: Replaces the scalar bf16->fp32->mul->add chain with the native __builtin_amdgcn_fdot2_f32_bf16 hardware dot instruction.
2. Accumulator bias correction: Instead of per-element bf16 bias subtraction during dequant (which requires an fp32 round-trip on gfx1151 since there is no native v_pk_sub_bf16), use the magic values (nibble | 0x4300) directly in DOT2C and correct the accumulator afterward with bias * sum(activations). This replaces ~400 dequant instructions with ~24 dot2+fma instructions per inner loop iteration.

Both dense (wvSplitK_int4_compute_sml_, wvSplitK_int4_compute_) and MoE kernel paths are updated.

Benchmark Results (Strix Halo, gfx1151)
Benchmark config: --input-len 1 --output-len 128 --num-prompts 3

Validation
How it works
On gfx1151, bf16 int4 dequant previously used a scalar path that extracted each nibble, cast to int, subtracted the bias, and cast to bf16, generating ~227 v_bfe_u32 + v_cvt instructions per inner loop.

The magic-number trick packs each 4-bit nibble into a bf16 value representing 128 + nibble using a single OR with 0x4300 (the bf16 encoding of 128.0). These magic values are fed directly into v_dot2_f32_bf16 dot products.

Since the dot product now computes sum(a_i * (128 + w_i)) instead of sum(a_i * (w_i - 8)), we correct the accumulator afterward.

The sum(activations) is computed with the same DOT2C instruction using a vector of bf16 1.0 values (0x3F803F80), adding only ~24 instructions vs the ~400 eliminated.

This optimization does not apply to fp16, which already has a native v_pk_sub_f16 (4-byte VOP2, dual-issue capable).