
NVFP4 recipe with GEMM via BF16 dequant #518

Open

matthiasdiener wants to merge 100 commits into dev from mdiener/nvfp4-gemm

Conversation

@matthiasdiener (Contributor) commented Apr 2, 2026

Description

Part of https://github.com/ROCm/frameworks-internal/issues/15682

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Micky774 and others added 30 commits March 27, 2026 09:27
Remove TODO regarding userbuffers
Userbuffer Enablement for ROCm
* Update Dockerfile to use ROCm TheRock
* Update wheels building script to work with ROCm TheRock and the latest Manylinux image
* Support default ROCm location /opt/rocm/core
* Fix UB code build on TheRock
* Support comma separated list of target GPU architectures
* Guess ROCm build from HIP_PLATFORM
Comment thread: transformer_engine/common/hadamard_transform/wht16.cuh
Comment thread: tests/pytorch/nvfp4/test_nvfp4_gemm_exact.py
Comment thread: transformer_engine/pytorch/cpp_extensions/gemm.py
Comment thread: transformer_engine/common/gemm/rocm_gemm.cu (outdated)
@matthiasdiener requested a review from ipanfilo on April 24, 2026 03:47
Comment thread: tests/cpp/operator/test_cast_nvfp4_transpose.cu
Comment on lines +112 to +113
const float fp8_max = te_fp8_fnuz() ? 240.0f : 448.0f;
const float factor_inv = 1.0f / (6.0f * fp8_max);
Contributor

Same comment as above regarding using Numeric_Traits_fp8e4m3 here.

Contributor Author

Done in a08e8c5.
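For readers following along, a minimal sketch of the traits-based form the reviewer is suggesting (this assumes Numeric_Traits_fp8e4m3 exposes the FP8 maximum as a maxNorm member and already accounts for the FNUZ/OCP difference; both the member name and that behavior are assumptions, not taken from the diff):

// Sketch only: replace the hardcoded 240/448 with the traits constant.
// maxNorm is an assumed member name.
const float fp8_max = Numeric_Traits_fp8e4m3::maxNorm;
const float factor_inv = 1.0f / (6.0f * fp8_max);  // 6.0 = FP4 E2M1 max magnitude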

Comment thread: tests/pytorch/nvfp4/test_nvfp4_gemm_exact.py
Comment on lines +438 to +470
if (is_fp4_dtype(param.Atype)) {
  hip_bfloat16* a_bf16 = reinterpret_cast<hip_bfloat16*>(ws_ptr);
  ws_ptr += a_bf16_bytes;
  const int64_t total_a = static_cast<int64_t>(m) * k;
  const auto& a_sinv = (transa == CUBLAS_OP_T) ? inputA.scale_inv
                                               : inputA.columnwise_scale_inv;
  const int64_t a_num_cols = (transa == CUBLAS_OP_T)
                                 ? inputA.data.shape.back()
                                 : inputA.columnwise_data.shape.back();
  const int64_t a_scale_stride = (a_sinv.shape.size() >= 2) ? a_sinv.shape[1] : (a_num_cols / 16);
  launch_dequant_fp4_to_bf16(param.A, param.A_scale_inv, a_bf16, total_a,
                             a_num_cols, a_scale_stride, stream);
  param.A = a_bf16;
  param.Atype = DType::kBFloat16;
  param.A_scale_inv = nullptr;
}

if (is_fp4_dtype(param.Btype)) {
  hip_bfloat16* b_bf16 = reinterpret_cast<hip_bfloat16*>(ws_ptr);
  ws_ptr += b_bf16_bytes;
  const int64_t total_b = static_cast<int64_t>(k) * n;
  const auto& b_sinv = (transb == CUBLAS_OP_N) ? inputB.scale_inv
                                               : inputB.columnwise_scale_inv;
  const int64_t b_num_cols = (transb == CUBLAS_OP_N)
                                 ? inputB.data.shape.back()
                                 : inputB.columnwise_data.shape.back();
  const int64_t b_scale_stride = (b_sinv.shape.size() >= 2) ? b_sinv.shape[1] : (b_num_cols / 16);
  launch_dequant_fp4_to_bf16(param.B, param.B_scale_inv, b_bf16, total_b,
                             b_num_cols, b_scale_stride, stream);
  param.B = b_bf16;
  param.Btype = DType::kBFloat16;
  param.B_scale_inv = nullptr;
}
Contributor

Minor comment: would it make sense to factor the repeated FP4→BF16 staging logic for A/B into a small helper? The two blocks look structurally similar, aside from the operand-specific shape/layout details.

Contributor Author

Thanks, I factored this out into a lambda function in fae76d3.
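A minimal sketch of what such a lambda could look like (all names, parameter types, and the exact parameter set mirror the hunk above but are illustrative; the actual refactor in fae76d3 may differ):

// Sketch: stage one FP4 operand through the workspace as BF16.
// Captures ws_ptr and stream by reference from the enclosing scope.
auto stage_fp4_operand = [&](void*& data, void*& scale_inv, DType& dtype,
                             const Tensor& input, bool use_rowwise,
                             int64_t total_elems, size_t staged_bytes) {
  if (!is_fp4_dtype(dtype)) return;
  hip_bfloat16* staged = reinterpret_cast<hip_bfloat16*>(ws_ptr);
  ws_ptr += staged_bytes;
  const auto& sinv = use_rowwise ? input.scale_inv : input.columnwise_scale_inv;
  const int64_t num_cols = use_rowwise ? input.data.shape.back()
                                       : input.columnwise_data.shape.back();
  const int64_t scale_stride = (sinv.shape.size() >= 2) ? sinv.shape[1] : (num_cols / 16);
  launch_dequant_fp4_to_bf16(data, scale_inv, staged, total_elems,
                             num_cols, scale_stride, stream);
  data = staged;
  dtype = DType::kBFloat16;
  scale_inv = nullptr;
};

stage_fp4_operand(param.A, param.A_scale_inv, param.Atype, inputA,
                  transa == CUBLAS_OP_T, static_cast<int64_t>(m) * k, a_bf16_bytes);
stage_fp4_operand(param.B, param.B_scale_inv, param.Btype, inputB,
                  transb == CUBLAS_OP_N, static_cast<int64_t>(k) * n, b_bf16_bytes);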

Contributor

@aris134 left a comment:
LGTM overall. I left one minor, non-blocking suggestion.

<< "type_a" << "type_b" << "type_d" << "bias_type" << "aux_type"
<< "lda" << "ldb" << "ldd" << "scale_mode" << "epi" << "comp" << "scale_type"
<< "ws_min" << "ws_max" << "algo_id" << "aidx";
<< "ws_min" << "ws_max" << "algo_id" << "aidx" << "fp4_alpha";
Collaborator

nit: please move it before ws_min; those last four parameters do not participate in the key.

Contributor Author

Moved in a6f4787.
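For clarity, the reordered header would look roughly like this, with fp4_alpha now among the key-participating fields (a sketch, assuming the stream-insertion style of the hunk above):

<< "type_a" << "type_b" << "type_d" << "bias_type" << "aux_type"
<< "lda" << "ldb" << "ldd" << "scale_mode" << "epi" << "comp" << "scale_type"
<< "fp4_alpha" << "ws_min" << "ws_max" << "algo_id" << "aidx";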

std::getline(is, scale, csv_sep);
is >> ws_min >> c >> ws_max >> c >> algo_id >> c >> algo_idx;
int fp4_alpha = 0;
if (is.peek() == csv_sep) {
Collaborator

Not needed: by contract, the cache is rebuilt with a new TE version, so no backward compatibility is required.

Contributor Author

Removed in a6f4787.
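With the compatibility branch gone and fp4_alpha moved before ws_min, the read side can parse it unconditionally; a sketch reusing the variable names from the hunk above:

std::getline(is, scale, csv_sep);
int fp4_alpha = 0;
is >> fp4_alpha >> c >> ws_min >> c >> ws_max >> c >> algo_id >> c >> algo_idx;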

return tile_dim * ((tile_dim / kNVecSMem) + 1) * kNVecSMem;
}
#else
constexpr int kTileDim = 128;
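For context, the returned size pads each row of the tile with one extra kNVecSMem-wide vector, a standard trick to skew rows across LDS banks and avoid conflicts on column-wise accesses. For example, assuming tile_dim = 128 and kNVecSMem = 8 (illustrative values), this yields 128 × (16 + 1) × 8 = 17408 elements instead of the unpadded 16384.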
Collaborator

kTileDim is already declared at line 143 and kThreadsPerBlock at line 148.

Contributor Author

Removed the duplication in a6f4787.

#ifdef __HIP_PLATFORM_AMD__
// On AMD, kTileDim_ is a template parameter of the kernel for runtime dispatch:
// gfx942: kTileDim_=64 (64 KB LDS, kThreadsPerBlock=128, 4 warps)
// gfx950: kTileDim_=128 (128 KB LDS, kThreadsPerBlock=256, 8 warps)
Collaborator

If the values are hardcoded per platform, they do not need to be template parameters; they can be constexpr constants guarded by platform-specific ifdefs.

Contributor Author

Replaced the template parameters with constexpr constants in a6f4787.
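A minimal sketch of that shape, using the values from the comment above (the __gfx950__ guard is an assumption; the actual check in a6f4787 may use a different architecture macro):

#ifdef __HIP_PLATFORM_AMD__
#if defined(__gfx950__)
constexpr int kTileDim = 128;          // 128 KB LDS
constexpr int kThreadsPerBlock = 256;  // 8 warps
#else  // gfx942 and other targets
constexpr int kTileDim = 64;           // 64 KB LDS
constexpr int kThreadsPerBlock = 128;  // 4 warps
#endif
#endif  // __HIP_PLATFORM_AMD__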

@matthiasdiener force-pushed the mdiener/nvfp4-gemm branch 7 times, most recently from 0f240ad to 9ed88ff on April 28, 2026 19:13

Labels

ci-level 1 (CI test level 1)

5 participants