[TE] Phase 2 of small-seq cross-attn integration: a separate cpp backend and a new jax api #542
VeeraRajasekhar wants to merge 22 commits into dev
Conversation
Integrate the CK team's unfused variable-length attention HIP kernels from
varlen_attn/ into Transformer Engine's ROCm fused-attn path as a specialized
path for cross-attention (Q length 1, KV length 2-16, large batch).
- Add fused_attn_smallseq.hpp and fused_attn_smallseq.cpp under
fused_attn_rocm/: declarations and implementation adapted from
varlen_attn/attn_fwd.cpp and attn_bwd.cpp (scores, mask+softmax, output;
grad_V, grad_attn, softmax bwd, grad_Q/grad_K). Runtime dispatch over
max_seqlen_kv in {2,4,6,8,12,16}, head_dim 128, BF16.
- Add fused_attn_smallseq.cpp to the ROCm fused-attn build in
transformer_engine/common/CMakeLists.txt.
- In fused_attn_ck_fwd: when THD and no bias, branch to small-seq path when
max_seqlen_q==1 and 2<=max_seqlen_kv<=16. On shape query (Aux_CTX_Tensors->size
== 0) skip get_runtime_max_seqlen (cu_seqlens pointers are null); use host
max_seqlen_kv and set output_S to attention-weights shape {max_tokens_q,
h_q, 1, runtime_max_seqlen_kv} and dtype QKV_type. On real run (size >= 2)
call get_runtime_max_seqlen then fused_attn_smallseq_fwd. Use sequence
count b_varlen = max_tokens_q (not segment count b) for get_runtime_max_seqlen,
output_S shape, workspace size, and small-seq fwd so varlen kernel indexing
matches Q and cu_seqlens_kv (THD may pass segment-level cu_seqlens; varlen
kernel expects sequence-level batch).
- In fused_attn_ck_bwd: same THD/small-seq condition. On workspace query
(workspace->data.dptr == nullptr) skip get_runtime_max_seqlen and use host
max_seqlen_kv; on real run call get_runtime_max_seqlen then
fused_attn_smallseq_bwd. Use b_varlen = max_tokens_q_bwd for
get_runtime_max_seqlen, workspace size, and small-seq bwd.
- Reuse the softmax LSE auxiliary buffer for attention weights in the small-seq
path (forward write, backward read).
- JAX attention.py: in NVTE_CK block, when THD and q_max_seqlen==1 and
kv_max_seqlen<=16 set softmax_shape = (*batch_shape, attn_heads,
q_max_seqlen, kv_max_seqlen) and softmax_dtype = q_dtype so Python aux
buffer matches C++ attention-weights convention.
- Add test_ck_unfused_smallseq_backend in tests/jax/test_fused_attn.py
(parametrized s_kv in {2,4,6,8,12,16}, b=30720, s_q=1, THD_THD_THD,
SeqDescFormat.Seqlens) and optional NVTE_LOG_CK_SMALLSEQ debug logging in
C++.
…port to small-seq kernels
- tests/jax: CK small-seq tests use a fixture to set/restore NVTE_FUSED_ATTN_CK_SMALLSEQ=1; parametrize dtype (BF16/FP16) and add sequence-packing cases (2048-2-4, 2-4096-8192); when the env var is set, num_segments_per_seq = max_seqlen_q for THD, else 2.
- JAX attention.py: THD softmax shape/dtype uses the small-seq path only when env=1, else the original layout.
- JAX attention.cpp: add env guard.
- fused_attn_smallseq: use TRANSFORMER_ENGINE_TYPE_SWITCH_16BIT for fwd/bwd; add FP16 (__half) support; fix __half*float with T(scale).
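The env-gated layout selection described above could look roughly like this in Python. A hypothetical sketch: `thd_softmax_layout` and its parameters are illustrative names, not the actual attention.py code.

```python
import os

def thd_softmax_layout(batch_shape, attn_heads, q_max_seqlen, kv_max_seqlen,
                       q_dtype, default_shape, default_dtype):
    # Use the attention-weights layout only when the env gate is on and the
    # small-seq shape conditions hold; otherwise keep the original layout.
    if (os.environ.get("NVTE_FUSED_ATTN_CK_SMALLSEQ") == "1"
            and q_max_seqlen == 1 and kv_max_seqlen <= 16):
        return (*batch_shape, attn_heads, q_max_seqlen, kv_max_seqlen), q_dtype
    return default_shape, default_dtype
```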
… backend

Refactor the ROCm small-sequence attention path so it is a first-class backend instead of branching from the generic CK fused-attention entry: add NVTE entry points and ROCm implementations under fused_attn_rocm, CMake wiring, and public declarations in fused_attn.h. Rename the small-seq sources to fused_attn_small_seq.* so filenames match the new API. Extend kernel dispatch to head sizes 128, 256, and 512.
Wire the explicit small-sequence path in JAX csrc.
- extensions.h: declare GetSmallSeqAttn{Forward,Backward}WorkspaceSizes,
SmallSeqAttn{Forward,Backward}FFI handlers, and XLA_FFI_DECLARE_HANDLER_SYMBOL
exports for XLA registration.
- attention.cpp (USE_ROCM): add PrepareSmallSeqAttnForwardAuxTensors /
PrepareSmallSeqAttnBackwardAuxTensors to build NVTETensorPack for small-seq
(softmax slot = attention-weights buffer layout, RNG slot per fused API
contract); memset ragged output/softmax aux as needed for THD.
- GetSmallSeqAttnForwardWorkspaceSizes / GetSmallSeqAttnBackwardWorkspaceSizes:
gate on nvte_is_small_seq_attn_supported, return minimal forward workspace and
nvte_fused_attn_small_seq_bwd_workspace_size-backed backward scratch.
- SmallSeqAttnForwardImpl / SmallSeqAttnBackwardImpl: reuse FUSED_ATTN_IMPL_COMMON_BLOCK
for THD cu_seqlens/offsets, call nvte_fused_attn_small_seq_fwd / _bwd.
- SmallSeqAttnForwardFFI / SmallSeqAttnBackwardFFI + XLA_FFI_DEFINE_HANDLER_SYMBOL:
mirror FusedAttn*FFI attribute unpacking so JAX can invoke the dedicated backend.
Add SmallSeqAttnFwdPrimitive / SmallSeqAttnBwdPrimitive in cpp_extensions/attention.py
so JAX compiles and lowers to the dedicated small-seq FFI without
nvte_get_fused_attn_backend or generic fused-attn workspace probing.
- abstract: HIP-only; validate THD_THD_THD layout, no bias/dropout, and supported head dims; softmax_aux shape (*batch, heads, q, min(kv, 16))
in Q dtype; workspace from get_small_seq_attn_{fwd,bwd}_workspace_sizes.
- lowering: ffi_lowering to te_small_seq_attn_{forward,backward}_ffi with the
same flattened attrs pattern as generic fused attention.
- fused_attn_small_seq_fwd / fused_attn_small_seq_bwd: thin bind helpers;
export via __all__. register_primitive for both primitives.
Expose ROCm small-sequence cross-attention at the JAX layer next to fused_attn.
- Custom primitive _fused_attn_small_seq with forward/backward rules calling cpp_extensions fused_attn_small_seq_fwd/bwd (tex.*).
- fused_attn_small_seq(): user entry point taking (q, k, v), a bias slot, SequenceDescriptor, seed, and mask/layout/scaling/dropout/is_training; targets the explicit small-seq backend.
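As a rough illustration of what this entry point computes per query token, here is a pure-Python reference under the small-seq assumptions (q length 1, at most 16 KV positions). This is not TE code, just the math: softmax over the KV scores, then a weighted sum of V.

```python
import math

def small_seq_cross_attn_ref(q, k, v, scale=1.0):
    """Reference forward for one query token. q: [d]; k, v: [s_kv][d]."""
    # Scaled dot-product scores against each KV position.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, krow)) for krow in k]
    # Numerically stable softmax over at most 16 positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    probs = [e / denom for e in exps]  # attention weights, shape [s_kv]
    # Output is the probability-weighted sum of V rows.
    out = [sum(p * vrow[j] for p, vrow in zip(probs, v))
           for j in range(len(v[0]))]
    return out, probs
```

The `probs` vector is exactly what the backend stashes in the reused softmax-aux buffer (attention weights rather than LSE).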
```python
                "the F16_arbitrary_seqlen backend."
            )

    def _setup_thd_segments_small_seq(self, generate_random_segment_ids):
```
Since our customer already mentioned that their s_q is 1 for each segment, without padding at all, and their s_kv <= 16 including padding, we can create our own small-seq input generation to separate it from the non-small-seq tests.
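A generator along those lines might look like this. Hypothetical sketch under the stated customer constraints (every Q segment exactly 1 token, KV segments at most 16); names are illustrative, not the test-suite API.

```python
import random

def make_small_seq_thd_inputs(batch, s_kv_max):
    """Build cu_seqlens for the small-seq regime: Q segments of length
    exactly 1 (no padding), KV segments of length 2..s_kv_max (<= 16)."""
    q_seqlens = [1] * batch
    kv_seqlens = [random.randint(2, s_kv_max) for _ in range(batch)]
    cu_seqlens_q = [0]
    cu_seqlens_kv = [0]
    for lq, lkv in zip(q_seqlens, kv_seqlens):
        cu_seqlens_q.append(cu_seqlens_q[-1] + lq)
        cu_seqlens_kv.append(cu_seqlens_kv[-1] + lkv)
    return cu_seqlens_q, cu_seqlens_kv
```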
```cpp
const Tensor *input_cu_seqlens_kv = convertNVTETensorCheck(cu_seqlens_kv);
const Tensor *input_cu_seqlens_q_padded = convertNVTETensorCheck(cu_seqlens_q_padded);
const Tensor *input_cu_seqlens_kv_padded = convertNVTETensorCheck(cu_seqlens_kv_padded);
const Tensor *input_rng_state = convertNVTETensorCheck(rng_state);
```
Do we even support dropout with this rng_state?
```cpp
workspace_bytes *= fused_attn_rocm::nvte_dtype_size(wkspace->data.dtype);
NVTE_CHECK(workspace_bytes >= req_bytes, "nvte_fused_attn_small_seq_bwd: workspace too small.");
```
nit: now that we don't have the mixed old CK + new small_seq flow, will we still allocate more workspace than we need?
nit: can we have their repo as a 3rd-party dependency and then reference their .h? No need to do this now.
```cpp
size_t fused_attn_small_seq_bwd_workspace_size(size_t b,
                                               size_t h_q,
                                               size_t max_seqlen_kv,
                                               DType dtype) {
  constexpr size_t elt_size = 2u;  // BF16 and FP16 are 2 bytes
  return b * h_q * 1 * std::min(max_seqlen_kv, size_t(16)) * elt_size;
}
```
Let's have a comment on what the actual bwd workspace is for, namely why it's b * h_q * 1 * std::min(max_seqlen_kv, size_t(16)) * elt_size.
Probably it's because it's for dS or dP?
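Echoing that guess (the dS/dP interpretation is the reviewer's, not confirmed by the source), the formula reads as one attention-probability-gradient tile per (sequence, head), clamped to 16 KV positions, in a 2-byte dtype:

```python
def small_seq_bwd_workspace_bytes(b, h_q, max_seqlen_kv, elt_size=2):
    # One [1, min(s_kv, 16)] tile of dP/dS per (sequence, head), stored in a
    # 16-bit dtype (BF16/FP16) -- hence b * h_q * 1 * min(kv, 16) * 2 bytes.
    return b * h_q * 1 * min(max_seqlen_kv, 16) * elt_size
```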
```python
del config, result_infos
q_spec = get_padded_spec(arg_infos[0])
out_sharding = NamedSharding(mesh, PartitionSpec(*q_spec))
softmax_aux_sharding = NamedSharding(
```
Not sure how the sharding rule needs to be changed.
```python
# NVTE uses b = cu_seqlens_q.shape[0] - 1 (one packed segment per slot), not
# reduce(batch_shape). E.g. seqpack with max_seqlen_q > 1 yields cu length
# batch*segments + 1 while Q still has the leading logical batch only.
small_seq_workspace_batch = q_seqlen_or_cu_seqlen_aval.shape[0] - 1
```
I think at the JAX side we don't need to read the actual values of cu_seqlens.
```cpp
NVTE_CHECK(bias_batch == 0 && bias_heads == 0,
           "SmallSeqAttnForwardImpl: bias not supported for small-seq.");

auto bias_tensor = TensorWrapper(bias, bias_shape, dtype);
```
```cpp
PrepareSmallSeqAttnForwardAuxTensors(&aux_output_tensors, input_batch, attn_heads, q_max_seqlen,
                                     kv_max_seqlen, dtype, softmax_aux, rng_state);

auto dummy_page_table_tensor = TensorWrapper(nullptr, std::vector<size_t>{1}, DType::kInt32);
```
If we know it's a dummy and our nvte_fused_attn_small_seq_fwd does not actually use it, we can skip passing it.
Same for other unused tensors, for example rng_state.
```cpp
auto bias_tensor = TensorWrapper(bias, bias_shape, dtype);

if (is_ragged) {
```
Currently we only support is_ragged. Later we will support BSHD. Please add a TODO comment. Otherwise, we should move all of the following code into this if branch.
|
This branch is currently missing commits.
Micky774 left a comment:
Currently, if I'm understanding this correctly, the only way to utilize the new backend is via the Python-level fused_attn_small_seq entrypoint -- is this desired? We have no means of automatically dispatching to this backend through existing API.
```cpp
const size_t runtime_s_q = static_cast<size_t>(ck_fused_attn::get_runtime_max_seqlen(
    b, dev_ptr_cu_seqlens_q, nullptr, workspace, stream));
const size_t runtime_s_kv = static_cast<size_t>(ck_fused_attn::get_runtime_max_seqlen(
    b, dev_ptr_cu_seqlens_kv, nullptr, workspace, stream));
```
This calls ck_fused_attn::get_runtime_max_seqlen unconditionally and breaks AOTriton-only builds.
Currently the code in this file unconditionally pulls in tex.fused_attn_small_seq_{f,b}wd -- we should guard these funcs.
```python
outer_primitive = None

@staticmethod
def abstract(
```
Can we also proactively guard against GQA/MQA here?
```python
)

@staticmethod
def impl(
```
_fix_len_take, convert_to_2d, seqlen/offset processing are all copied between FusedAttn*Primitive.impl and SmallSeqAttn*Primitive.impl -- let's try to reuse them.
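One way to deduplicate is to lift the shared processing into a module-level helper both impls call. A hypothetical sketch of the simplest piece (per-segment lengths to cumulative offsets); the real shared code would also cover _fix_len_take and convert_to_2d:

```python
def lens_to_cu_seqlens(seqlens):
    # Convert per-segment lengths to cumulative offsets once, for reuse by
    # both FusedAttn*Primitive.impl and SmallSeqAttn*Primitive.impl.
    cu = [0]
    for s in seqlens:
        cu.append(cu[-1] + s)
    return cu
```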
```cpp
void PrepareSmallSeqAttnBackwardAuxTensors(NVTETensorPack *tensor_pack, const size_t input_batch,
                                           const size_t attn_heads, const size_t q_max_seqlen,
                                           const size_t kv_max_seqlen, DType dtype, void *softmax_buf,
                                           void *rng_state_buf) {
  PrepareSmallSeqAttnForwardAuxTensors(tensor_pack, input_batch, attn_heads, q_max_seqlen,
                                       kv_max_seqlen, dtype, softmax_buf, rng_state_buf);
}
```
This wrapper is trivial, can we forward directly?
```cpp
    b, dev_ptr_cu_seqlens_q, nullptr, workspace, stream));
const size_t runtime_s_kv = static_cast<size_t>(ck_fused_attn::get_runtime_max_seqlen(
    b, dev_ptr_cu_seqlens_kv, nullptr, workspace, stream));
if (const char *env_ck = std::getenv("NVTE_LOG_CK_CONFIG");
```
At this API level we use NVTE_LOG_FUSED_ATTN_CONFIG.
Yes @Micky774, this was intentional. Those missing commits are relevant to some of the corner cases which we decided not to cover.
We cannot guarantee the two conditions, i.e., max_seqlen_q == 1 and 2 <= max_seqlen_kv <= 16, ahead of time, since they are only known at runtime. For example, these two tests do not satisfy the conditions statically, yet the conditions may hold at runtime; this is what blocks us from automatically dispatching to this backend. That being said, we will throw assertion errors at runtime if the conditions are not met.
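One plausible shape for that runtime guard (a hedged Python sketch only; the real check would live in the C++ backend, and the exact per-segment KV policy is not pinned down by this thread):

```python
def assert_small_seq_runtime(cu_seqlens_q, cu_seqlens_kv):
    """Verify every Q segment has length 1 and the runtime max KV segment
    length falls in [2, 16]; raise otherwise."""
    q_lens = [b - a for a, b in zip(cu_seqlens_q, cu_seqlens_q[1:])]
    kv_lens = [b - a for a, b in zip(cu_seqlens_kv, cu_seqlens_kv[1:])]
    if any(l != 1 for l in q_lens):
        raise ValueError("small-seq backend requires seqlen_q == 1 per segment")
    if not (2 <= max(kv_lens) <= 16):
        raise ValueError("small-seq backend requires 2 <= max seqlen_kv <= 16")
```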
If we're providing a static entry point anyway, I suspect this means that we expect users to know ahead of time whether their data complies with such a backend. In which case, even mediating with an environment variable should be feasible, right? I'm just trying to understand the actual user experience we're trying to support here.
Alternatively, even just adding
Previously I discussed this with Veera and also our customers. There are two or three things that prevent us from using a unified API as in the previous CK flow. Do you have any good ideas to work around those?