
IFU v2.14.dev0 #557 (Draft)

ipanfilo wants to merge 114 commits into dev from IFU-dev-20260315-v2.14

Conversation

@ipanfilo
Collaborator

Description

IFU from 2026-03-15 upstream commit 708d7c1
https://github.com/ROCm/frameworks-internal/issues/16312

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits January 20, 2026 09:14
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* update FE to 1.17

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/jax/test_fused_attn.py

fix typo

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the AI fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax extension call

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix version check in Jax test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix non-/determinism logic and CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix formatting

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce Jax CI tests for determinism

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Implemented persistent nvfp4 kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix FP4 guard in ptx

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per PR review. Added parameter to turn off the persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use the tuned kernel also for the rowwise only quantization

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Macros renaming

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* PoC of the changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
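
The workaround above stems from a general Python rule: implicit special-method lookups such as `__setattr__` go through the type, not the instance, so the hook cannot be installed per object at runtime. A minimal standalone illustration (plain Python, not TE code):

```python
class Module:
    pass

m = Module()
# Installing __setattr__ on the instance has no effect on attribute assignment:
# special methods are looked up on the type, not in the instance __dict__.
m.__setattr__ = lambda name, value: print("never called")
m.x = 1
print(m.x)  # 1, and nothing was printed

# Patching the class (or defining __setattr__ on it up front) does work.
def tracked_setattr(self, name, value):
    print(f"setting {name}")
    object.__setattr__(self, name, value)

Module.__setattr__ = tracked_setattr
m.y = 2  # prints "setting y"
```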

* Simpler solution and fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Speculative moving the validate_name to the constructor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in observable
way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix cb.CUDAOptions usage for Triton 3.6.0

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
return tokens_per_experts always

Signed-off-by: tdophung <tdophung@nvidia.com>
* Update THD sink attention logic for newer cudnn versions

THD Sink attention is supported in 9.18.0

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update thd sink attention logic for cp>1

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test for thd + sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address comments

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not skip thd cp sink attention test

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable deterministic mode for sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* SWA (left, right) with FusedAttention changes cherry-picked from NVIDIA/TransformerEngine#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test_kv_cache failures

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove dead code and make conditions better

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feedback from Charlene

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing fields

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Use correct block size for workspace in row id map creation, also shard workspace correctly based on 2nd dim of routing_map/row_id map

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

* reduce size of the largest test case in the single-GPU scenario to fit on L40 and A100 in the CI lineup

Signed-off-by: tdophung <hanhdp99@gmail.com>

---------

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
Signed-off-by: tdophung <hanhdp99@gmail.com>
Co-authored-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
* Disabled the tuned NVFP4 kernels

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math in cpp tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Expose option for custom op fusions

Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tests for custom ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings and numerical test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak pattern matching logic with fixed window sizes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use TF32 tols in fused op tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Backpropagate fixes from #2622

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(examples): te_llama compatibility with HuggingFace transformers >= 4.57

The te_llama.py example was failing with HuggingFace transformers 4.57+
due to API changes in how decoder layer outputs are handled.

Changes:
- Handle case where hidden_states is passed as a tuple (older HF versions)
- Return tensor directly instead of wrapped in tuple (HF 4.57+ expects this)
- Fix regex pattern to use raw string (fixes SyntaxWarning)

Error fixed:
  AttributeError: 'tuple' object has no attribute 'contiguous'

Tested with:
- transformer_engine 2.5.0
- transformers 4.57.3
- PyTorch container nvcr.io/nvidia/pytorch:25.08-py3

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
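
A minimal sketch of the compatibility handling described above, assuming a wrapper around a TE decoder layer (illustrative names, not the exact te_llama.py code):

```python
import torch

def decoder_layer_forward(te_layer, hidden_states, *args, **kwargs):
    # Older transformers versions may hand the previous layer's output over
    # as a tuple; unwrap it before calling .contiguous().
    if isinstance(hidden_states, tuple):
        hidden_states = hidden_states[0]
    out = te_layer(hidden_states.contiguous(), *args, **kwargs)
    # transformers >= 4.57 expects the decoder layer to return the tensor
    # directly rather than a one-element tuple.
    return out
```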

* docs(te_llama): add requirements.txt

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* fix(docs): add missing notebook output names

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

---------

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
…x 404 error (#2625)

* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename mnist dataset path

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add dataset manifest

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* jit bug fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FP8 scale support and fix alignment for grouped GEMM

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: code cleanup and NULL C support

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: per-matrix alpha/beta support

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
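
A pure-PyTorch sketch of the per-matrix scaling semantics described above, i.e. D_i = alpha[i] * (A_i @ B_i) + beta[i] * C_i (reference semantics only, not the cuBLAS-backed kernel):

```python
import torch

def grouped_gemm_reference(A_list, B_list, C_list, alpha, beta):
    # alpha/beta must have exactly one entry per matrix in the group.
    assert len(alpha) == len(A_list) and len(beta) == len(A_list)
    return [a * (A @ B) + b * C
            for A, B, C, a, b in zip(A_list, B_list, C_list, alpha, beta)]
```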

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix alpha/beta numel - use SimpleTensor::numel()

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor: move grouped GEMM to separate file and cleanup API

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/config.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changed

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactored hopper tensor selection

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
fix wheel

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* version change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Code drop: Update recipes documentation and remove custom recipes from low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix SVG css import path for diagrams

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix JAX memory usage .out files with correct output

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* responded to comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* applied suggestions from greptile

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* year change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* jax compute capability fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Support building with headers from nvidia wheels

There are two changes:
1. `import nvidia` returns a namespace package with `__file__` equal to `None`
2. Add a way to force headers from nvidia wheels via an environment variable. Without that envvar, it's practically impossible when CUDA is installed system-wide.

I successfully built the package with torch using the following `uv` configuration:
```
[tool.uv.extra-build-dependencies]
"transformer-engine-torch" = [
    "ninja",
    "nvidia-cuda-crt==13.0.88",
    "nvidia-cuda-cccl==13.0.85",
    { requirement = "torch", match-runtime = true },
    { requirement = "pytorch-triton", match-runtime = true },
    { requirement = "nvidia-cusolver", match-runtime = true },
    { requirement = "nvidia-curand", match-runtime = true },
    { requirement = "nvidia-cublas", match-runtime = true },
    { requirement = "nvidia-cusparse", match-runtime = true },
    { requirement = "nvidia-cudnn-cu13", match-runtime = true },
    { requirement = "nvidia-nvtx", match-runtime = true },
    { requirement = "nvidia-cuda-nvrtc", match-runtime = true },
    { requirement = "nvidia-cuda-runtime", match-runtime = true },
]
```

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
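
An illustrative sketch (not the exact setup logic) of why the namespace-package detail above matters: `nvidia.__file__` is `None`, so include directories from the installed nvidia-* wheels have to be discovered through `__path__`:

```python
import importlib
from pathlib import Path

def nvidia_wheel_include_dirs():
    nvidia = importlib.import_module("nvidia")  # namespace package, __file__ is None
    dirs = []
    for base in map(Path, nvidia.__path__):
        # Each installed nvidia-* wheel contributes a subpackage such as
        # nvidia/cudnn/include or nvidia/cublas/include.
        dirs.extend(str(p) for p in base.glob("*/include") if p.is_dir())
    return dirs
```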

* Apply suggestion from @ksivaman

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed scaling-factor computation for FP32 to match the reference implementation.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Uncommented the tuned kernel path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* year update in license

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* Rebased to main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the year to 2026

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added compilation guards

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added BWD pass

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added dbias and dact tests. Refactoring.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added grouped MXFP8 DACT and ACT API and tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed a typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes from the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed requirement for last dim from mod128 to mod32

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added alignment checks when tensor descriptors are modified

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
bucket max_b with more granularity when >512

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* expand troubleshooting docs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* add grad reduce api for cuda graph hook

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* fix code consistency

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Fix the compilation warnings for the PyTorch extension

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
roycho96 and others added 28 commits March 9, 2026 11:37
…uter backward kernels (#2745)

Remove redundant grad_logits zero-initialization in fused router backward kernels

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
…712)

Enable dequantization from just columnwise data

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…nput (#2746)

* Fix cross_entropy_forward stride guard for non-contiguous input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>

* Add regression test for non-contiguous transposed input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
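
An illustrative shape of the regression test mentioned above; `fused_cross_entropy` stands in for the actual Transformer Engine entry point exercised by the test:

```python
import torch

def check_non_contiguous_transposed(fused_cross_entropy):
    logits = torch.randn(128, 512, device="cuda")
    labels = torch.randint(0, 128, (512,), device="cuda")
    # A transposed view is non-contiguous (swapped strides), which previously
    # tripped the stride guard in cross_entropy_forward.
    transposed = logits.t()  # shape (512, 128)
    assert not transposed.is_contiguous()
    loss_view = fused_cross_entropy(transposed, labels)
    loss_copy = fused_cross_entropy(transposed.contiguous(), labels)
    torch.testing.assert_close(loss_view, loss_copy)
```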

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Implemented the kernel with split dbias

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed constraints on the last dimension

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added notes on group tensor restrictions into documentation

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed pointer

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed kernel grid size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache size limit

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Revert build.yml changes unrelated to deploy nightly docs

Restore .github/workflows/build.yml to upstream/main state.
Only deploy_nightly_docs.yml changes are relevant to this PR.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
… vmapped seg offsets (#2692)

* Fix the batcher for the case where the received segment ids are batched/vmapped while the TE-constructed segment positions are not, which caused mismatches in impl()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Fix the shape check for assert

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the batcher logic to check for q and kv seg ids separately

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Remove batcher logic to expand segment pos. Keep the shape check asserts.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for vmapped seg id and non vmapped seg pos when computing the seqlens and offsets for fused attn

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Undo batcher check logic for seg pos and seg ids as it is already moved to get_seqlens_and_offsets()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Remove unnecessary assert check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Code clean up

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Use partial instead of a single use function

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…t aligned (#2747)

* fallback

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* warn once

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [Debug] Pass tp_size to DebugQuantizer and use it in get_reduction_params

Use tp_size to determine whether tensor parallelism is active instead of
checking tp_group is None (which is ambiguous since None means world group
in torch.distributed). Also add tp_size to the backward-compat kwargs
filtering in call_feature so custom features without tp_size in their
inspect_tensor signature continue to work.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
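
A minimal sketch of the backward-compatible kwargs filtering described above: `tp_size` is only forwarded to features whose `inspect_tensor` signature accepts it (names follow the commit message; the actual debug-tools code differs):

```python
import inspect

def call_feature(feature, tensor, **kwargs):
    accepted = inspect.signature(feature.inspect_tensor).parameters
    # Drop kwargs (e.g. tp_size) that an older custom feature does not accept.
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return feature.inspect_tensor(tensor, **filtered)
```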

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add CPU offloading documentation

Documents the get_cpu_offload_context() API with examples for basic usage,
manual synchronization, and CUDA graphs integration. Adds new
other_optimizations section to the documentation structure.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
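
A minimal usage sketch of the documented API; the two-object return value and the per-layer pattern follow Transformer Engine's examples, while the parameter values here are placeholders and exact defaults may differ:

```python
import torch
import transformer_engine.pytorch as te

# Context manager plus synchronization function, as documented above.
offload_ctx, sync_function = te.get_cpu_offload_context(
    enabled=True,
    num_layers=2,     # layers whose activations are offloaded
    model_layers=4,   # total number of layers in the model
)

layers = [te.TransformerLayer(1024, 4096, 16) for _ in range(4)]
x = torch.randn(8, 32, 1024, device="cuda", requires_grad=True)
for layer in layers:
    with offload_ctx:
        x = layer(x)
    x = sync_function(x)  # commit the offload between layers
x.sum().backward()
```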

* Fix copyright year to 2026 in cpu_offloading.rst

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Add missing copyright headers to CPU offloading example files

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Improve cpu_offloading docs: legacy params note, manual sync clarifications

- Add comment about legacy parameters in function signature
- Clarify that num_layers is ignored (not forbidden) when manual_synchronization=True
- Document num_layers constraint: must be <= model_layers-2 for overlap
- Add note that offload_stream.synchronize() must precede release_activation_forward_gpu_memory()

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Remove incorrect note about offload_stream.synchronize()

release_activation_forward_gpu_memory() internally waits for offload
completion via CUDA events - explicit synchronize() is not required.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/features/other_optimizations/cpu_offloading/cpu_offloading.rst

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* update

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs: review fixes for cpu offloading documentation

- Export mark_not_offload and ManualOffloadSynchronizer from
  transformer_engine.pytorch and add them to API reference
- Fix condition 3 description (xi is needed as input to backward,
  not computed after it completes)
- Fix offload_weights docstring default value (False, not True)
- Fix legacy params comment to link to API reference
- Fix RST spacing around inline code in Fig. 3/4 captions
- Add note explaining retain_pinned_cpu_buffers history and
  pytorch#167507 fix landing in PyTorch 2.11
- Fix typo: seqeuences -> sequences
- Add trailing newline to other_optimizations/index.rst

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ling (#2741)

add guard at bisected jax version where lower is segfault

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint: remove unused lru_cache import and fix import order in helper.py

Signed-off-by: tdophung <tdophung@nvidia.com>

* Guard Triton tests against JAX < 0.8.0 using release version check

- Add version_utils.py with is_triton_extension_supported() checking JAX >= 0.8.0
  (release version, not dev snapshot) and TRITON_EXTENSION_MIN_JAX_VERSION constant
- Add pytest.mark.triton marker and conftest hook to skip marked tests on old JAX
- Add require_triton() for module-level skipping in test files
- Rewrite triton_extensions to use is_triton_extension_supported() instead of
  direct jaxlib dev-version comparison

Signed-off-by: tdophung <tdophung@nvidia.com>
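
A sketch of the release-version gate described above (assumed implementation; the real helper lives in transformer_engine/jax/version_utils.py):

```python
from packaging.version import Version
import jax

TRITON_EXTENSION_MIN_JAX_VERSION = "0.8.0"

def is_triton_extension_supported() -> bool:
    # PEP 440 ordering places "0.8.0.devN" below "0.8.0", so dev snapshots
    # of the minimum release do not satisfy the check.
    return Version(jax.__version__) >= Version(TRITON_EXTENSION_MIN_JAX_VERSION)
```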

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: allow_module_level, drop is_triton_extension_supported re-export, revert test.sh

- require_triton(): add allow_module_level=True to pytest.skip() so module-level
  calls on old JAX produce a proper skip instead of a collection failure
- Remove is_triton_extension_supported from triton_extensions/utils.py __all__:
  importing triton_extensions on JAX < 0.8.0 raises immediately, so re-exporting
  the check from there defeats its purpose; callers should import directly from
  transformer_engine.jax.version_utils
- Revert qa/L0_jax_lint/test.sh TE_PATH to /opt/transformerengine (local dev
  path was accidentally committed; pass TE_PATH= at invocation time instead)

Signed-off-by: tdophung <tdophung@nvidia.com>

* Address review: move version guard before gpu_triton import, fix __all__ and hardcoded version

- Move is_triton_extension_supported() guard before the gpu_triton import block
  with a comment clarifying the segfault is at dispatch time, not import time
- Remove _jax_version_meet_requirement from version_utils __all__ (private helper,
  not a public API; callers import it explicitly as needed)
- Use TRITON_EXTENSION_MIN_JAX_VERSION constant in conftest marker description
  instead of hardcoded '0.8.0'

Signed-off-by: tdophung <tdophung@nvidia.com>

* address more comments

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…(#2751)

* Support configurable number of philox rounds for SR during build

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Added better error messages

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Update transformer_engine/pytorch/distributed.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* First pass

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Cleaning the dtype usage in dequantize and distributed

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add fake_dtype to get_metadata

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix to make_like

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…e_function_fwd to fp32 (#2752)

* Fixed aval for intermediate results (softmaxed/sigmoided logits) passed as residuals to CompType, which is currently fp32. This prevents incorrect reading of this buffer when the logits dtype is not fp32

Signed-off-by: tdophung <tdophung@nvidia.com>

* address comments on inconsistency in style and NVTE_CHECK for fp32 type

Signed-off-by: tdophung <tdophung@nvidia.com>

* revert the remaining Comptype checking, address greptile suggestion

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Initial commit to pass scale as Tensor for multi_tensor_scale op

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Revert "Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled"

This reverts commit 74a9bccf0fadd4159f70d28da49a533ea7c76108.

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change noop_flag to is_infinite

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Update transformer_engine/pytorch/csrc/extensions.h

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove duplication

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add test for scale tensor cuda

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Random padding condition shouldn't be done for mxfp8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Accidental line removal added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…tensors. #2120" (#2673)

* Adds dst.dtype information in copy_ method of quantized tensors.

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>

* Update transformer_engine/pytorch/tensor/quantized_tensor.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>

* Update transformer_engine/pytorch/quantized_tensor.py

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix reference tensor copy

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Zhiyi Su <dantesuu@gmail.com>
Co-authored-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fuse scale + 0 + cumulative sum for splits to offsets calc

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
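
Reference semantics of the fused computation named above, written out in PyTorch (the actual change fuses these steps into a single kernel):

```python
import torch

def splits_to_offsets_reference(splits: torch.Tensor, scale: int) -> torch.Tensor:
    scaled = splits * scale
    zero = torch.zeros(1, dtype=scaled.dtype, device=scaled.device)
    # Offsets are the prefix sum of the scaled splits with a leading zero,
    # so offsets[i] marks where split i starts.
    return torch.cat([zero, scaled.cumsum(0)])
```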

* Add unit test and fix bug in kernel for >256 size

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix race

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* check for logical_last_dim > 0

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* suggestions

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Added new people to CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removing duplicate

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
… parallelism (#2688)

* Error out if constructing LayerNormLinear with row tensor parallelism

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable Userbuffers test for row-TP LayerNormLinear

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Enable cgemm + FP8 tests

* Implement CGEMM + MXFP8

* added size check for mxfp8

* added tols for assertions

* update tests with recipes

* enable tests + is_quantize_recipe_supported

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…kScaling and Float8BlockScaling quantized model init. (#2753)

* Updates FusedAdam with FSDP2 and MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes xfailing unit test for MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses comments related to reset parameters and guard against self.capturable

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e unit test

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds test to non meta device init

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* attempts to add float8block scaling fsdp hooks

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e test for Float8BlockScaling

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses review comments and code cleanup

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* more review comments addressed

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes unused block_len param

Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>

* fixes failing unit test because we still need to xfail nvfp4 dcp

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint - replacing todo with note

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* fix for async dcp checkpointing

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* Apply suggestions from code review

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>

* Update transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address Greptile review feedback: defensive guards for edge cases

- Add _quantizer None guard in new_empty dispatch
- Replace self.is_cpu with explicit _data/_transpose checks in __reduce_ex__
- Make get_metadata() safe for cleared tensors (both _data and _transpose None)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
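
A hedged sketch of the defensive-guard style described above, reusing the attribute names from the commit message but with simplified stand-in logic (not the actual quantized tensor / `float8_tensor_storage.py` code):

```python
import torch

class QuantizedTensorGuardSketch:
    """Illustrative only: shows the None-guard pattern, not TE's real class."""

    def __init__(self, data=None, transpose=None, quantizer=None):
        self._data = data
        self._transpose = transpose
        self._quantizer = quantizer

    def new_empty(self, shape):
        # Guard: without a quantizer, fall back to a plain empty tensor
        # instead of dereferencing None.
        if self._quantizer is None:
            return torch.empty(shape)
        return self._quantizer(torch.empty(shape))  # hypothetical quantize call

    def is_cleared(self):
        # Explicit _data/_transpose checks instead of relying on a device flag.
        return self._data is None and self._transpose is None

    def get_metadata(self):
        # Safe for cleared tensors: return an empty dict rather than raising.
        if self.is_cleared():
            return {}
        return {
            "has_data": self._data is not None,
            "has_transpose": self._transpose is not None,
        }
```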

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ped Tensor Swizzling (#2669)

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove changes not needed for bf16

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* keep only pytorch binding for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* linting error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add fast accumulator support

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix the test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Random padding condition shouldn't be applied for MXFP8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* all changes for grouped gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Accidentally removed line added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with a moving-pointer pattern (see the sketch after this commit message)
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
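
One item in the list above, the moving-pointer workspace refactor, is a common allocation pattern: carve several sub-buffers out of one pre-allocated workspace by advancing a single offset. A language-neutral Python sketch of the idea (the real change lives in `cublaslt_grouped_gemm.cu` and works with device pointers, so the names and sizes here are illustrative):

```python
def carve_workspace(workspace: bytearray, sizes):
    """Hand out consecutive sub-buffers from one pre-allocated workspace.

    Rather than computing each buffer's offset by hand, a single offset
    (the "moving pointer") advances by the size of every buffer already
    handed out, so adding or reordering buffers cannot desynchronize the
    offset arithmetic.
    """
    views, offset = [], 0
    for size in sizes:
        if offset + size > len(workspace):
            raise ValueError("workspace too small for requested buffers")
        views.append(memoryview(workspace)[offset:offset + size])
        offset += size  # advance the moving pointer
    return views

# Example: three sub-buffers carved from a single 1 KiB workspace.
ws = bytearray(1024)
scales_a, scales_b, setup = carve_workspace(ws, [256, 512, 128])
```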

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed merge conflict handling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor change

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* forgot to add an `or`

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve merge conflicts

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address minor review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove unnecessary code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* one line that broke everything :(

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert caching changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix minor bug from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@ipanfilo ipanfilo requested review from Micky774 and aris134 April 21, 2026 21:17