
IFU v2.14.dev0 #557 (Draft)

ipanfilo wants to merge 114 commits into dev from IFU-dev-20260315-v2.14

Conversation

@ipanfilo
Collaborator

Description

IFU from 2026-03-15 upstream commit 708d7c1
https://github.com/ROCm/frameworks-internal/issues/16312

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits January 20, 2026 09:14
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* update FE to 1.17

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/jax/test_fused_attn.py

fix typo

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the AI fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax extension call

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix version check in Jax test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix non-/determinism logic and CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix formatting

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce Jax CI tests for determinism

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Implemented persistent nvfp4 kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix FP4 guard in ptx

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per PR review. Added parameter to turn off the persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use the tuned kernel also for the rowwise only quantization

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Macros renaming

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* PoC of the changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
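
The workaround above stems from a general Python rule: implicit special-method lookups such as `__setattr__` go through the type, not the instance, so the hook cannot be installed per object at runtime. A minimal standalone illustration (plain Python, not TE code):

```python
class Module:
    pass

m = Module()
# Installing __setattr__ on the instance has no effect on attribute assignment:
# special methods are looked up on the type, not in the instance __dict__.
m.__setattr__ = lambda name, value: print("never called")
m.x = 1
print(m.x)  # 1, and nothing was printed

# Patching the class (or defining __setattr__ on it up front) does work.
def tracked_setattr(self, name, value):
    print(f"setting {name}")
    object.__setattr__(self, name, value)

Module.__setattr__ = tracked_setattr
m.y = 2  # prints "setting y"
```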

* Simpler solution and fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Speculative moving the validate_name to the constructor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in observable
way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix cb.CUDAOptions usage for Triton 3.6.0

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
return tokens_per_experts always

Signed-off-by: tdophung <tdophung@nvidia.com>
* Update THD sink attention logic for newer cudnn versions

THD Sink attention is supported in 9.18.0

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update thd sink attention logic for cp>1

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test for thd + sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address comments

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not skip thd cp sink attention test

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable deterministic mode for sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* SWA (left, right) with FusedAttention changes cherry-picked from NVIDIA/TransformerEngine#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test_kv_cache failures

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove dead code and make conditions better

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feedback from Charlene

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing fields

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Use correct block size for workspace in row id map creation, also shard workspace correctly based on 2nd dim of routing_map/row_id map

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

* reduce size of the largest test case in the single-GPU scenario to fit on L40 and A100 in the CI lineup

Signed-off-by: tdophung <hanhdp99@gmail.com>

---------

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
Signed-off-by: tdophung <hanhdp99@gmail.com>
Co-authored-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
* Disabled the tuned NVFP4 kernels

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math in cpp tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Expose option for custom op fusions

Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tests for custom ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings and numerical test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak pattern matching logic with fixed window sizes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use TF32 tols in fused op tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Backpropagate fixes from #2622

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(examples): te_llama compatibility with HuggingFace transformers >= 4.57

The te_llama.py example was failing with HuggingFace transformers 4.57+
due to API changes in how decoder layer outputs are handled.

Changes:
- Handle case where hidden_states is passed as a tuple (older HF versions)
- Return tensor directly instead of wrapped in tuple (HF 4.57+ expects this)
- Fix regex pattern to use raw string (fixes SyntaxWarning)

Error fixed:
  AttributeError: 'tuple' object has no attribute 'contiguous'

Tested with:
- transformer_engine 2.5.0
- transformers 4.57.3
- PyTorch container nvcr.io/nvidia/pytorch:25.08-py3

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
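
A minimal sketch of the compatibility handling described above, assuming a wrapper around a TE decoder layer (illustrative names, not the exact te_llama.py code):

```python
import torch

def decoder_layer_forward(te_layer, hidden_states, *args, **kwargs):
    # Older transformers versions may hand the previous layer's output over
    # as a tuple; unwrap it before calling .contiguous().
    if isinstance(hidden_states, tuple):
        hidden_states = hidden_states[0]
    out = te_layer(hidden_states.contiguous(), *args, **kwargs)
    # transformers >= 4.57 expects the decoder layer to return the tensor
    # directly rather than a one-element tuple.
    return out
```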

* docs(te_llama): add requirements.txt

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* fix(docs): add missing notebook output names

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

---------

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
…x 404 error (#2625)

* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename mnist dataset path

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add dataset manifest

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* jit bug fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FP8 scale support and fix alignment for grouped GEMM

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: code cleanup and NULL C support

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: per-matrix alpha/beta support

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
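
A pure-PyTorch sketch of the per-matrix scaling semantics described above, i.e. D_i = alpha[i] * (A_i @ B_i) + beta[i] * C_i (reference semantics only, not the cuBLAS-backed kernel):

```python
import torch

def grouped_gemm_reference(A_list, B_list, C_list, alpha, beta):
    # alpha/beta must have exactly one entry per matrix in the group.
    assert len(alpha) == len(A_list) and len(beta) == len(A_list)
    return [a * (A @ B) + b * C
            for A, B, C, a, b in zip(A_list, B_list, C_list, alpha, beta)]
```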

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix alpha/beta numel - use SimpleTensor::numel()

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor: move grouped GEMM to separate file and cleanup API

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/config.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changed

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactored hopper tensor selection

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
fix wheel

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* version change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Code drop: Update recipes documentation and remove custom recipes from low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix SVG css import path for diagrams

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix JAX memory usage .out files with correct output

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* responded to comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* applied suggestions from greptile

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* year change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* jax compute capability fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Support building with headers from nvidia wheels

There are two changes:
1. `import nvidia` returns a namespace package with `__file__` equal to `None`
2. Add a way to force headers from nvidia wheels via an environment variable. Without that envvar, it's practically impossible when CUDA is installed system-wide.

I successfully built the package with torch using the following `uv` configuration:
```
[tool.uv.extra-build-dependencies]
"transformer-engine-torch" = [
    "ninja",
    "nvidia-cuda-crt==13.0.88",
    "nvidia-cuda-cccl==13.0.85",
    { requirement = "torch", match-runtime = true },
    { requirement = "pytorch-triton", match-runtime = true },
    { requirement = "nvidia-cusolver", match-runtime = true },
    { requirement = "nvidia-curand", match-runtime = true },
    { requirement = "nvidia-cublas", match-runtime = true },
    { requirement = "nvidia-cusparse", match-runtime = true },
    { requirement = "nvidia-cudnn-cu13", match-runtime = true },
    { requirement = "nvidia-nvtx", match-runtime = true },
    { requirement = "nvidia-cuda-nvrtc", match-runtime = true },
    { requirement = "nvidia-cuda-runtime", match-runtime = true },
]
```

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
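
An illustrative sketch (not the exact setup logic) of why the namespace-package detail above matters: `nvidia.__file__` is `None`, so include directories from the installed nvidia-* wheels have to be discovered through `__path__`:

```python
import importlib
from pathlib import Path

def nvidia_wheel_include_dirs():
    nvidia = importlib.import_module("nvidia")  # namespace package, __file__ is None
    dirs = []
    for base in map(Path, nvidia.__path__):
        # Each installed nvidia-* wheel contributes a subpackage such as
        # nvidia/cudnn/include or nvidia/cublas/include.
        dirs.extend(str(p) for p in base.glob("*/include") if p.is_dir())
    return dirs
```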

* Apply suggestion from @ksivaman

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed scaling-factor computation for FP32 to match the reference implementation.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Uncommented the tuned kernel path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* year update in license

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* Rebased to main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the year to 2026

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added compilation guards

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added BWD pass

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added dbias and dact tests. Refactoring.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added grouped MXFP8 DACT and ACT API and tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed a typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes from the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed requirement for last dim from mod128 to mod32

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added alignment checks when tensor descriptors are modified

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
bucket max_b with more granularity when >512

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* expand troubleshooting docs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* add grad reduce api for cuda graph hook

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* fix code consistency

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Fix the compilation warnings for the PyTorch extension

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
roycho96 and others added 28 commits March 9, 2026 11:37
…uter backward kernels (#2745)

Remove redundant grad_logits zero-initialization in fused router backward kernels

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
…712)

Enable dequantization from just columnwise data

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…nput (#2746)

* Fix cross_entropy_forward stride guard for non-contiguous input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>

* Add regression test for non-contiguous transposed input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
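
An illustrative shape of the regression test mentioned above; `fused_cross_entropy` stands in for the actual Transformer Engine entry point exercised by the test:

```python
import torch

def check_non_contiguous_transposed(fused_cross_entropy):
    logits = torch.randn(128, 512, device="cuda")
    labels = torch.randint(0, 128, (512,), device="cuda")
    # A transposed view is non-contiguous (swapped strides), which previously
    # tripped the stride guard in cross_entropy_forward.
    transposed = logits.t()  # shape (512, 128)
    assert not transposed.is_contiguous()
    loss_view = fused_cross_entropy(transposed, labels)
    loss_copy = fused_cross_entropy(transposed.contiguous(), labels)
    torch.testing.assert_close(loss_view, loss_copy)
```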

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Implemented the kernel with split dbias

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed constraints on the last dimension

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added notes on group tensor restrictions into documentation

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed pointer

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed kernel grid size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache size limit

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Revert build.yml changes unrelated to deploy nightly docs

Restore .github/workflows/build.yml to upstream/main state.
Only deploy_nightly_docs.yml changes are relevant to this PR.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
… vmapped seg offsets (#2692)

* Fix the batcher for the case where the received segment ids are batched/vmapped while the TE-constructed segment positions are not, which caused mismatches in impl()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Fix the shape check for assert

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the batcher logic to check for q and kv seg ids separately

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Remove batcher logic to expand segment pos. Keep the shape check asserts.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for vmapped seg id and non vmapped seg pos when computing the seqlens and offsets for fused attn

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Undo batcher check logic for seg pos and seg ids as it is already moved to get_seqlens_and_offsets()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Remove unnecessary assert check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Code clean up

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Use partial instead of a single use function

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…t aligned (#2747)

* fallback

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* warn once

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [Debug] Pass tp_size to DebugQuantizer and use it in get_reduction_params

Use tp_size to determine whether tensor parallelism is active instead of
checking tp_group is None (which is ambiguous since None means world group
in torch.distributed). Also add tp_size to the backward-compat kwargs
filtering in call_feature so custom features without tp_size in their
inspect_tensor signature continue to work.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
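
A minimal sketch of the backward-compatible kwargs filtering described above: `tp_size` is only forwarded to features whose `inspect_tensor` signature accepts it (names follow the commit message; the actual debug-tools code differs):

```python
import inspect

def call_feature(feature, tensor, **kwargs):
    accepted = inspect.signature(feature.inspect_tensor).parameters
    # Drop kwargs (e.g. tp_size) that an older custom feature does not accept.
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return feature.inspect_tensor(tensor, **filtered)
```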

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add CPU offloading documentation

Documents the get_cpu_offload_context() API with examples for basic usage,
manual synchronization, and CUDA graphs integration. Adds new
other_optimizations section to the documentation structure.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
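
A minimal usage sketch of the documented API; the two-object return value and the per-layer pattern follow Transformer Engine's examples, while the parameter values here are placeholders and exact defaults may differ:

```python
import torch
import transformer_engine.pytorch as te

# Context manager plus synchronization function, as documented above.
offload_ctx, sync_function = te.get_cpu_offload_context(
    enabled=True,
    num_layers=2,     # layers whose activations are offloaded
    model_layers=4,   # total number of layers in the model
)

layers = [te.TransformerLayer(1024, 4096, 16) for _ in range(4)]
x = torch.randn(8, 32, 1024, device="cuda", requires_grad=True)
for layer in layers:
    with offload_ctx:
        x = layer(x)
    x = sync_function(x)  # commit the offload between layers
x.sum().backward()
```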

* Fix copyright year to 2026 in cpu_offloading.rst

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Add missing copyright headers to CPU offloading example files

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Improve cpu_offloading docs: legacy params note, manual sync clarifications

- Add comment about legacy parameters in function signature
- Clarify that num_layers is ignored (not forbidden) when manual_synchronization=True
- Document num_layers constraint: must be <= model_layers-2 for overlap
- Add note that offload_stream.synchronize() must precede release_activation_forward_gpu_memory()

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Remove incorrect note about offload_stream.synchronize()

release_activation_forward_gpu_memory() internally waits for offload
completion via CUDA events - explicit synchronize() is not required.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/features/other_optimizations/cpu_offloading/cpu_offloading.rst

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* update

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs: review fixes for cpu offloading documentation

- Export mark_not_offload and ManualOffloadSynchronizer from
  transformer_engine.pytorch and add them to API reference
- Fix condition 3 description (xi is needed as input to backward,
  not computed after it completes)
- Fix offload_weights docstring default value (False, not True)
- Fix legacy params comment to link to API reference
- Fix RST spacing around inline code in Fig. 3/4 captions
- Add note explaining retain_pinned_cpu_buffers history and
  pytorch#167507 fix landing in PyTorch 2.11
- Fix typo: seqeuences -> sequences
- Add trailing newline to other_optimizations/index.rst

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ling (#2741)

add guard at bisected jax version where lower is segfault

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint: remove unused lru_cache import and fix import order in helper.py

Signed-off-by: tdophung <tdophung@nvidia.com>

* Guard Triton tests against JAX < 0.8.0 using release version check

- Add version_utils.py with is_triton_extension_supported() checking JAX >= 0.8.0
  (release version, not dev snapshot) and TRITON_EXTENSION_MIN_JAX_VERSION constant
- Add pytest.mark.triton marker and conftest hook to skip marked tests on old JAX
- Add require_triton() for module-level skipping in test files
- Rewrite triton_extensions to use is_triton_extension_supported() instead of
  direct jaxlib dev-version comparison

Signed-off-by: tdophung <tdophung@nvidia.com>
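
A sketch of the release-version gate described above (assumed implementation; the real helper lives in transformer_engine/jax/version_utils.py):

```python
from packaging.version import Version
import jax

TRITON_EXTENSION_MIN_JAX_VERSION = "0.8.0"

def is_triton_extension_supported() -> bool:
    # PEP 440 ordering places "0.8.0.devN" below "0.8.0", so dev snapshots
    # of the minimum release do not satisfy the check.
    return Version(jax.__version__) >= Version(TRITON_EXTENSION_MIN_JAX_VERSION)
```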

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: allow_module_level, drop is_triton_extension_supported re-export, revert test.sh

- require_triton(): add allow_module_level=True to pytest.skip() so module-level
  calls on old JAX produce a proper skip instead of a collection failure
- Remove is_triton_extension_supported from triton_extensions/utils.py __all__:
  importing triton_extensions on JAX < 0.8.0 raises immediately, so re-exporting
  the check from there defeats its purpose; callers should import directly from
  transformer_engine.jax.version_utils
- Revert qa/L0_jax_lint/test.sh TE_PATH to /opt/transformerengine (local dev
  path was accidentally committed; pass TE_PATH= at invocation time instead)

Signed-off-by: tdophung <tdophung@nvidia.com>

* Address review: move version guard before gpu_triton import, fix __all__ and hardcoded version

- Move is_triton_extension_supported() guard before the gpu_triton import block
  with a comment clarifying the segfault is at dispatch time, not import time
- Remove _jax_version_meet_requirement from version_utils __all__ (private helper,
  not a public API; callers import it explicitly as needed)
- Use TRITON_EXTENSION_MIN_JAX_VERSION constant in conftest marker description
  instead of hardcoded '0.8.0'

Signed-off-by: tdophung <tdophung@nvidia.com>

* address more comments

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…(#2751)

* Support configurable number of philox rounds for SR during build

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Added better error messages

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Update transformer_engine/pytorch/distributed.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* First pass

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Cleaning the dtype usage in dequantize and distributed

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add fake_dtype to get_metadata

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix to make_like

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…e_function_fwd to fp32 (#2752)

* Fixed aval for intermediate results (softmaxed/sigmoided logits) passed as residuals to CompType, which is currently fp32. This prevents incorrect reading of this buffer when the logits dtype is not fp32

Signed-off-by: tdophung <tdophung@nvidia.com>

* address comments on inconsistency in style and NVTE_CHECK for fp32 type

Signed-off-by: tdophung <tdophung@nvidia.com>

* revert the remaining Comptype checking, address greptile suggestion

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Initial commit to pass scale as Tensor for multi_tensor_scale op

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Revert "Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled"

This reverts commit 74a9bccf0fadd4159f70d28da49a533ea7c76108.

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change noop_flag to is_infinite

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Update transformer_engine/pytorch/csrc/extensions.h

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove duplication

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add test for scale tensor cuda

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Random padding condition shouldn't be done for mxfp8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Accidental line removal added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…tensors. #2120" (#2673)

* Adds dst.dtype information in copy_ method of quantized tensors.

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>

* Update transformer_engine/pytorch/tensor/quantized_tensor.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>

* Update transformer_engine/pytorch/quantized_tensor.py

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix reference tensor copy

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Zhiyi Su <dantesuu@gmail.com>
Co-authored-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fuse scale + 0 + cumulative sum for splits to offsets calc

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
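
Reference semantics of the fused computation named above, written out in PyTorch (the actual change fuses these steps into a single kernel):

```python
import torch

def splits_to_offsets_reference(splits: torch.Tensor, scale: int) -> torch.Tensor:
    scaled = splits * scale
    zero = torch.zeros(1, dtype=scaled.dtype, device=scaled.device)
    # Offsets are the prefix sum of the scaled splits with a leading zero,
    # so offsets[i] marks where split i starts.
    return torch.cat([zero, scaled.cumsum(0)])
```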

* Add unit test and fix bug in kernel for >256 size

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix race

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* check for logical_last_dim > 0

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* suggestions

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Added new people to CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removing duplicate

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
… parallelism (#2688)

* Error out if constructing LayerNormLinear with row tensor parallelism

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable Userbuffers test for row-TP LayerNormLinear

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Enable cgemm + FP8 tests

* Implement CGEMM + MXFP8

* added size check for mxfp8

* added tols for assertions

* update tests with recipes

* enable tests + is_quantize_recipe_supported

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…kScaling and Float8BlockScaling quantized model init. (#2753)

* Updates FusedAdam with FSDP2 and MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes xfailing unit test for MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses comments related to reset parameters and guard against self.capturable

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e unit test

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds test to non meta device init

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* attempts to add float8block scaling fsdp hooks

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e test for Float8BlockScaling

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses review comments and code cleanup

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* more review comments addressed

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes unused block_len param

Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>

* fixes failing unit test because we still need to xfail nvfp4 dcp

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint - replacing todo with note

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* fix for async dcp checkpointing

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* Apply suggestions from code review

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>

* Update transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address Greptile review feedback: defensive guards for edge cases

- Add _quantizer None guard in new_empty dispatch
- Replace self.is_cpu with explicit _data/_transpose checks in __reduce_ex__
- Make get_metadata() safe for cleared tensors (both _data and _transpose None)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
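
A hedged sketch of the defensive-guard style described above, reusing the attribute names from the commit message but with simplified stand-in logic (not the actual quantized tensor / `float8_tensor_storage.py` code):

```python
import torch

class QuantizedTensorGuardSketch:
    """Illustrative only: shows the None-guard pattern, not TE's real class."""

    def __init__(self, data=None, transpose=None, quantizer=None):
        self._data = data
        self._transpose = transpose
        self._quantizer = quantizer

    def new_empty(self, shape):
        # Guard: without a quantizer, fall back to a plain empty tensor
        # instead of dereferencing None.
        if self._quantizer is None:
            return torch.empty(shape)
        return self._quantizer(torch.empty(shape))  # hypothetical quantize call

    def is_cleared(self):
        # Explicit _data/_transpose checks instead of relying on a device flag.
        return self._data is None and self._transpose is None

    def get_metadata(self):
        # Safe for cleared tensors: return an empty dict rather than raising.
        if self.is_cleared():
            return {}
        return {
            "has_data": self._data is not None,
            "has_transpose": self._transpose is not None,
        }
```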

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ped Tensor Swizzling (#2669)

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove changes not needed for bf16

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* keep only pytorch binding for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* linting error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add fast accumulator support

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix the test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Random padding condition shouldn't be applied for MXFP8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* all changes for grouped gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Accidentally removed line added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with a moving-pointer pattern (see the sketch after this commit message)
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
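
One item in the list above, the moving-pointer workspace refactor, is a common allocation pattern: carve several sub-buffers out of one pre-allocated workspace by advancing a single offset. A language-neutral Python sketch of the idea (the real change lives in `cublaslt_grouped_gemm.cu` and works with device pointers, so the names and sizes here are illustrative):

```python
def carve_workspace(workspace: bytearray, sizes):
    """Hand out consecutive sub-buffers from one pre-allocated workspace.

    Rather than computing each buffer's offset by hand, a single offset
    (the "moving pointer") advances by the size of every buffer already
    handed out, so adding or reordering buffers cannot desynchronize the
    offset arithmetic.
    """
    views, offset = [], 0
    for size in sizes:
        if offset + size > len(workspace):
            raise ValueError("workspace too small for requested buffers")
        views.append(memoryview(workspace)[offset:offset + size])
        offset += size  # advance the moving pointer
    return views

# Example: three sub-buffers carved from a single 1 KiB workspace.
ws = bytearray(1024)
scales_a, scales_b, setup = carve_workspace(ws, [256, 512, 128])
```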

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed merge conflict handling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor change

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* forgot to add an `or`

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve merge conflicts

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address minor review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove unnecessary code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* one line that broke everything :(

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert caching changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix minor bug from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@ipanfilo ipanfilo requested review from Micky774 and aris134 April 21, 2026 21:17