Integrate AITER fused RoPE kernels with fallback to TE native#541
Open
Conversation
Force-pushed 2b8c7d4 to 085a9c1
Add optional AITER RoPE dispatch path in FusedRoPEFunc for improved performance on ROCm/AMD GPUs. When aiter is installed and the input meets the supported subset (sbhd format, non-interleaved, no context parallelism, no packed sequences, no start_positions), the forward and backward passes dispatch to aiter.ops.rope.rope_fwd / rope_bwd. Fallback to the existing tex.fused_rope_forward / tex.fused_rope_backward is automatic for all other configurations and when AITER is not available. A new env var NVTE_USE_AITER_ROPE (default "1") allows explicit opt-out. The AITER import is gated behind IS_HIP_EXTENSION to avoid unnecessary import attempts on CUDA systems.

Add unit tests for AITER-vs-TE numerical parity, guard logic coverage, env var disable behavior, and fallback on unsupported configurations. Tested in MLPerf GPT-OSS-20B MoE pretraining on MI355X (8×GPU).

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
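A minimal sketch of the guard-and-dispatch flow this commit describes. The module path aiter.ops.rope, the names rope_fwd/rope_bwd, the env var, and the guard conditions come from the commit text; the argument list of `_can_use_aiter` and all other details are illustrative assumptions, not the PR's actual diff.

```python
# Illustrative sketch, not the PR diff; aiter call signatures are assumptions.
import os

try:
    from torch.utils.cpp_extension import IS_HIP_EXTENSION
except ImportError:
    IS_HIP_EXTENSION = False  # no torch at all, so certainly no HIP

_HAVE_AITER_ROPE = False
if IS_HIP_EXTENSION and os.getenv("NVTE_USE_AITER_ROPE", "1") == "1":
    try:
        # pylint: disable-next=import-error
        from aiter.ops.rope import rope_bwd, rope_fwd
        _HAVE_AITER_ROPE = True
    except Exception:
        _HAVE_AITER_ROPE = False  # fall back to tex.fused_rope_* kernels


def _can_use_aiter(tensor_format, interleaved, cp_size, cu_seqlens, start_positions):
    """Restrict AITER dispatch to the subset the commit message lists."""
    return (
        _HAVE_AITER_ROPE
        and tensor_format == "sbhd"   # no thd format
        and not interleaved
        and cp_size == 1              # no context parallelism
        and cu_seqlens is None        # no packed sequences
        and start_positions is None
    )
```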
Force-pushed 085a9c1 to 8b05a6a
ipanfilo requested changes (Apr 18, 2026)
Quoted diff from rope.py:

    if IS_HIP_EXTENSION:
        try:
            from aiter.ops.rope import (  # pylint: disable=import-error
Collaborator: Is there any AITER versioning that can be used to constrain use of the API?
- Add AMD copyright header to rope.py
- Check IS_HIP_EXTENSION first; guard all AITER code behind it
- Use logger.warning for AITER import failures instead of logger.info
- Log AITER version (via aiter._version) on successful import
- Default NVTE_USE_AITER_ROPE to "0" (opt-in) since CI cannot test it
- Expose _HAVE_AITER_ROPE via FusedRoPEFunc.has_aiter_rope() method
- Use @pytest.mark.skipif decorator instead of inline pytest.skip()

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
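A hedged sketch of what the gate might look like after this commit, which also addresses the versioning question above by logging the installed AITER version. Only the attribute name aiter._version, the opt-in default, and the has_aiter_rope() accessor come from the commit message; logger wiring and message wording are assumed.

```python
# Hedged sketch of the opt-in gate after this commit; not the actual diff.
import logging
import os

from torch.utils.cpp_extension import IS_HIP_EXTENSION

logger = logging.getLogger(__name__)

_HAVE_AITER_ROPE = False
if IS_HIP_EXTENSION and os.getenv("NVTE_USE_AITER_ROPE", "0") == "1":  # opt-in
    try:
        import aiter  # pylint: disable=import-error
        _HAVE_AITER_ROPE = True
        # Logging the version lets users correlate failures with the AITER
        # API they actually have installed.
        logger.info("aiter version: %s", getattr(aiter, "_version", "unknown"))
    except Exception as err:
        logger.warning("AITER RoPE import failed: %s", err)  # was logger.info


class FusedRoPEFunc:
    """Trimmed to the accessor this commit adds."""

    @staticmethod
    def has_aiter_rope() -> bool:
        return _HAVE_AITER_ROPE
```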
Provides a containerized way to test the AITER fused RoPE integration on ROCm systems, since CI cannot test this feature.

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
Micky774 requested changes (Apr 21, 2026)
Local testing infrastructure, not intended for the repository.

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
Follow existing convention: add AMD copyright above the NVIDIA header with a "modified for portability to AMDGPU" note, rather than replacing it.

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
Micky774 requested changes (Apr 23, 2026)
- Replace logger.warning with a RuntimeError when NVTE_USE_AITER_ROPE=1 but the AITER import fails, making the failure explicit instead of silently falling back to TE native kernels
- Remove all diagnostic logging (version info, reason tracking) to reduce maintenance burden and stay synchronized with upstream

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
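A sketch of the fail-fast behavior this commit describes, assuming the check still lives at module import time; the error message wording is invented for illustration.

```python
# Sketch of the explicit-failure path; message text is illustrative.
import os

from torch.utils.cpp_extension import IS_HIP_EXTENSION

_HAVE_AITER_ROPE = False
if IS_HIP_EXTENSION and os.getenv("NVTE_USE_AITER_ROPE", "0") == "1":
    try:
        from aiter.ops.rope import rope_bwd, rope_fwd  # pylint: disable=import-error
        _HAVE_AITER_ROPE = True
    except Exception as err:
        # The user explicitly opted in, so do not silently fall back to
        # tex.fused_rope_*: raise and make the broken setup visible.
        raise RuntimeError(
            "NVTE_USE_AITER_ROPE=1 but AITER RoPE kernels could not be imported"
        ) from err
```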
Micky774 reviewed (Apr 23, 2026)
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Micky774 approved these changes (Apr 23, 2026)
wangye805 requested changes (Apr 23, 2026)
- Guard `import os` and env var check under IS_HIP_EXTENSION in rope.py to minimize upstream diff
- Add AMD copyright header to test_fused_rope.py
- Guard `unittest.mock` and `FusedRoPEFunc` imports behind IS_HIP_EXTENSION
- Add IS_HIP_EXTENSION skipif guard to all AITER test functions
- Use torch.device("cuda") instead of hardcoding cuda:0 in AITER test

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
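A sketch of the test-side guards this commit lists, assuming pytest; the commented-out FusedRoPEFunc import is a placeholder, not the repository's real module path.

```python
# Sketch of the guarded test module; import paths are placeholders.
import pytest
import torch
from torch.utils.cpp_extension import IS_HIP_EXTENSION  # repo convention: no try/except

if IS_HIP_EXTENSION:
    # Imported only on ROCm, so CUDA CI never evaluates the AITER-only code.
    from unittest import mock  # noqa: F401
    # Placeholder: the PR imports FusedRoPEFunc from TE's rope module here.
    # from transformer_engine... import FusedRoPEFunc


@pytest.mark.skipif(not IS_HIP_EXTENSION, reason="AITER RoPE is ROCm-only")
def test_aiter_rope_device_selection():
    # torch.device("cuda") rather than "cuda:0", so the test respects the
    # runner's current device on multi-GPU machines.
    device = torch.device("cuda")
    t = torch.randn(2, 1, 4, 64, device=device)
    assert t.is_cuda
```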
wangye805 approved these changes (Apr 23, 2026)
ipanfilo reviewed (Apr 23, 2026)
Quoted diff from test_fused_rope.py:

        apply_fused_qkv_rotary_pos_emb,
    )

    try:
Collaborator: Please refer to other modules; importing IS_HIP_EXTENSION does not require try/except.
Follow repo convention: import IS_HIP_EXTENSION directly from torch.utils.cpp_extension without a try/except guard, consistent with all other test modules.

Signed-off-by: Su Ann Chong <suachong@amd.com>
Made-with: Cursor
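For reference, the convention the commit adopts: IS_HIP_EXTENSION is defined unconditionally by any torch build, so the bare import below never needs a try/except.

```python
# Bare import, per repo convention: available in every torch build.
from torch.utils.cpp_extension import IS_HIP_EXTENSION
```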
Micky774 reviewed (Apr 24, 2026), commenting on lines +16 to +19:
    try:
        from torch.utils.cpp_extension import IS_HIP_EXTENSION
    except ImportError:
        IS_HIP_EXTENSION = False
Description
Integrate AITER's optimized HIP/ASM RoPE kernels into TE's FusedRoPEFunc on ROCm. When aiter is installed and the input meets the supported subset (sbhd format, non-interleaved, no context parallelism, no packed sequences, no start_positions), the forward and backward passes dispatch to aiter.ops.rope.rope_fwd / rope_bwd for improved performance on AMD GPUs. Fallback to the existing tex.fused_rope_forward / tex.fused_rope_backward is automatic for all other configurations and when AITER is not available. A new env var NVTE_USE_AITER_ROPE (default "1") allows explicit opt-out. The AITER import is gated behind IS_HIP_EXTENSION to avoid unnecessary import attempts on CUDA systems.

Tested in MLPerf GPT-OSS-20B MoE pretraining on MI355X (8×GPU).
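For illustration, one way a user could opt out at runtime, assuming (as the module-level guards sketched above imply) the env var is read when transformer_engine is first imported:

```python
# Assumption: NVTE_USE_AITER_ROPE is evaluated at import time, so it must be
# set before transformer_engine is imported.
import os

os.environ["NVTE_USE_AITER_ROPE"] = "0"  # force the TE native fused RoPE path
import transformer_engine.pytorch  # noqa: E402,F401
```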
Changes

- Guarded AITER import (broad `except Exception`) and IS_HIP_EXTENSION gate
- NVTE_USE_AITER_ROPE env var (default "1") to allow opt-out
- FusedRoPEFunc._can_use_aiter() guard that restricts AITER dispatch to sbhd format, non-interleaved, no CP, no THD, no start_positions
- Dispatch to aiter.ops.rope.rope_fwd/rope_bwd in forward/backward when the guard passes; fall back to tex.fused_rope_* otherwise
- test_aiter_rope_matches_te_fused: verifies AITER and TE fused produce identical output and gradients (parametrized over dtype, seq_length, hidden_size, rotary_percent, loss_func); a parity-check sketch follows this list
- test_aiter_rope_can_use_guard: exhaustive unit test of guard logic (6 parametrized cases)
- test_aiter_rope_env_var_disable: verifies _HAVE_AITER_ROPE=False disables dispatch
- test_aiter_rope_fallback_unsupported: verifies unsupported configs fall back correctly
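A hedged sketch of the kind of numerical-parity check test_aiter_rope_matches_te_fused is described as performing. The reference rope_ref below is a standard non-interleaved sbhd RoPE written for this sketch; in the real test the two functions compared would be the AITER-dispatched and TE-native FusedRoPEFunc paths, not rope_ref against itself.

```python
import torch


def rope_ref(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Reference non-interleaved sbhd RoPE: t is [s, b, h, d], freqs is
    [s, 1, 1, rot_dim]; channels past rot_dim pass through unchanged."""
    rot_dim = freqs.shape[-1]
    t_rot, t_pass = t[..., :rot_dim], t[..., rot_dim:]
    cos, sin = freqs.cos().to(t.dtype), freqs.sin().to(t.dtype)
    x1, x2 = t_rot.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)  # rotate-half
    return torch.cat((t_rot * cos + rotated * sin, t_pass), dim=-1)


def assert_fwd_bwd_match(fn_a, fn_b, t, freqs, atol=1e-3, rtol=0.0):
    """Forward and gradient parity harness, as the test description implies."""
    ta = t.detach().clone().requires_grad_(True)
    tb = t.detach().clone().requires_grad_(True)
    out_a, out_b = fn_a(ta, freqs), fn_b(tb, freqs)
    torch.testing.assert_close(out_a, out_b, atol=atol, rtol=rtol)
    out_a.sum().backward()
    out_b.sum().backward()
    torch.testing.assert_close(ta.grad, tb.grad, atol=atol, rtol=rtol)


# Self-check of the harness itself (CPU, no AITER/TE required):
if __name__ == "__main__":
    s, b, h, d = 16, 2, 4, 64
    inv_freq = 1.0 / 10000 ** (torch.arange(0, d, 2) / d)
    freqs = torch.outer(torch.arange(s, dtype=torch.float32), inv_freq)
    freqs = torch.cat((freqs, freqs), dim=-1).view(s, 1, 1, d)
    assert_fwd_bwd_match(rope_ref, rope_ref, torch.randn(s, b, h, d), freqs)
```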
except Exception) andIS_HIP_EXTENSIONgateNVTE_USE_AITER_ROPEenv var (default"1") to allow opt-outFusedRoPEFunc._can_use_aiter()guard that restricts AITER dispatch to sbhd format, non-interleaved, no CP, no THD, no start_positionsaiter.ops.rope.rope_fwd/rope_bwdin forward/backward when guard passes; fall back totex.fused_rope_*otherwisetest_aiter_rope_matches_te_fused: verifies AITER and TE fused produce identical output and gradients (parametrized over dtype, seq_length, hidden_size, rotary_percent, loss_func)test_aiter_rope_can_use_guard: exhaustive unit test of guard logic (6 parametrized cases)test_aiter_rope_env_var_disable: verifies_HAVE_AITER_ROPE=Falsedisables dispatchtest_aiter_rope_fallback_unsupported: verifies unsupported configs fall back correctlyChecklist: