
Added initial AI Agent instructions and skills #448

Open
Micky774 wants to merge 7 commits into dev from zain/gh-copilot-instructions

Conversation

@Micky774
Contributor

@Micky774 Micky774 commented Feb 12, 2026

Description

Includes an initial addition of repository-level AI agent instructions/context via CLAUDE.md as well as example skills via .claude/**/SKILL.md. This mainly serves as a demonstration of how to add additional context to AI coding agents, as well as how to develop a reasonably-complex skill.

TODO: Back-test against old cases and refine as needed

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds CLAUDE.md
  • Adds .claude/ck-debugging/SKILL.md
  • Adds .claude/ifu-merge/SKILL.md

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Micky774 Micky774 force-pushed the zain/gh-copilot-instructions branch 7 times, most recently from c41a2c5 to fce9ca5 Compare March 4, 2026 22:56
@Micky774 Micky774 force-pushed the zain/gh-copilot-instructions branch from fce9ca5 to 8a8ea81 Compare March 5, 2026 18:53
Comment thread .claude/skills/ifu-merge/SKILL.md Outdated
---
name: ifu-merge
description: >
Guide for performing IFU (Internal Feature Update) merges on the TransformerEngine ROCm fork.
Collaborator

IFU stands for integrate from upstream

Contributor Author

Updated

Collaborator

Not updated yet :-), did you forget to commit your changes?

- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.

**Rules that follow:**
- Never edit `*_hip.cpp` or `.hip` files — they are regenerated from source files
Collaborator

We have one exception of .hip file in the repo. Maybe we can rename it for consistency

Contributor Author

What's the exception? Renaming it would probably be best.

Collaborator

`transformer_engine/common/rocshmem_api/rocshmem_waitkernel.hip` — this file is excluded from hipification. I agree, for consistency with the rest of the code we can rename it to `*.cpp` (not to `*.cu`, because we hipify all `*.cu`) @alextmagro

| Layer | Guard | Example uses |
|---|---|---|
| PyTorch CSRC (`.cpp` source files) | `#ifdef USE_ROCM` / `#ifndef USE_ROCM` | DeviceGuard, scale swizzling |
| Common layer (`.cu` files that get hipified) | `#ifdef __HIP_PLATFORM_AMD__` | Warp masks, kernel dispatch |
| Python code | `IS_HIP_EXTENSION` (from `torch.utils.cpp_extension`) | Workspace sizing, feature flags |
Collaborator

Also guard for JAX Python code

Contributor Author

Updated

git diff <rocm-parent>..<upstream-parent> --stat

# Check for removed guards
git diff <rocm-parent>..<upstream-parent> -- <file> | grep -E "^-.*(__HIP_PLATFORM_AMD__|USE_ROCM|IS_HIP_EXTENSION)"
Collaborator

I would also add "ROCm" and "upstream" - those are comments that indicate some changes made by us

Contributor Author

Updated


5. **Convention Changes**: Upstream changes a data format, tensor shape, or API contract without any code conflict. Every downstream consumer of that convention must be updated manually — the compiler won't catch these.

**How to systematically audit:**
Collaborator

Should "what to pay attention to" points be here?
We have two big semantic differences:
- `__shfl` vs `__shfl_sync` and other lane-communication built-ins
- fp8 data types: i.e. `torch.float8_e4m3fn` vs `get_torch_e4m3_type`, etc.

Contributor Author

@Micky774 Micky774 Mar 31, 2026

I think that's something that is well-structured and systematic enough that it is "obvious" to Claude when working with the relevant files. Maybe it won't pick up the exact semantics, but it would pick up its functionality and consequences.

Collaborator

> Should "what to pay attention to" points be here? We have two big semantic differences: __shfl vs __shfl_sync and other lane communication built-ins; fp8 data types: i.e. torch.float8_e4m3fn vs get_torch_e4m3_type, etc.

@ipanfilo I feel these are not just for IFUs. Maybe they should be put in CLAUDE.md

Collaborator

It may definitely be a generic rule of thumb, not IFU-only. I do not think this is obvious enough to always follow, especially the fp8 semantics; in many cases compile-time constants cannot be used and runtime detection is required.

@Micky774 Micky774 changed the title Added initial GH Copilot instructions Added initial AI Agent instructions and skills Mar 27, 2026
@Micky774 Micky774 requested a review from ipanfilo March 31, 2026 18:17
Comment thread .claude/skills/ck-debugging/SKILL.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md
@ipanfilo
Collaborator

I have a general comment. Because different agents use approximately the same skills but different default locations, I asked some of them, and the recommendation is to keep skills in some agent-agnostic location like `.ai-skills` and make symlink(s) to what is actually used on the client side.

@Micky774
Contributor Author

> I have general comment. Because different agents use approximately the same skill but different default location, I asked some of them and the recommendation is to keep skills in some agent agnostic location like .ai-skills and make symlink(-s) to what is actually used on client side

I was just chatting with Wen about this the other day but I think initially we're going to try to focus on supporting Claude Code since many other frameworks explicitly support claude's file locations, and because instruction optimization does genuinely differ framework-to-framework so narrowing down on one will help optimize for reliable and powerful workflows.

@VeeraRajasekhar has also been looking at alternative ways to keep things generic, and has found a project that aims to automatically configure per-framework skill installs as you have mentioned, so we can also explore that in the future too.

Plus, as we migrate to a plugin-based approach, that will make it a bit easier for folks to generically install across frameworks.

Collaborator

@wangye805 wangye805 left a comment

Overall looks good to me. But generally, how do we use those mds? Do I need to import something when I start Claude inside a docker container?

Comment thread .claude/skills/ck-fused-attention-debugging/SKILL.md
Comment on lines +130 to +141
The function in `fused_attn_ck.cpp:23-152` applies these filters in order. When CK is rejected, `NVTE_LOG_CK_CONFIG=1` prints the reason. The filters are:

1. **GQA groups**: `num_gqa_groups > 0` and `num_attn_heads % num_gqa_groups == 0`
2. **Data type**: `q_dtype == kv_dtype` and both are fp16 or bf16 (no fp8)
3. **Bias type**: only `NO_BIAS`, `ALIBI`, or `POST_SCALE_BIAS` (no `PRE_SCALE_BIAS`)
4. **Head dim**: `head_dim_qk < 512` and `head_dim_v < 512`
5. **Causal + window**: if causal mask, window must be `(-1, 0)` or `(>=0, 0)`
6. **No mask + window**: if no mask, window must be `(-1, -1)` or `(>=0, >=0)`
7. **QKV packed + GQA**: MQA/GQA cannot use qkvpacked layouts (`3HD`, `H3D`)
8. **QKV packed + seqlen**: qkvpacked requires `s_q == s_kv`
9. **THD + padding**: ragged (THD) format requires a padding mask type
10. **Padding + bias**: padding mask cannot combine with `POST_SCALE_BIAS` or `ALIBI`
Collaborator

This detailed info is good, but it may change as our code base evolves. I was thinking: is it possible to give Claude the commit hash at which this info is valid, and tell Claude to trace the changes afterwards? This way, Claude won't be confused.

Contributor Author

Good point, I've added a mention of the hash at which the info is accurate, and a note that it is subject to change. The tracing is something Claude should be able to reliably perform itself, given the inclusion of the code source.

- `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure.

### From HIP runtime
- `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access.
Collaborator

Perhaps point to a URL with the `hipError_t` definition so that Claude can have a better understanding?

Contributor Author

Do you have a good source that I can include?

Collaborator

The best, I think, is https://rocm.docs.amd.com/projects/HIP/en/develop/reference/error_codes.html#hip-error-codes, or maybe give it a reference to the HIP error enum.

```
5. Key argument mappings:
- `-iperm=1 -operm=1` → BSHD layout (TE default)
- `-iperm=0 -operm=0` → SBHD layout
```
Collaborator

If I recall correctly, it's BHSD for iperm/operm=0

Contributor Author

From the cpp benchmark program argument description:

> if true, will be b*h*s*d, else b*s*h*d

I've adjusted the skill to reflect this.


- ROCm-specific device behavior (e.g., tensor device masquerading)

**What hipify preserves faithfully:**
- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.
Collaborator

I think hipify is not smart enough to recognize those macros (USE_ROCM or __HIP_PLATFORM_AMD__); it just does simple search and replace. Probably it's just lucky that those guarded sections do not change during hipification.

Contributor Author

I think we can leave this as-is, since it's still true in practice.

Collaborator

We can leave as-is or even shrink, relying on hipify section in CLAUDE.md

Comment on lines +151 to +152
1. Basic module import — catches missing symbols, broken dynamic linking
2. Core operations (GEMM, normalization) — catches API mismatches, incorrect workspace sizing
Collaborator

We don't need to have the PyTorch/JAX conflicts resolved before building and testing the common dir.

Contributor Author

Are you referring to the cpp tests here?

Comment thread CLAUDE.md
When writing or updating memories in the project memory directory, follow these guidelines:

- **Scope**: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
- **Check before writing**: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
Collaborator

Where is the `MEMORY.md`?

Contributor Author

This is a user-local file that is an index to memory files left by Claude. This isn't something that is stored in the project.

@Micky774
Contributor Author

> Overall looks good to me. But generally how to use those mds? Do I need to import something when I start claude inside a docker container?

No. If you open Claude Code in the TE project repository (at TE root) then it will automatically pick up the files, and will automatically parse/use the skills if the context of the conversation ends up matching the YAML frontmatter descriptions. The CLAUDE.md is included in the initial context of ALL sessions started in the project.

@VeeraRajasekhar
Contributor

@Micky774 https://github.com/ROCm/amd-claude-marketplace/tree/main?tab=readme-ov-file#auto-register-the-marketplace-in-your-teams-repository

This is working as expected for internal users. For external users, it won't register this as a known marketplace since Claude cannot access it; if they try to /reload-plugins, it simply throws a "plugin installation failed" message without causing any issues or breakage in Claude. So it should be safe to proceed with adding this.

And once the changes to this PR are completed, you can update the files in this repo with these updated skill files.






Comment thread CLAUDE.md
- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.

## Hipify convention
The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit generated files directly** — edit the CUDA source instead.
Collaborator

If some file does not contain CUDA but only HIP code, and it does not include headers containing CUDA code, such a file can be excluded from hipification. It can be done in two ways: explicitly add it to the ignores list in `do_hipify()` in `build_tools/hipify/hipify.py`, which is useful for subdirectories containing HIP-only code; or rely on HIPIFY to detect that file modification is not needed. In the latter case the file should have `#include "hip/hip_runtime.h"`, either a real one or commented out if the header is not really needed.
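As a sketch of the second mechanism (hypothetical file name, not from the repo), the marker include looks like this:

```cpp
// hip_only_kernels.cpp -- hypothetical HIP-only source file.
// HIPIFY sees the hip_runtime.h include near the top and detects
// that no modification is needed, so the file is left untouched.
// If the header is not actually needed, it can stay commented out:
// #include "hip/hip_runtime.h"

// ... HIP-only kernel code follows ...
```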

Comment thread CLAUDE.md
## Code conventions
- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
- Keep env-var behavior stable; tests toggle flags intentionally.
- Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
Collaborator

and pylintrc
