
Added initial AI Agent instructions and skills #448

Open
Micky774 wants to merge 7 commits into dev from zain/gh-copilot-instructions

Conversation

@Micky774
Contributor

@Micky774 Micky774 commented Feb 12, 2026

Description

Includes an initial addition of repository-level AI agent instructions/context via CLAUDE.md as well as example skills via .claude/**/SKILL.md. This mainly serves as a demonstration of how to add additional context to AI coding agents, as well as how to develop a reasonably-complex skill.

TODO: Back-test against old cases and refine as needed

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds CLAUDE.md
  • Adds .claude/ck-debugging/SKILL.md
  • Adds .claude/ifu-merge/SKILL.md

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Micky774 Micky774 force-pushed the zain/gh-copilot-instructions branch 7 times, most recently from c41a2c5 to fce9ca5 Compare March 4, 2026 22:56
@Micky774 Micky774 force-pushed the zain/gh-copilot-instructions branch from fce9ca5 to 8a8ea81 Compare March 5, 2026 18:53
Comment thread .claude/skills/ifu-merge/SKILL.md Outdated
---
name: ifu-merge
description: >
Guide for performing IFU (Internal Feature Update) merges on the TransformerEngine ROCm fork.
Collaborator

IFU stands for integrate from upstream

Contributor Author

Updated

Collaborator

Not updated yet :-), did you forget to commit your changes?

- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.

**Rules that follow:**
- Never edit `*_hip.cpp` or `.hip` files — they are regenerated from source files
Collaborator

We have one exception of .hip file in the repo. Maybe we can rename it for consistency

Contributor Author

What's the exception? Renaming it would probably be best.

Collaborator

`transformer_engine/common/rocshmem_api/rocshmem_waitkernel.hip` — this file is excluded from hipification. I agree, for consistency with the rest of the code we can rename it to `*.cpp` (not to `*.cu`, because we hipify all `*.cu`) @alextmagro

| Layer | Guard | Example uses |
|---|---|---|
| PyTorch CSRC (`.cpp` source files) | `#ifdef USE_ROCM` / `#ifndef USE_ROCM` | DeviceGuard, scale swizzling |
| Common layer (`.cu` files that get hipified) | `#ifdef __HIP_PLATFORM_AMD__` | Warp masks, kernel dispatch |
| Python code | `IS_HIP_EXTENSION` (from `torch.utils.cpp_extension`) | Workspace sizing, feature flags |
Collaborator

Also guard for JAX Python code

Contributor Author

Updated

git diff <rocm-parent>..<upstream-parent> --stat

# Check for removed guards
git diff <rocm-parent>..<upstream-parent> -- <file> | grep -E "^-.*(__HIP_PLATFORM_AMD__|USE_ROCM|IS_HIP_EXTENSION)"
Collaborator

I would also add "ROCm" and "upstream" - those are comments that indicate some changes made by us

Contributor Author

Updated


5. **Convention Changes**: Upstream changes a data format, tensor shape, or API contract without any code conflict. Every downstream consumer of that convention must be updated manually — the compiler won't catch these.

**How to systematically audit:**
Collaborator

Should "what to pay attention to" points be here?
We have two big semantic differences:
- `__shfl` vs `__shfl_sync` and other lane-communication built-ins
- fp8 data types: i.e. `torch.float8_e4m3fn` vs `get_torch_e4m3_type`, etc.

Contributor Author

@Micky774 Micky774 Mar 31, 2026

I think that's something that is well-structured and systematic enough that it is "obvious" to Claude when working with the relevant files. Maybe it won't pick up the exact semantics, but it would pick up its functionality and consequences.

Collaborator

> Should "what to pay attention to" points be here? We have two big semantic differences: __shfl vs __shfl_sync and other lane communication built-ins; fp8 data types: i.e. torch.float8_e4m3fn vs get_torch_e4m3_type, etc.

@ipanfilo I feel these are not just for IFUs. Maybe they should be put in CLAUDE.md

Collaborator

It may definitely be a generic rule of thumb, not IFU-only. I do not think this is obvious enough to always follow, especially the fp8 semantics; in many cases compile-time constants cannot be used and runtime detection is required.

@Micky774 Micky774 changed the title Added initial GH Copilot instructions Added initial AI Agent instructions and skills Mar 27, 2026
@Micky774 Micky774 requested a review from ipanfilo March 31, 2026 18:17
Comment thread .claude/skills/ck-debugging/SKILL.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md
@ipanfilo
Collaborator

I have a general comment. Because different agents use approximately the same skills but different default locations, I asked some of them, and the recommendation is to keep skills in some agent-agnostic location like `.ai-skills` and make symlink(s) to what is actually used on the client side.

@Micky774
Contributor Author

> I have general comment. Because different agents use approximately the same skill but different default location, I asked some of them and the recommendation is to keep skills in some agent agnostic location like .ai-skills and make symlink(-s) to what is actually used on client side

I was just chatting with Wen about this the other day but I think initially we're going to try to focus on supporting Claude Code since many other frameworks explicitly support claude's file locations, and because instruction optimization does genuinely differ framework-to-framework so narrowing down on one will help optimize for reliable and powerful workflows.

@VeeraRajasekhar has also been looking at alternative ways to keep things generic, and has found a project that aims to automatically configure per-framework skill installs as you have mentioned, so we can also explore that in the future too.

Plus, as we migrate to a plugin-based approach, that will make it a bit easier for folks to generically install across frameworks.

Collaborator

@wangye805 wangye805 left a comment

Overall looks good to me. But generally, how do we use those mds? Do I need to import something when I start Claude inside a docker container?

Comment thread .claude/skills/ck-fused-attention-debugging/SKILL.md
Comment on lines +130 to +141
The function in `fused_attn_ck.cpp:23-152` applies these filters in order. When CK is rejected, `NVTE_LOG_CK_CONFIG=1` prints the reason. The filters are:

1. **GQA groups**: `num_gqa_groups > 0` and `num_attn_heads % num_gqa_groups == 0`
2. **Data type**: `q_dtype == kv_dtype` and both are fp16 or bf16 (no fp8)
3. **Bias type**: only `NO_BIAS`, `ALIBI`, or `POST_SCALE_BIAS` (no `PRE_SCALE_BIAS`)
4. **Head dim**: `head_dim_qk < 512` and `head_dim_v < 512`
5. **Causal + window**: if causal mask, window must be `(-1, 0)` or `(>=0, 0)`
6. **No mask + window**: if no mask, window must be `(-1, -1)` or `(>=0, >=0)`
7. **QKV packed + GQA**: MQA/GQA cannot use qkvpacked layouts (`3HD`, `H3D`)
8. **QKV packed + seqlen**: qkvpacked requires `s_q == s_kv`
9. **THD + padding**: ragged (THD) format requires a padding mask type
10. **Padding + bias**: padding mask cannot combine with `POST_SCALE_BIAS` or `ALIBI`
Collaborator

This detailed info is good, but it may change as our code base evolves. I was thinking: is it possible to give Claude the commit hash at which this info is valid, and tell Claude to trace the changes afterwards? This way, Claude won't be confused.

Contributor Author

Good point, I've added a mention of the hash at which the info is accurate, and a note that it is subject to change. The tracing is something Claude should be able to reliably perform itself, given the inclusion of the code source.

- `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure.

### From HIP runtime
- `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access.
Collaborator

Perhaps point to a URL with the `hipError_t` definition so that Claude can have a better understanding?

Contributor Author

Do you have a good source that I can include?

Collaborator

The best, I think, is https://rocm.docs.amd.com/projects/HIP/en/develop/reference/error_codes.html#hip-error-codes, or maybe give it a reference to the HIP error enum.

```
5. Key argument mappings:
- `-iperm=1 -operm=1` → BSHD layout (TE default)
- `-iperm=0 -operm=0` → SBHD layout
```
Collaborator

If I recall correctly, it's BHSD for iperm/operm=0

Contributor Author

From the cpp benchmark program argument description:

> if true, will be b*h*s*d, else b*s*h*d

I've adjusted the skill to reflect this.


- ROCm-specific device behavior (e.g., tensor device masquerading)

**What hipify preserves faithfully:**
- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.
Collaborator

I think hipify is not smart enough to recognize those macros (USE_ROCM or __HIP_PLATFORM_AMD__); it just does simple search and replace. Probably it's just lucky that those guarded sections do not change during hipification.

Contributor Author

I think we can leave this as-is, since it's still true in practice.

Collaborator

We can leave as-is or even shrink, relying on hipify section in CLAUDE.md

Comment on lines +151 to +152
1. Basic module import — catches missing symbols, broken dynamic linking
2. Core operations (GEMM, normalization) — catches API mismatches, incorrect workspace sizing
Collaborator

We don't need to have the PyTorch/JAX conflicts resolved before building and testing the common dir.

Contributor Author

Are you referring to the cpp tests here?

Comment thread CLAUDE.md
When writing or updating memories in the project memory directory, follow these guidelines:

- **Scope**: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
- **Check before writing**: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
Collaborator

Where is the `MEMORY.md`?

Contributor Author

This is a user-local file that is an index to memory files left by Claude. This isn't something that is stored in the project.

@Micky774
Contributor Author

> Overall looks good to me. But generally how to use those mds? Do I need to import something when I start claude inside a docker container?

No. If you open Claude Code in the TE project repository (at TE root) then it will automatically pick up the files, and will automatically parse/use the skills if the context of the conversation ends up matching the YAML frontmatter descriptions. The CLAUDE.md is included in the initial context of ALL sessions started in the project.

@VeeraRajasekhar
Contributor

@Micky774 https://github.com/ROCm/amd-claude-marketplace/tree/main?tab=readme-ov-file#auto-register-the-marketplace-in-your-teams-repository

This is working as expected for internal users. For external users, it won't register this as a known marketplace since Claude cannot access it; if they try to /reload-plugins, it simply throws a "plugin installation failed" message without causing any issues or breakage in Claude. So it should be safe to proceed with adding this.

And once the changes to this PR are completed, you can update the files in this repo with these updated skill files.






Comment thread CLAUDE.md
- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.

## Hipify convention
The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit generated files directly** — edit the CUDA source instead.
Collaborator

If some file does not contain CUDA but only HIP code, and it does not include headers containing CUDA code, such a file can be excluded from hipification. It can be done in two ways: explicitly add it to the ignores list in `do_hipify()` in `build_tools/hipify/hipify.py`, which is useful for subdirectories containing HIP-only code; or rely on HIPIFY to detect that file modification is not needed. In the latter case the file should have `#include "hip/hip_runtime.h"`, either a real one or commented out if the header is not really needed.
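As a sketch of the second mechanism (hypothetical file name, not from the repo), the marker include looks like this:

```cpp
// hip_only_kernels.cpp -- hypothetical HIP-only source file.
// HIPIFY sees the hip_runtime.h include near the top and detects
// that no modification is needed, so the file is left untouched.
// If the header is not actually needed, it can stay commented out:
// #include "hip/hip_runtime.h"

// ... HIP-only kernel code follows ...
```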

Comment thread CLAUDE.md
## Code conventions
- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
- Keep env-var behavior stable; tests toggle flags intentionally.
- Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
Collaborator

and pylintrc
