Conversation
> ---
> name: ifu-merge
> description: >
>   Guide for performing IFU (Internal Feature Update) merges on the TransformerEngine ROCm fork.
IFU stands for integrate from upstream.
Not updated yet :-), did you forget to commit your changes?
> - Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.
>
> **Rules that follow:**
> - Never edit `*_hip.cpp` or `.hip` files — they are regenerated from source files
We have one exception of a .hip file in the repo. Maybe we can rename it for consistency.

What's the exception? Renaming it would probably be best.

transformer_engine/common/rocshmem_api/rocshmem_waitkernel.hip: this file is excluded from hipification. I agree, for consistency with the rest of the code we can rename it to *.cpp (not to *.cu, because we hipify all *.cu) @alextmagro
> |---|---|---|
> | PyTorch CSRC (`.cpp` source files) | `#ifdef USE_ROCM` / `#ifndef USE_ROCM` | DeviceGuard, scale swizzling |
> | Common layer (`.cu` files that get hipified) | `#ifdef __HIP_PLATFORM_AMD__` | Warp masks, kernel dispatch |
> | Python code | `IS_HIP_EXTENSION` (from `torch.utils.cpp_extension`) | Workspace sizing, feature flags |
Also guard for JAX Python code
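For reference, a minimal sketch of the Python-side guard pattern the quoted table describes; the workspace sizes are made-up illustration values, and a JAX-side guard (as suggested above) would follow the same shape:

```python
from torch.utils.cpp_extension import IS_HIP_EXTENSION

# Illustrative only: branch ROCm-specific workspace sizing / feature flags
# on the build-time platform flag exposed by torch.utils.cpp_extension.
if IS_HIP_EXTENSION:
    workspace_bytes = 64 * 1024 * 1024  # hypothetical ROCm sizing
else:
    workspace_bytes = 32 * 1024 * 1024  # hypothetical CUDA sizing
```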
> git diff <rocm-parent>..<upstream-parent> --stat
>
> # Check for removed guards
> git diff <rocm-parent>..<upstream-parent> -- <file> | grep -E "^-.*(__HIP_PLATFORM_AMD__|USE_ROCM|IS_HIP_EXTENSION)"
I would also add "ROCm" and "upstream": those are comment markers that indicate changes made by us.
> 5. **Convention Changes**: Upstream changes a data format, tensor shape, or API contract without any code conflict. Every downstream consumer of that convention must be updated manually — the compiler won't catch these.
>
> **How to systematically audit:**
Should "what to pay attention to" points be here? We have two big semantic differences:
- `__shfl` vs `__shfl_sync` and other lane-communication built-ins
- fp8 data types, i.e. `torch.float8_e4m3fn` vs `get_torch_e4m3_type`, etc.

I think that's something that is well-structured and systematic enough that it is "obvious" to Claude when working with the relevant files. Maybe it won't pick up the exact semantics, but it would pick up their functionality and consequences.

@ipanfilo I feel these are not just for IFUs. Maybe they should be put in CLAUDE.md.

It may well be a generic rule of thumb, not IFU-only. I do not think this is obvious enough to always follow, especially the fp8 semantics: in many cases compile-time constants cannot be used and runtime detection is required.
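Since the fp8 point keeps coming up, here is a hedged Python sketch of why runtime detection matters, using a hypothetical resolver in the spirit of the `get_torch_e4m3_type` helper mentioned above (the real selection logic in the fork may differ):

```python
import torch

def resolve_e4m3_dtype():
    # Hypothetical stand-in for get_torch_e4m3_type: on ROCm the "e4m3"
    # type may be the FNUZ variant, so hard-coding torch.float8_e4m3fn
    # at import time would pick the wrong dtype on some builds.
    if torch.version.hip is not None and hasattr(torch, "float8_e4m3fnuz"):
        return torch.float8_e4m3fnuz
    return torch.float8_e4m3fn

print(resolve_e4m3_dtype())
```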
I have a general comment. Because different agents use approximately the same skill format but different default locations, I asked some of them, and the recommendation is to keep skills in an agent-agnostic location like .ai-skills and make symlink(s) to whatever is actually used on the client side.

I was just chatting with Wen about this the other day, but I think initially we're going to focus on supporting Claude Code, since many other frameworks explicitly support Claude's file locations, and because instruction optimization genuinely differs framework-to-framework, so narrowing down on one will help optimize for reliable and powerful workflows. @VeeraRajasekhar has also been looking at alternative ways to keep things generic, and has found a project that aims to automatically configure per-framework skill installs as you mentioned, so we can explore that in the future too. Plus, as we migrate to a plugin-based approach, it'll be a bit easier for folks to generically install across frameworks.
wangye805 left a comment:

Overall looks good to me. But generally, how do I use those mds? Do I need to import something when I start Claude inside a docker container?
> The function in `fused_attn_ck.cpp:23-152` applies these filters in order. When CK is rejected, `NVTE_LOG_CK_CONFIG=1` prints the reason. The filters are:
>
> 1. **GQA groups**: `num_gqa_groups > 0` and `num_attn_heads % num_gqa_groups == 0`
> 2. **Data type**: `q_dtype == kv_dtype` and both are fp16 or bf16 (no fp8)
> 3. **Bias type**: only `NO_BIAS`, `ALIBI`, or `POST_SCALE_BIAS` (no `PRE_SCALE_BIAS`)
> 4. **Head dim**: `head_dim_qk < 512` and `head_dim_v < 512`
> 5. **Causal + window**: if causal mask, window must be `(-1, 0)` or `(>=0, 0)`
> 6. **No mask + window**: if no mask, window must be `(-1, -1)` or `(>=0, >=0)`
> 7. **QKV packed + GQA**: MQA/GQA cannot use qkvpacked layouts (`3HD`, `H3D`)
> 8. **QKV packed + seqlen**: qkvpacked requires `s_q == s_kv`
> 9. **THD + padding**: ragged (THD) format requires a padding mask type
> 10. **Padding + bias**: padding mask cannot combine with `POST_SCALE_BIAS` or `ALIBI`
This detailed info is good, but it may change as our code base evolves. Is it possible to give Claude the commit hash at which this info is valid, and tell Claude to trace the changes afterwards? This way Claude won't be confused.

Good point, I've added a mention of the hash at which it is accurate, and a note that it is subject to change. The tracing is something it should be able to reliably perform itself due to the inclusion of the code source.
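To illustrate how these filters compose, a short Python sketch of the first few checks from the quoted list (illustrative pseudocode only, not the actual `fused_attn_ck.cpp` implementation):

```python
def ck_backend_eligible(num_attn_heads, num_gqa_groups,
                        q_dtype, kv_dtype, head_dim_qk, head_dim_v):
    # Filter 1: GQA groups must be positive and divide the head count.
    if num_gqa_groups <= 0 or num_attn_heads % num_gqa_groups != 0:
        return False
    # Filter 2: Q/KV dtypes must match and be fp16 or bf16 (no fp8).
    if q_dtype != kv_dtype or q_dtype not in ("fp16", "bf16"):
        return False
    # Filter 4: both head dims must stay below 512.
    if head_dim_qk >= 512 or head_dim_v >= 512:
        return False
    return True  # remaining filters (bias, mask/window, layout) omitted

print(ck_backend_eligible(32, 8, "bf16", "bf16", 128, 128))  # True
```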
| - `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure. | ||
|
|
||
| ### From HIP runtime | ||
| - `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access. |
Perhaps point to a URL with the hipError_t definition so that Claude can have a better understanding?

Do you have a good source that I can include?

The best, I think, is https://rocm.docs.amd.com/projects/HIP/en/develop/reference/error_codes.html#hip-error-codes, or maybe give it a reference to the HIP enum.
> ```
> 5. Key argument mappings:
> - `-iperm=1 -operm=1` → BSHD layout (TE default)
> - `-iperm=0 -operm=0` → SBHD layout
If I recall correctly, it's BHSD for iperm/operm=0.

From the cpp benchmark program's argument description: "if true, will be b*h*s*d, else b*s*h*d". I've adjusted the skill to reflect this.
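To make the two layouts concrete, a small hedged PyTorch sketch; the flag-to-layout comments below only restate the benchmark help text quoted above, so verify against the benchmark source:

```python
import torch

b, s, h, d = 2, 128, 16, 64  # illustrative sizes
x_bshd = torch.randn(b, s, h, d)      # b*s*h*d (BSHD): perm flags false/0
x_bhsd = x_bshd.permute(0, 2, 1, 3)   # b*h*s*d (BHSD): perm flags true/1
print(x_bshd.shape, x_bhsd.shape)
```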
> - ROCm-specific device behavior (e.g., tensor device masquerading)
>
> **What hipify preserves faithfully:**
> - Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.
I think hipify is not smart enough to recognize those macros (USE_ROCM or __HIP_PLATFORM_AMD__); it just does a simple search and replace. It's probably just luck that those guarded sections do not change during hipification.

I think we can leave this as-is, since it's still true in its capacity.

We can leave it as-is, or even shrink it, relying on the hipify section in CLAUDE.md.
> 1. Basic module import — catches missing symbols, broken dynamic linking
> 2. Core operations (GEMM, normalization) — catches API mismatches, incorrect workspace sizing
We don't need to have the pytorch/jax conflicts resolved before building and testing the common dir.

Are you referring to the cpp tests here?
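As a concrete version of those two verification steps, a minimal smoke-test sketch (assuming the PyTorch extension built successfully; `te.Linear` is part of TE's public PyTorch API, but treat the exact calls as illustrative):

```python
import torch
import transformer_engine.pytorch as te  # step 1: import catches missing symbols / broken linking

# Step 2: exercise a core op (GEMM via te.Linear) to catch API mismatches
# and incorrect workspace sizing.
layer = te.Linear(64, 64).cuda()  # the "cuda" device also covers ROCm builds
out = layer(torch.randn(4, 64, device="cuda"))
print(out.shape)
```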
> When writing or updating memories in the project memory directory, follow these guidelines:
>
> - **Scope**: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
> - **Check before writing**: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
Where is the MEMORY.md?

This is a user-local file that serves as an index to memory files left by Claude. It isn't something that is stored in the project.

No. If you open Claude Code in the TE project repository (at TE root), it will automatically pick up the files, and will automatically parse/use the skills if the context of the conversation ends up matching the YAML frontmatter descriptions. The CLAUDE.md is included in the initial context of ALL sessions started in the project.
@Micky774 https://github.com/ROCm/amd-claude-marketplace/tree/main?tab=readme-ov-file#auto-register-the-marketplace-in-your-teams-repository This is working as expected for internal users. For external users, it won't register this as a known marketplace since Claude cannot access it; if they try to /reload-plugins, it simply throws a "plugin installation failed" message without causing any issues or breakage in Claude. So it should be safe to proceed with adding this. And once the changes in this PR are complete, you can update the files in that repo with these updated skill files.
| - `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure. | ||
|
|
||
| ### From HIP runtime | ||
| - `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access. |
There was a problem hiding this comment.
The best, I think is https://rocm.docs.amd.com/projects/HIP/en/develop/reference/error_codes.html#hip-error-codes or maybe give it reference to enum HIP
> - 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.
>
> ## Hipify convention
> The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit generated files directly** — edit the CUDA source instead.
If a file contains only HIP code (no CUDA) and does not include headers containing CUDA code, it can be excluded from hipification. This can be done in two ways: explicitly add it to the ignores list in do_hipify() in build_tools/hipify/hipify.py, which is useful for subdirectories containing HIP-only code, or rely on hipify to detect that no modification is needed. In the latter case the file should have `#include "hip/hip_runtime.h"`, either real or commented out if the header is not actually needed.
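A hedged sketch of the first mechanism; the actual structure of do_hipify() and its ignore handling in build_tools/hipify/hipify.py may differ, so the names and pattern below are purely illustrative:

```python
import fnmatch

# Hypothetical ignore patterns, mirroring the ignores list described above
# for do_hipify(); the real list and matching logic may look different.
HIPIFY_IGNORES = ["*/rocshmem_api/*"]

def should_hipify(path: str) -> bool:
    # Skip files under ignored (HIP-only) subdirectories entirely.
    return not any(fnmatch.fnmatch(path, pat) for pat in HIPIFY_IGNORES)

print(should_hipify("transformer_engine/common/rocshmem_api/rocshmem_waitkernel.hip"))  # False
```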
> ## Code conventions
> - Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
> - Keep env-var behavior stable; tests toggle flags intentionally.
> - Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
Description

Includes an initial addition of repository-level AI agent instructions/context via `CLAUDE.md`, as well as example skills via `.claude/**/SKILL.md`. This mainly serves as a demonstration of how to add additional context to AI coding agents, as well as how to develop a reasonably-complex skill.

TODO: Back-test against old cases and refine as needed

Type of change

Changes

Please list the changes introduced in this PR:
- `CLAUDE.md`
- `.claude/ck-debugging/SKILL.md`
- `.claude/ifu-merge/SKILL.md`

Checklist: