Fix flash-attention build env in CI deps Dockerfile #565
8989d90 to 2f20c17
LGTM.
# AITER (submodule: composable_kernel / CK). Submodule update after checkout aligns CK with this commit.
# Install AITER
RUN git clone --recursive https://github.com/ROCm/aiter.git \
You don't need --recursive if you later call git submodule update --init --recursive. As written, the submodules are checked out twice.
And if the checkout happens later anyway, it makes sense to add --no-checkout to the clone.
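A minimal sketch of the flow suggested above, assuming the Dockerfile pins a specific commit after cloning. The helper name clone_pinned and its arguments are hypothetical and not taken from the PR; only the git options named in the comment are used.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): clone a repository without populating
# the working tree, check out one pinned commit, then fetch submodules once.
clone_pinned() {
    url="$1"      # repository URL or local path
    commit="$2"   # commit hash, tag, or branch to pin
    dest="$3"     # target directory

    # --no-checkout avoids materializing the default branch, which would be
    # replaced by the pinned commit right away.
    git clone --no-checkout "$url" "$dest" \
        && git -C "$dest" checkout "$commit" \
        && git -C "$dest" submodule update --init --recursive
}
```

In a Dockerfile the same three commands would typically be chained in a single RUN step, so that the submodules are fetched only for the commit that is actually built.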
Remove PIP_FIND_LINKS and PIP_PRE. The AITER commit in use does not depend on FlyDSL, and there is no FlyDSL distribution for Python 3.11 anyway.
{
"docker_images": {
"default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf",
"default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf_v1",
For GHA the name can be reused, AFAIK - there is no image caching there
No, I already pushed an image with "default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf", and later created a new image.
Can that image not be overwritten?
It should be possible to overwrite it. In the worst case the old one can be deleted. With Jenkins we had to use unique tags because the workers cached images and looked only at the name/tag, not the hash.
Updated to use the older tag
&& export GPU_ARCHS="gfx950;gfx942" \
&& FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
Why an export instead of feeding it as an env var before the python setup.py install?
It is probably not needed at all for installation - nothing is compiled there
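The difference between the two forms discussed here can be sketched in plain shell. This is an illustration under assumptions, not the Dockerfile from the PR: the child sh -c command stands in for python setup.py install, and the variable names seen_inline, after_inline and after_export are made up for the example.

```shell
#!/bin/sh
# Start from a clean state so the comparison is meaningful.
unset GPU_ARCHS

# Inline assignment: the variable is placed in the environment of this one
# command only.
seen_inline=$(GPU_ARCHS="gfx950;gfx942" sh -c 'printf %s "$GPU_ARCHS"')

# A later command in the same shell does not see it.
after_inline=$(sh -c 'printf %s "${GPU_ARCHS:-unset}"')

# export: every later command in the same shell (for example, the rest of a
# chained RUN step) inherits the variable.
export GPU_ARCHS="gfx950;gfx942"
after_export=$(sh -c 'printf %s "${GPU_ARCHS:-unset}"')

echo "inline=$seen_inline after_inline=$after_inline after_export=$after_export"
```

Either form reaches the install command when it is written on the same line or later in the same chain. The inline form simply limits the variable to the one command that needs it.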
@VeeraRajasekhar @sudhu2k can either of you confirm that the image as it is now passes the failing tests for MXFP4 we're seeing on dev?
Edit: The current running CI for this PR is using the updated image source.
https://github.com/ROCm/TransformerEngine/actions/runs/25014847020/job/73259852413
892dc54 to ba241bf
…CI image tag