56 changes: 56 additions & 0 deletions .github/scripts/select_te_docker_image_ci_config.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+#
+# Resolve TE CI Docker image from ci/ci_config.json (same rules as rocm-ci.yml).
+# Env: TEST_CONFIG_FROM_SOURCE (true/false), DOCKER_IMAGE_OVERRIDE (optional),
+#      GITHUB_OUTPUT (path), GITHUB_REF_NAME, GITHUB_BASE_REF (optional).
+# Writes: image-tag=<image> to GITHUB_OUTPUT.
+set -euo pipefail
+
+CONFIG_BRANCH="dev"
+if [[ "${TEST_CONFIG_FROM_SOURCE:-false}" == "true" ]]; then
+  CONFIG_BRANCH="${GITHUB_REF_NAME:?}"
+  echo "::notice::Debugging mode: Fetching config from current branch ($CONFIG_BRANCH)"
+fi
+
+CONFIG_URL="https://raw.githubusercontent.com/ROCm/TransformerEngine/${CONFIG_BRANCH}/ci/ci_config.json"
+echo "Attempting to fetch image config from: $CONFIG_URL"
+
+if curl -s -f -o docker_config.json "$CONFIG_URL"; then
+  echo "Successfully downloaded config from $CONFIG_BRANCH."
+else
+  echo "::warning::Failed to fetch config from $CONFIG_BRANCH (File might not exist yet)."
+
+  if [[ -f "ci/ci_config.json" ]]; then
+    echo "::notice::Falling back to local 'ci/ci_config.json' from checkout."
+    cp ci/ci_config.json docker_config.json
+  else
+    echo "::error::Config file not found in $CONFIG_BRANCH OR locally."
+    exit 1
+  fi
+fi
+
+# Match rocm-ci.yml: github.base_ref || github.ref_name (caller sets TE_CI_BRANCH_FOR_IMAGE_KEY)
+BRANCH_NAME="${TE_CI_BRANCH_FOR_IMAGE_KEY:-${GITHUB_REF_NAME:?}}"
+echo "Determining image for branch: $BRANCH_NAME"
+
+JSON_KEY="default"
+
+if [[ $BRANCH_NAME =~ ^release_v([0-9]+\.[0-9]+)_rocm$ ]]; then
+  VERSION_KEY="release_v${BASH_REMATCH[1]}"
+  if [[ $(jq "(.docker_images | has(\"$VERSION_KEY\"))" docker_config.json) == "true" ]]; then
+    JSON_KEY="$VERSION_KEY"
+  fi
+fi
+
+echo "Selected config key: $JSON_KEY"
+
+IMAGE_TO_USE=$(jq -r ".docker_images.\"$JSON_KEY\"" docker_config.json)
+
+if [[ -n "${DOCKER_IMAGE_OVERRIDE:-}" ]]; then
+  echo "::notice::Manual override detected: $DOCKER_IMAGE_OVERRIDE"
+  IMAGE_TO_USE="$DOCKER_IMAGE_OVERRIDE"
+fi
+
+echo "Selected image: $IMAGE_TO_USE"
+echo "image-tag=${IMAGE_TO_USE}" >> "${GITHUB_OUTPUT:?}"
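The branch-to-key rule in this script can be exercised without a config file. The sketch below is a hypothetical bash helper (`select_image_key` and the sample key list are illustrative and not part of the PR). It stands in for the `jq ... has(...)` lookup by taking the known `docker_images` keys as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the branch-to-key rule of the script.
# Usage: select_image_key <branch> <known key>...
# The known keys stand in for the keys of .docker_images in ci/ci_config.json.
select_image_key() {
  local branch="$1"
  shift
  local key="default" candidate known
  if [[ $branch =~ ^release_v([0-9]+\.[0-9]+)_rocm$ ]]; then
    candidate="release_v${BASH_REMATCH[1]}"
    for known in "$@"; do
      if [[ $known == "$candidate" ]]; then
        key="$candidate"
      fi
    done
  fi
  echo "$key"
}

# Sample data only: the key names below are assumptions for illustration.
select_image_key "release_v2.4_rocm" default release_v2.4   # prints release_v2.4
select_image_key "release_v2.6_rocm" default release_v2.4   # prints default
select_image_key "dev" default release_v2.4                 # prints default
```

A release branch whose version has no entry falls back to `default`, which is the behavior the script gets from the `has()` check.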
52 changes: 45 additions & 7 deletions .github/workflows/aiter-prebuilt-upload.yml
Collaborator

Pushing to dev is too late. Binaries are expected to be cached when the PR is created; otherwise the PR CI will have to rebuild them.

Contributor Author

I initially started with an idea to provide a PR comment trigger which does this. Do you think it is better? In this case I might need to provide a way to force-push if the user needs to trigger multiple times while working on the PR.

Collaborator

I wonder if it can be driven by CI labels + filter by specific path modification. I.e. on the first CI run after aiter commit update, it first builds and uploads AITER and then goes further with CI.

Contributor

@Micky774 Micky774 Apr 20, 2026

On that note, can we have an upload as a side effect during the build-and-test workflow? That would provide a relatively simple way to implement this.

Specifically, we can do this more-or-less as-is by using the following filtering for a prebuilt cache upload:

  on:
    pull_request:
      paths:
        - '3rdparty/aiter'
        - '3rdparty/aiter/**'

Not sure which one exactly is needed since aiter is a submodule but one should work. Still, I'd be more interested in conditionally checking whether the AITER submodule was built from source, and then uploading if it was in the build-and-test flow.
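On the question of which pattern matches: in `git diff --name-only`, a submodule bump shows up as a change to the gitlink path itself (`3rdparty/aiter`), not to files beneath it, so the bare path is the likelier match. This is worth confirming against how the `paths` filter treats submodules. A minimal sketch of the same check done inside a step, where `aiter_changed` is a hypothetical helper and the path lists are sample data:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when any changed path is the AITER
# gitlink itself or a path underneath it.
aiter_changed() {
  local path
  for path in "$@"; do
    case "$path" in
      3rdparty/aiter|3rdparty/aiter/*) return 0 ;;
    esac
  done
  return 1
}

# Sample changed-path lists (assumed, for illustration only).
if aiter_changed "README.md" "3rdparty/aiter"; then
  echo "upload"   # reached: the gitlink path is in the list
fi
if ! aiter_changed "README.md" "ci/ci_config.json"; then
  echo "skip"     # reached: nothing under 3rdparty/aiter changed
fi
```

In a workflow step the arguments could come from `git diff --name-only "origin/${GITHUB_BASE_REF}...HEAD"`, assuming the base branch has been fetched.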

@@ -4,12 +4,27 @@
 name: AITER Prebuilt Upload

 on:
+  # workflow_dispatch: optional image + vars.DEV_DOCKER_IMAGE fallback.
   workflow_dispatch:
     inputs:
       docker_image:
         description: "Docker image"
         required: false
         default: ""
+  # workflow_call: caller must pass docker_image.
+  workflow_call:
+    inputs:
+      docker_image:
+        description: "Docker image URI from the caller (e.g. rocm-ci select_docker_image.outputs.image-tag)"
+        required: true
+        type: string

+permissions:
+  contents: read
+
+concurrency:
+  group: aiter-prebuilt-dev
+  cancel-in-progress: false
+
 jobs:
   upload:
@@ -25,15 +40,38 @@ jobs:
       - name: Resolve docker image
         id: cfg
         run: |
-          IMAGE="${{ inputs.docker_image }}"
-          if [ -z "$IMAGE" ]; then
-            IMAGE="${{ vars.DEV_DOCKER_IMAGE }}"
+          set -euo pipefail
+          out_image() {
+            echo "image=$1" >> "$GITHUB_OUTPUT"
+          }
+
+          EVENT="${{ github.event_name }}"
+          if [ "$EVENT" = "workflow_dispatch" ]; then
+            IMAGE="${{ inputs.docker_image }}"
+            if [ -z "$IMAGE" ]; then
+              IMAGE="${{ vars.DEV_DOCKER_IMAGE }}"
+            fi
+            if [ -z "$IMAGE" ]; then
+              echo "No docker image provided and vars.DEV_DOCKER_IMAGE is empty." >&2
+              exit 1
+            fi
+            out_image "$IMAGE"
+            exit 0
           fi
-          if [ -z "$IMAGE" ]; then
-            echo "No docker image provided and vars.DEV_DOCKER_IMAGE is empty." >&2
-            exit 1
+
+          if [ "$EVENT" = "workflow_call" ]; then
+            IMAGE="${{ inputs.docker_image }}"
+            if [ -z "$IMAGE" ]; then
+              echo "workflow_call requires non-empty docker_image." >&2
+              exit 1
+            fi
+            echo "Using docker_image from caller."
+            out_image "$IMAGE"
+            exit 0
           fi
-          echo "image=${IMAGE}" >> $GITHUB_OUTPUT
+
+          echo "Unsupported event: $EVENT" >&2
+          exit 1

       - name: Pull docker image
         run: docker pull ${{ steps.cfg.outputs.image }}
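The event-based resolution in this step can be tried locally as a plain function. The sketch below is a hypothetical standalone version: `resolve_image` is not in the PR, its third argument stands in for `vars.DEV_DOCKER_IMAGE`, and it prints the `image=` line instead of appending to `GITHUB_OUTPUT`. One point worth verifying in a test run: GitHub's reusable-workflow documentation says the `github` context in a called workflow is associated with the caller, so `github.event_name` may report the caller's event (for example `pull_request`) rather than `workflow_call`, in which case the second branch would not be reached.

```shell
#!/usr/bin/env bash
# Hypothetical standalone version of the "Resolve docker image" step.
# Args: <event name> <docker_image input> [fallback image]
# The fallback stands in for vars.DEV_DOCKER_IMAGE.
resolve_image() {
  local event="$1" image="$2" fallback="${3:-}"
  case "$event" in
    workflow_dispatch)
      # Manual runs may omit the input and rely on the fallback.
      [ -n "$image" ] || image="$fallback"
      if [ -z "$image" ]; then
        echo "No docker image provided and vars.DEV_DOCKER_IMAGE is empty." >&2
        return 1
      fi
      ;;
    workflow_call)
      # Callers must always pass an image; there is no fallback.
      if [ -z "$image" ]; then
        echo "workflow_call requires non-empty docker_image." >&2
        return 1
      fi
      ;;
    *)
      echo "Unsupported event: $event" >&2
      return 1
      ;;
  esac
  echo "image=$image"
}

# Sample image names are placeholders, not real tags.
resolve_image workflow_dispatch "" "example/fallback:tag"
resolve_image workflow_call "example/caller:tag"
```

Keeping the two events in separate branches makes the difference explicit: only manual dispatch may fall back to the repository variable.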
26 changes: 25 additions & 1 deletion .github/workflows/rocm-ci-dispatch.yml
@@ -12,6 +12,29 @@ permissions:
   contents: read

 jobs:
+  # Sets trigger_aiter_upload for rocm-ci when 3rdparty/aiter changed vs the PR base.
+  aiter_upload_trigger:
+    name: PR — set trigger_aiter_upload
+    runs-on: ubuntu-latest
+    outputs:
+      trigger_aiter_upload: ${{ steps.set.outputs.trigger_aiter_upload }}
+    steps:
+      - uses: actions/checkout@v6
+      - uses: dorny/paths-filter@v3
+        id: filter
+        with:
+          filters: |
+            aiter:
+              - '3rdparty/aiter/**'
+              - '3rdparty/aiter'
+      - id: set
+        run: |
+          if [ "${{ steps.filter.outputs.aiter }}" == "true" ]; then
+            echo "trigger_aiter_upload=true" >> "$GITHUB_OUTPUT"
+          else
+            echo "trigger_aiter_upload=false" >> "$GITHUB_OUTPUT"
+          fi
+
   determine_level:
     runs-on: ubuntu-latest
     outputs:
@@ -53,9 +76,10 @@ jobs:
     # - A commit was pushed with existing ci-level label(s)
     # - The PR was reopened or opened with existing ci-level label(s)
     if: ${{ needs.determine_level.outputs.test_level != '' }}
-    needs: determine_level
+    needs: [determine_level, aiter_upload_trigger]
     name: CI Level ${{ needs.determine_level.outputs.test_level }}
     uses: ./.github/workflows/rocm-ci.yml
     secrets: inherit
     with:
       test_level: ${{ needs.determine_level.outputs.test_level }}
+      trigger_aiter_upload: ${{ needs.aiter_upload_trigger.outputs.trigger_aiter_upload == 'true' }}
105 changes: 40 additions & 65 deletions .github/workflows/rocm-ci.yml
@@ -26,6 +26,11 @@ on:
         required: false
         default: false
         type: boolean
+      trigger_aiter_upload:
+        description: 'True when 3rdparty/aiter changed on the PR (set by rocm-ci-dispatch)'
+        required: false
+        default: false
+        type: boolean
   workflow_dispatch:
     inputs:
       test_level:
@@ -40,14 +45,47 @@ on:
         description: 'DEBUG: Use config.json from current source branch instead of dev'
         type: boolean
         default: false
+      trigger_aiter_upload:
+        description: 'Advanced; PR path uses rocm-ci-dispatch. Default false.'
Collaborator

workflow_dispatch is not a PR-triggered action. Also, there is no need to specify the default in the description.

+        required: false
+        default: false
+        type: boolean

 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
   cancel-in-progress: true

 jobs:
+  select_docker_image:
+    name: Select Docker Image Tag
+    runs-on: ubuntu-latest
+    outputs:
+      image-tag: ${{ steps.select.outputs.image-tag }}
+    steps:
+      - uses: actions/checkout@v6
+      - name: Resolve image from ci_config.json
+        id: select
+        env:
+          TEST_CONFIG_FROM_SOURCE: ${{ inputs.test_config_from_source }}
+          DOCKER_IMAGE_OVERRIDE: ${{ inputs.docker_image_override }}
+          TE_CI_BRANCH_FOR_IMAGE_KEY: ${{ github.base_ref || github.ref_name }}
+        run: bash .github/scripts/select_te_docker_image_ci_config.sh
+
+  # Runs before the GPU matrix only when inputs.trigger_aiter_upload is true (rocm-ci-dispatch or manual dispatch).
+  # Not run on push to dev/release — assume prebuilts were published during the PR CI run.
+  upload_aiter_prebuilt:
+    name: Build and upload AITER prebuilt (before CI)
+    needs: select_docker_image
+    if: ${{ (github.event_name == 'workflow_call' || github.event_name == 'workflow_dispatch') && inputs.trigger_aiter_upload }}
+    uses: ./.github/workflows/aiter-prebuilt-upload.yml
+    with:
+      docker_image: ${{ needs.select_docker_image.outputs.image-tag }}
+    secrets: inherit
+
   build_and_test:
     name: Build and Test on GPU (${{ matrix.runner }}) - Level ${{ (github.event_name == 'push' && '3') || inputs.test_level || '1' }}
+    needs: [upload_aiter_prebuilt, select_docker_image]
+    if: always() && needs.select_docker_image.result == 'success' && (needs.upload_aiter_prebuilt.result == 'skipped' || needs.upload_aiter_prebuilt.result == 'success')
     timeout-minutes: 720
     runs-on: ${{ matrix.runner }}
     strategy:
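The `if:` on `build_and_test` encodes a small truth table: the image selection must succeed, and the upload job must have either succeeded or been skipped. By default a job whose `needs` dependency was skipped is itself skipped, which is presumably why `always()` is there. The sketch below restates the condition as a hypothetical shell predicate (`should_run_build_and_test` is illustrative only) so the combinations can be checked quickly.

```shell
#!/usr/bin/env bash
# Hypothetical predicate mirroring the build_and_test `if:` expression.
# Args: <select_docker_image result> <upload_aiter_prebuilt result>
should_run_build_and_test() {
  local select_result="$1" upload_result="$2"
  [ "$select_result" = "success" ] &&
    { [ "$upload_result" = "skipped" ] || [ "$upload_result" = "success" ]; }
}

# Walk the possible results of the upload job with a successful image selection.
for upload in success skipped failure cancelled; do
  if should_run_build_and_test success "$upload"; then
    echo "upload=$upload -> run"
  else
    echo "upload=$upload -> skip"
  fi
done
```

A failed or cancelled upload blocks the GPU matrix, while a skipped upload (no AITER change) lets it proceed.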
@@ -101,72 +139,9 @@ jobs:
           env | sort
           echo "::endgroup::"

-      - name: Select Docker Image Tag
-        id: select-image
-        run: |
-          # Determine config source
-          # Default we are fetching from 'dev' branch
-          CONFIG_BRANCH="dev"
-
-          # If manual run requesting source config, switch branch
-          if [[ "${{ inputs.test_config_from_source }}" == "true" ]]; then
-            CONFIG_BRANCH="${{ github.ref_name }}"
-            echo "::notice::Debugging mode: Fetching config from current branch ($CONFIG_BRANCH)"
-          fi
-
-          # Download config
-          CONFIG_URL="https://raw.githubusercontent.com/ROCm/TransformerEngine/${CONFIG_BRANCH}/ci/ci_config.json"
-          echo "Attempting to fetch image config from: $CONFIG_URL"
-
-          if curl -s -f -o docker_config.json "$CONFIG_URL"; then
-            echo "Successfully downloaded config from $CONFIG_BRANCH."
-          else
-            echo "::warning::Failed to fetch config from $CONFIG_BRANCH (File might not exist yet)."
-
-            # Fallback: Check source branch file
-            if [[ -f "ci/ci_config.json" ]]; then
-              echo "::notice::Falling back to local 'ci/ci_config.json' from checkout."
-              cp ci/ci_config.json docker_config.json
-            else
-              echo "::error::Config file not found in $CONFIG_BRANCH OR locally."
-              exit 1
-            fi
-          fi
-
-          # Determine image key
-          BRANCH_NAME="${{ github.base_ref || github.ref_name }}"
-          echo "Determining image for branch: $BRANCH_NAME"
-
-          # Logic: Check if branch matches "release_vX.X".
-          # If so, look for that key in JSON. Otherwise default.
-          JSON_KEY="default"
-
-          if [[ $BRANCH_NAME =~ ^release_v([0-9]+\.[0-9]+)_rocm$ ]]; then
-            VERSION_KEY="release_v${BASH_REMATCH[1]}"
-            # Check if this specific version key exists in the JSON
-            if [[ $(jq "(.docker_images | has(\"$VERSION_KEY\"))" docker_config.json) == "true" ]]; then
-              JSON_KEY="$VERSION_KEY"
-            fi
-          fi
-
-          echo "Selected config key: $JSON_KEY"
-
-          # Extract image name from json
-          IMAGE_TO_USE=$(jq -r ".docker_images.\"$JSON_KEY\"" docker_config.json)
-
-          # Check input from workflow_dispatch overriding the image
-          MANUAL_OVERRIDE="${{ inputs.docker_image_override }}"
-          if [[ -n "$MANUAL_OVERRIDE" ]]; then
-            echo "::notice::Manual override detected: $MANUAL_OVERRIDE"
-            IMAGE_TO_USE="$MANUAL_OVERRIDE"
-          fi
-
-          echo "Selected image: $IMAGE_TO_USE"
-          echo "image-tag=$IMAGE_TO_USE" >> $GITHUB_OUTPUT
-
       - name: Pull Docker Image
         run: |
-          docker pull ${{ steps.select-image.outputs.image-tag }}
+          docker pull ${{ needs.select_docker_image.outputs.image-tag }}

       - name: Run Container
         run: |
@@ -180,7 +155,7 @@ jobs:
             --group-add $(getent group video | cut -d: -f3) \
             -v "${{ github.workspace }}:/workspace" \
             -w /workspace \
-            ${{ steps.select-image.outputs.image-tag}}
+            ${{ needs.select_docker_image.outputs.image-tag }}

       - name: Container Diagnostics & GPU Setup
         id: container-diag