- Introduced multiple scripts for managing GitHub Actions runners within an Apptainer container, including build, run, and cleanup functionalities.
- Added comprehensive documentation (README.md, AGENTS.md, skills.md) detailing setup, usage, and environment configuration.
- Included example environment files for both runner and container configurations to guide users in setting up their environments securely and effectively.

This update establishes a robust framework for running GitHub Actions in HPC environments, enhancing usability and maintainability.

- Introduced a new workflow file to automate the setup of a Python virtual environment for Copilot.
- The workflow includes steps for checking out the repository, creating and activating a virtual environment, installing dependencies, and verifying ROCm and GPU visibility.
- This addition enhances the CI/CD process by streamlining the environment setup for Copilot integration.
@copilot run example 07_gemm_all_scatter and report back the teraflops.

@JoseSantosAMD I've opened a new pull request, #400, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Adds infrastructure to run Iris workflows on an AMD/ROCm-capable self-hosted “copilot” GitHub Actions runner, with Apptainer + SLURM scripts to build/run/manage the runner container.
Changes:
- Introduces a `copilot-setup-steps` workflow to bootstrap a Python venv and sanity-check ROCm visibility on the self-hosted runner.
- Adds Apptainer/SLURM runner lifecycle scripts (`build`, `run`, `start`) plus configuration templates.
- Adds documentation for setup/usage/troubleshooting of the coding-agent runner environment.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/copilot-setup-steps.yml | Manual workflow to prepare venv and verify ROCm tooling on the self-hosted runner. |
| .github/scripts/github-runner-files/start.sh | Container entrypoint that installs/configures the Actions runner and starts the listener. |
| .github/scripts/github-runner-files/run-github-coding-agent-runner.sh | SLURM/standalone launcher to execute the runner inside an Apptainer container. |
| .github/scripts/github-runner-files/build-github-coding-agent-runner.sh | SLURM build script to produce the runner container SIF image. |
| .github/scripts/github-runner-files/cleanup-old-runners.sh | Utility to remove stale runner configs/overlay artifacts. |
| .github/scripts/github-runner-files/runner-container.env.example | Template for container environment overrides (paths/options). |
| .github/scripts/github-runner-files/env.example | Template for exporting runner-related env vars (token, repo, labels). |
| .github/scripts/github-runner-files/README.md | End-to-end documentation for building and running the self-hosted runner. |
| .github/scripts/github-runner-files/skills.md | Additional build/run notes (SLURM-focused). |
| .github/scripts/github-runner-files/AGENTS.md | Local conventions for editing these scripts (no secrets, no host-specific paths). |
    # Point git config to a writable location (can be overridden by env file)
    export GIT_CONFIG_GLOBAL="${GIT_CONFIG_GLOBAL:-${RUNNER_WORKDIR}/.gitconfig}"
    mkdir -p "$(dirname "$GIT_CONFIG_GLOBAL")"
    git config --global --add safe.directory '*'
git config --global --add safe.directory '*' disables Git's ownership safety checks for all directories. On a self-hosted runner (especially if it can run arbitrary PR code), this broad trust is a security risk. Prefer adding only the specific workspace directory (or runner work subdirectory) that needs to be treated as safe instead of a wildcard.
Suggested change:

    - git config --global --add safe.directory '*'
    + if [ -n "${RUNNER_WORKDIR:-}" ]; then
    +   git config --global --add safe.directory "${RUNNER_WORKDIR}"
    + fi

    set -e

    WORK_DIR="${WORK:-/work1/amd/josantos}"
This hardcodes a host-specific absolute path (/work1/amd/josantos) as the default WORK_DIR, which conflicts with the repo’s stated convention to avoid machine-specific paths (see .github/scripts/github-runner-files/AGENTS.md:17-21). Please remove the hardcoded default and rely on env (e.g. $WORK, $HOME) or require the caller to pass a base directory.
Suggested change:

    - WORK_DIR="${WORK:-/work1/amd/josantos}"
    + WORK_DIR="${WORK:-$HOME}"

    --bind "${RUNNER_HOME_HOST}:/runner-home:rw" \
    --bind "${RUNNER_WORKDIR}:${RUNNER_WORKDIR}" \
    ${RUNNER_TMP_BIND:+"$RUNNER_TMP_BIND"} \
    --env "RUNNER_HOME=/runner-home" \
The optional /tmp bind is passed as a single argument because RUNNER_TMP_BIND contains spaces and is expanded with quotes (${RUNNER_TMP_BIND:+"$RUNNER_TMP_BIND"}). Apptainer expects --bind and its value as separate argv tokens (or --bind=...), so this likely won't bind /tmp as intended. Consider using an argv array (or conditionally appending --bind "${RUNNER_TMP}:/tmp:rw") instead of building a spaced string.
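The argv-array approach mentioned above could look roughly like the following sketch. It assumes bash (arrays are not POSIX sh); `show_argv` is a hypothetical stand-in for `apptainer exec` so the snippet runs without Apptainer installed, `build_bind_args` and `BIND_ARGS` are illustrative names, and the paths are placeholders chosen to contain spaces.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `apptainer exec`: prints each argv token it
# receives on its own line, so token boundaries are visible.
show_argv() { printf '[%s]\n' "$@"; }

# Illustrative helper: fills the global array BIND_ARGS.
build_bind_args() {
  BIND_ARGS=(--bind "${RUNNER_HOME_HOST}:/runner-home:rw")
  # Append the optional /tmp bind as two separate tokens: flag, then value.
  if [ -n "${RUNNER_TMP:-}" ]; then
    BIND_ARGS+=(--bind "${RUNNER_TMP}:/tmp:rw")
  fi
}

# Placeholder paths with spaces, to show that quoting survives expansion.
RUNNER_HOME_HOST="/scratch/runner home"
RUNNER_TMP="/scratch/runner tmp"

build_bind_args
show_argv "${BIND_ARGS[@]}"
```

In the real launcher, `"${BIND_ARGS[@]}"` would be expanded in place of the individual `--bind` lines, which keeps each flag and its value as distinct arguments even when a path contains whitespace.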
    if [ -n "${USE_OVERLAY}" ] && [ "${USE_OVERLAY}" != "0" ] && [ -d "${OVERLAY_DIR}" ]; then
      apptainer exec \
        --no-home \
        --overlay "${OVERLAY_DIR}" \
        --bind "${SCRIPT_DIR}:/runner-scripts:ro" \
--overlay "${OVERLAY_DIR}" is being given a directory created via mkdir -p, but elsewhere in the repo overlays are created as an image file with apptainer overlay create (e.g. .github/scripts/container_exec.sh:71-79 uses iris_overlay_...img). Passing an empty directory here is unlikely to be a valid overlay and can cause the runner container startup to fail. Consider creating an overlay image file under RUNNER_BASE (per job) with apptainer overlay create ... overlay.img, pass that file to --overlay, and clean it up when the runner stops.
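A minimal sketch of the per-job overlay image idea, assuming bash and that `apptainer overlay create --size <MiB> <file>` is available on the host. The 1024 MiB size, the `OVERLAY_IMG` and `OVERLAY_ARGS` names, and the fallback to the shell PID when `SLURM_JOB_ID` is unset are illustrative assumptions, not taken from the PR. The snippet skips creation when Apptainer is missing so it can be read and run anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Base directory for runner artifacts; the temp-dir fallback is an assumption.
RUNNER_BASE="${RUNNER_BASE:-${TMPDIR:-/tmp}}"

# One image per job: SLURM job id when available, otherwise the shell PID.
OVERLAY_IMG="${RUNNER_BASE}/overlay-${SLURM_JOB_ID:-$$}.img"

# Arguments to splice into `apptainer exec`; left empty if no image exists.
OVERLAY_ARGS=()

if command -v apptainer >/dev/null 2>&1 \
   && apptainer overlay create --size 1024 "${OVERLAY_IMG}"; then
  # Remove the image when this script exits, i.e. when the runner stops.
  trap 'rm -f "${OVERLAY_IMG}"' EXIT
  OVERLAY_ARGS=(--overlay "${OVERLAY_IMG}")
else
  echo "overlay image not created; continuing without --overlay" >&2
fi

echo "overlay image path: ${OVERLAY_IMG}"
```

The launcher would then pass `"${OVERLAY_ARGS[@]}"` to `apptainer exec` only when the array is non-empty, replacing the directory-based `--overlay "${OVERLAY_DIR}"` argument.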
    echo "Found existing runner configuration, removing..."
    ./config.sh remove --token "$REGISTRATION_TOKEN" || true
When an existing .runner config is present, config.sh remove is invoked with the registration token. The script later fetches a dedicated remove token (.../actions/runners/remove-token) for cleanup, which suggests the registration token may be the wrong credential for removal. If the remove step fails (currently ignored), the local .runner file can remain and config.sh may refuse to configure. Fetch and use a remove token for this removal (or ensure local config files are removed on failure) before calling config.sh --replace.
Suggested change:

    - echo "Found existing runner configuration, removing..."
    - ./config.sh remove --token "$REGISTRATION_TOKEN" || true
    + echo "Found existing runner configuration, requesting remove token..."
    + PRE_REMOVE_RESPONSE=$(curl -s -X POST \
    +   -H "Authorization: token $GITHUB_TOKEN" \
    +   -H "Accept: application/vnd.github.v3+json" \
    +   "https://api.github.com/repos/$GITHUB_REPOSITORY/actions/runners/remove-token")
    + if command -v jq >/dev/null 2>&1; then
    +   PRE_REMOVE_TOKEN=$(echo "$PRE_REMOVE_RESPONSE" | jq -r .token)
    + else
    +   PRE_REMOVE_TOKEN=$(echo "$PRE_REMOVE_RESPONSE" | grep -o '"token":"[^"]*"' | head -1 | cut -d'"' -f4)
    + fi
    + if [ "$PRE_REMOVE_TOKEN" != "null" ] && [ -n "$PRE_REMOVE_TOKEN" ]; then
    +   echo "Removing existing runner with remove token..."
    +   if ! ./config.sh remove --token "$PRE_REMOVE_TOKEN"; then
    +     echo "Warning: Failed to remove existing runner with remove token; deleting local .runner config."
    +     rm -f ".runner"
    +   fi
    + else
    +   echo "Warning: Could not obtain remove token for existing runner; deleting local .runner config."
    +   rm -f ".runner"
    + fi

    echo "Press Ctrl+C to stop"
    echo "=========================================="
    command -v rocminfo >/dev/null 2>&1 && rocminfo || true
    ./run.sh
GITHUB_TOKEN (a long-lived PAT with repo scope) is passed into the container and remains in the environment when ./run.sh starts, so every workflow job executed on this runner can read and exfiltrate it. An attacker who can trigger a workflow on this self-hosted runner (for example via a pull request workflow) could simply print $GITHUB_TOKEN in a job step and gain full API access to the repository outside of Actions. To avoid leaking this secret, keep using it only within start.sh/cleanup to obtain registration/remove tokens but ensure it is not present in the runner job environment (for example by removing or masking GITHUB_TOKEN from the environment before invoking ./run.sh or otherwise isolating the PAT from workflow steps).
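One way to keep the PAT out of the job environment, sketched under assumptions: the listener is launched from a subshell that unsets `GITHUB_TOKEN` first, so the parent script still holds the token for its later cleanup step. `run_listener` is a hypothetical stand-in for `./run.sh`, and the token value is a dummy. This only removes the variable from the inherited environment; it is not a full isolation boundary.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dummy value for illustration only; never hardcode a real token.
export GITHUB_TOKEN="dummy-not-a-real-token"

# Hypothetical stand-in for ./run.sh: a child process that reports
# whether it can see the token in its environment.
run_listener() {
  sh -c 'if [ -n "${GITHUB_TOKEN:-}" ]; then echo visible; else echo hidden; fi'
}

# The unset happens inside the subshell, so only the listener loses the token.
# In start.sh the equivalent would be: ( unset GITHUB_TOKEN; exec ./run.sh )
listener_view=$( unset GITHUB_TOKEN; run_listener )

echo "listener sees token: ${listener_view}"
echo "parent still has token: ${GITHUB_TOKEN:+yes}"
```

Scoping the `unset` to a subshell, rather than unsetting in the main script, leaves the token available to whatever cleanup logic runs after the listener exits.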
@copilot can you run example 7 for 10 different GEMM shapes and report back the results?

@mawad-amd I've opened a new pull request, #401, to work on those changes. Once the pull request is ready, I'll request review from you.
mawad-amd left a comment
Not sure whether we need to merge first before testing or not, so feel free to merge this.
- add issue_comment trigger with created, edited
- gate job to PR comments containing @copilot
- keep manual workflow_dispatch
- target self-hosted runner labels: copilot, apptainer, iris
@JoseSantosAMD I've opened a new pull request, #402, to work on those changes. Once the pull request is ready, I'll request review from you.

@JoseSantosAMD I've opened a new pull request, #404, to work on those changes. Once the pull request is ready, I'll request review from you.