[WiP] CI test for automated attention benchmarking suite#897

Open
amd-callumm wants to merge 4 commits into gfx11 from callumm.attn_bench_test_ci

Conversation

amd-callumm commented Apr 24, 2026

Purpose

Leverage the existing pytest and (manual) attention backend benchmarking infrastructure to implement automated attention performance regression tests. Each test runs benchmark.py against a YAML config designed to imitate a model of interest (number of heads, head dimensions, etc.) while also defining the batch specs (input/output tokens, batch count) and attention backends to run; a sketch of such a config is shown below. The current set of tests covers the TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN backends. Each model config includes a long-context prefill-only case, a decode-only case, and one prefill/decode combination of interest for Strix Halo. The YAML file also defines the number of warmup and benchmark iterations to run for each of these cases.
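For illustration, a config of roughly this shape could drive such a test. The field names below are hypothetical, not the PR's actual schema:

```python
import yaml  # PyYAML

# Hypothetical model config mirroring the structure described above;
# field names are illustrative only, not the PR's schema.
EXAMPLE_CONFIG = """
model: example-8b
num_heads: 32
num_kv_heads: 8
head_dim: 128
warmup_iters: 10
bench_iters: 50
backends: [TRITON_ATTN, ROCM_AITER_UNIFIED_ATTN]
batch_specs:
  - {name: prefill_long, input_tokens: 8192, output_tokens: 1, num_batches: 1}
  - {name: decode_only, input_tokens: 1, output_tokens: 128, num_batches: 8}
  - {name: mixed, input_tokens: 1024, output_tokens: 128, num_batches: 4}
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
assert config["backends"] == ["TRITON_ATTN", "ROCM_AITER_UNIFIED_ATTN"]
```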

The automated tests run each model config + batch spec + backend combination, with a 10-second cooldown between runs to minimize the risk of GPU thermal throttling that could lead to unstable results. Each case's results are written to a JSON file under tests/kernels/attention/benchmark/output/<gfx_target>/, which is compared against a golden reference/baseline.
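A minimal sketch of that outer loop (the CLI shape of benchmark.py and the helper names here are assumptions, not the PR's actual code):

```python
import subprocess
import time
from pathlib import Path

COOLDOWN_S = 10  # the 10-second cooldown described above

def run_case(config_path: Path, out_path: Path) -> None:
    # Hypothetical invocation; benchmark.py's real flags may differ.
    subprocess.run(
        ["python", "benchmark.py", "--config", str(config_path), "--output", str(out_path)],
        check=True,
    )

def run_all(config_paths: list[Path], gfx_target: str, out_root: Path) -> None:
    out_dir = out_root / gfx_target
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in config_paths:
        run_case(path, out_dir / f"{path.stem}.json")
        time.sleep(COOLDOWN_S)  # let the GPU cool between cases
```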

Currently, these tests are Strix Halo only, but the infrastructure can easily support other platforms such as Strix Point.

Test cases can be marked as "skip" to avoid running the benchmarks entirely, or "intermittent" to mark tests as working but with unstable performance. Intermittent cases' performance is only compared against the golden baseline when the --attn-bench-intermittent flag is passed.

For now, these tests are not run in any CI job (like @eble-amd, I saw far slower performance on the CI machine than on my local one; until this is understood, a CI job would not be useful).

Test command

pytest tests/kernels/attention/benchmark/test_benchmark_attention.py::test_benchmark_regression [--attn-bench-intermittent]

Test Result

After 5 consecutive runs on my local machine, all 30 test cases (5 model configs × 3 batch cases × 2 backends) showed less than 10% deviation in mean time-per-iteration from the goldens; 27 of them stayed under 1% across all 5 runs.

No tests currently require the skip or intermittent flags, but both of these flags have been manually validated during development.

During test runs, I monitored my Strix Halo machine's GPU temperature at 5-second intervals and found that a 10-second cooldown was sufficient to keep the edge temperature below 65°C even with repeated runs of the test suite, well below typical thermal throttling thresholds. I haven't tried to push this interval any lower.
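A minimal sketch of such a polling loop (assuming the amdgpu sysfs hwmon interface, where temp1 is the edge sensor; the card/hwmon indices vary by machine):

```python
import glob
import time

def read_edge_temp_c() -> float:
    """Read the amdgpu edge temperature (temp1) in degrees Celsius."""
    # hwmon temp*_input files report millidegrees C
    path = glob.glob("/sys/class/drm/card*/device/hwmon/hwmon*/temp1_input")[0]
    with open(path) as f:
        return int(f.read()) / 1000.0

while True:
    print(f"edge temp: {read_edge_temp_c():.1f} C")
    time.sleep(5)  # 5-second sampling interval, as described above
```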

amd-callumm force-pushed the callumm.attn_bench_test_ci branch 9 times, most recently from 227fd1f to 26cdf3c on April 28, 2026 at 20:58
Signed-off-by: Callum Mitchell <callumm@amd.com>
amd-callumm force-pushed the callumm.attn_bench_test_ci branch from 26cdf3c to c2ed5d9 on April 28, 2026 at 23:14
Co-authored-by: Claude
Signed-off-by: Callum Mitchell <callumm@amd.com>
amd-callumm force-pushed the callumm.attn_bench_test_ci branch from c2ed5d9 to ab95a95 on April 28, 2026 at 23:37
Co-authored-by: Claude
Signed-off-by: Callum Mitchell <callumm@amd.com>
Signed-off-by: Callum Mitchell <callumm@amd.com>
amd-callumm force-pushed the callumm.attn_bench_test_ci branch from ab95a95 to 6bc7e34 on April 28, 2026 at 23:54
amd-callumm marked this pull request as ready for review on April 28, 2026 at 23:58

@eble-amd left a comment

This mostly looks good. Three things I recommend changing:

  • rename the test function
  • rename pct_change
  • clarify config validation

The rest can be ignored.

Comment on lines +14 to +17
def pytest_addoption(parser):
    """Add custom command-line options for attention benchmark tests."""
    parser.addoption(
        "--attn-bench-intermittent",

I don't demand that you change this before merging, but I'm giving notice that if you leave it to me, I will change these tests to use --intermittent from the top-level tests/conftest.py when I rebase my #898 onto your changes.

Comment on lines +415 to +416
def test_benchmark_regression(
    config_name: str,

This name is probably too general; there will be other benchmark regression tests.

Comment on lines +414 to +415
)
def test_benchmark_regression(

PR #898 proposes a @pytest.mark.benchmark marker that can be used to filter tests. If you don't wish to bring it into this PR, that's no problem; I'll mark these as benchmarks when I rebase #898 onto your changes.
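For reference, the proposed filtering would look something like this (sketch; the test function name here is hypothetical):

```python
import pytest

@pytest.mark.benchmark  # proposed in #898; enables `pytest -m benchmark` selection
def test_attn_benchmark_regression():
    ...
```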

Comment on lines +502 to +505
# Build lookup tables by config key
actual_by_key = {make_config_key(e): e for e in actual_results}
golden_by_key = {make_config_key(e): e for e in golden_results}
matching_keys = set(actual_by_key.keys()) & set(golden_by_key.keys())

Above, the comments say,

Validation Rules:
    - Entry count must match golden
    - All configs in golden must exist in actual (order-independent)

but it isn't obvious where this validation occurs. It looks like cases that are unique to actual or golden are just ignored.
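For what it's worth, making that validation explicit could look like this (sketch, reusing the lookup tables from the quoted hunk):

```python
def check_golden_coverage(actual_by_key: dict, golden_by_key: dict) -> None:
    """Fail loudly on cases present in one result set but not the other."""
    missing = golden_by_key.keys() - actual_by_key.keys()
    extra = actual_by_key.keys() - golden_by_key.keys()
    assert not missing, f"in golden but absent from actual: {sorted(missing)}"
    assert not extra, f"in actual but absent from golden: {sorted(extra)}"
```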

Comment on lines +526 to +528
if is_intermittent and not bench_intermittent:
    num_skipped_intermittent += 1
    continue

Skip earlier, maybe? If the performance won't be validated, why spend time running the test? If there's a good reason to do it this way, it's worth a comment, otherwise someone might waste time trying to change it before discovering why it is the way it is.

Comment on lines +552 to +553
pct_change = (actual_mean - golden_mean) / golden_mean

not really a percentage
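A rename along these lines would settle it (sketch, with stand-in values):

```python
actual_mean, golden_mean = 1.05, 1.00  # stand-in values
rel_change = (actual_mean - golden_mean) / golden_mean  # fractional change, not a percent
print(f"change vs. golden: {rel_change:+.1%}")  # the :% format spec scales for display
```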

Comment on lines +87 to +88
# Subprocess timeout (seconds)
BENCHMARK_TIMEOUT = 900

This is not a blocking issue (because you're not running these tests in CI), but this is longer than the --timeout 300 in build-rocm-wheels.yml.

Comment on lines +119 to +120
"""
Get absolute path to attention benchmark directory

According to Claude, "In practice, pytest on Python 3.12 sets __file__ to an absolute path, so it works, but the comments overclaim what the code ensures." Calling resolve() would make the code consistent with the comment, but then it might create problems by also resolving symlinks. Is it essential to use absolute paths here? Maybe just strike that word from the comments.
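Concretely, the strike-the-word option would leave something like this (function name hypothetical, mirroring the quoted docstring):

```python
from pathlib import Path

def get_attention_benchmark_dir() -> Path:
    """Get path to attention benchmark directory."""  # "absolute" dropped
    # __file__ is absolute under modern pytest anyway; .resolve() would
    # guarantee that but also follows symlinks, which may be unwanted.
    return Path(__file__).parent
```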
