[WiP] CI test for automated attention benchmarking suite #897
amd-callumm wants to merge 4 commits into gfx11 from
Conversation
Force-pushed 227fd1f to 26cdf3c (Compare)
Signed-off-by: Callum Mitchell <callumm@amd.com>
Force-pushed 26cdf3c to c2ed5d9 (Compare)
Co-authored-by: Claude
Signed-off-by: Callum Mitchell <callumm@amd.com>
Force-pushed c2ed5d9 to ab95a95 (Compare)
Co-authored-by: Claude
Signed-off-by: Callum Mitchell <callumm@amd.com>
Signed-off-by: Callum Mitchell <callumm@amd.com>
Force-pushed ab95a95 to 6bc7e34 (Compare)
eble-amd left a comment:
This mostly looks good. Three things I recommend changing:
- rename the test function
- rename pct_change
- clarify config validation
The rest can be ignored.
def pytest_addoption(parser):
    """Add custom command-line options for attention benchmark tests."""
    parser.addoption(
        "--attn-bench-intermittent",
I don't demand that you change this before merging, but I'm giving notice that if you leave it to me, I will change these tests to use --intermittent from the top-level tests/conftest.py when I rebase my #898 onto your changes.
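For reference, a minimal sketch of what a shared option in the top-level tests/conftest.py could look like (the option name and help text are assumptions here, not the actual #898 change):

def pytest_addoption(parser):
    # Hypothetical shared flag; #898 may define it differently.
    parser.addoption(
        "--intermittent",
        action="store_true",
        default=False,
        help="Also validate cases marked as intermittent.",
    )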
def test_benchmark_regression(
    config_name: str,
This name is probably too general; there will be other benchmark regression tests.
)
def test_benchmark_regression(
# Build lookup tables by config key
actual_by_key = {make_config_key(e): e for e in actual_results}
golden_by_key = {make_config_key(e): e for e in golden_results}
matching_keys = set(actual_by_key.keys()) & set(golden_by_key.keys())
Above, the comments say,
Validation Rules:
- Entry count must match golden
- All configs in golden must exist in actual (order-independent)
but it isn't obvious where this validation occurs. It looks like cases that are unique to actual or golden are just ignored.
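A minimal sketch of explicit enforcement, reusing the lookup tables from the snippet above (the error message wording is mine):

# Sketch: fail loudly when actual and golden configs diverge,
# instead of silently intersecting them.
missing = golden_by_key.keys() - actual_by_key.keys()
unexpected = actual_by_key.keys() - golden_by_key.keys()
assert not missing and not unexpected, (
    f"Config mismatch vs golden: missing={sorted(missing)}, "
    f"unexpected={sorted(unexpected)}"
)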
if is_intermittent and not bench_intermittent:
    num_skipped_intermittent += 1
    continue
Skip earlier, maybe? If the performance won't be validated, why spend time running the test? If there's a good reason to do it this way, it's worth a comment; otherwise someone might waste time trying to change it before discovering why it is the way it is.
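Concretely, that would mean hoisting the check above the benchmark invocation, something like this (run_case is a hypothetical stand-in for the actual call into benchmark.py):

# Sketch: decide the skip before paying for the benchmark run.
if is_intermittent and not bench_intermittent:
    num_skipped_intermittent += 1
    continue
result = run_case(case)  # hypothetical wrapper around the subprocess call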
pct_change = (actual_mean - golden_mean) / golden_mean
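The summary above asks to rename pct_change, presumably because the expression yields a signed fraction rather than a percentage. Either direction makes the name and the value consistent (sketch):

# Option A: rename to match the value (a signed fraction).
rel_change = (actual_mean - golden_mean) / golden_mean
# Option B: keep the name and convert to an actual percentage.
pct_change = 100.0 * (actual_mean - golden_mean) / golden_mean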
# Subprocess timeout (seconds)
BENCHMARK_TIMEOUT = 900
This is not a blocking issue (because you're not running these tests in CI), but this is longer than the --timeout 300 in build-rocm-wheels.yml.
| """ | ||
| Get absolute path to attention benchmark directory |
According to Claude, "In practice, pytest on Python 3.12 sets __file__ to an absolute path, so it works — but the comments overclaim what the code ensures." Calling resolve() would make the code consistent with the comment, but then it might create problems by also resolving symlinks. Is it essential to use absolute paths here? Maybe just strike that word from the comments.
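For comparison, a sketch of the two variants (BENCH_DIR is a placeholder name):

from pathlib import Path

# As written: relies on pytest setting __file__ to an absolute path,
# which holds in practice but is not guaranteed by the code itself.
BENCH_DIR = Path(__file__).parent
# With resolve(): guaranteed absolute, but symlinks get resolved too.
BENCH_DIR = Path(__file__).resolve().parent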
Purpose
Leverage the existing pytest and (manual) attention backend benchmarking infrastructure to implement automated attention performance regression tests. Each test runs benchmark.py against a YAML config designed to imitate a model of interest (number of heads, head dimensions, etc.) while defining batch specs (input/output tokens, batch count) and the attention backends to run. The current set of tests covers the TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN backends. Each model config includes a long-context prefill-only case, a decode-only case, and one prefill/decode combination of interest for Strix Halo. The YAML file also defines the number of warmup and benchmark iterations to run for each of these cases.
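To make the shape concrete, a hypothetical config could look roughly like this (every field name and value below is illustrative, not copied from the repo):

# Illustrative only; see tests/kernels/attention/benchmark/ for the real configs.
model:
  num_heads: 32
  num_kv_heads: 8
  head_dim: 128
backends: [TRITON_ATTN, ROCM_AITER_UNIFIED_ATTN]
warmup_iters: 10
bench_iters: 50
batch_specs:
  - {name: prefill_long, batch_count: 1, input_tokens: 8192, output_tokens: 1}
  - {name: decode_only, batch_count: 32, input_tokens: 1, output_tokens: 128}
  - {name: mixed, batch_count: 8, input_tokens: 1024, output_tokens: 128}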
The automated tests run each model config + batch spec + backend combination, with a 10-second cooldown between each to minimize the risk of thermal GPU throttling that could lead to unstable results. Each case's results are written to a JSON file under tests/kernels/attention/benchmark/output/<gfx_target>/, which is compared against a golden reference/baseline.
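In outline, the driver loop behaves like this (all names here are hypothetical; the real code shells out to benchmark.py with a timeout):

import time

COOLDOWN_SECONDS = 10  # guards against thermal throttling between cases

for case in all_cases:  # model config + batch spec + backend combinations
    result = run_case(case)         # hypothetical wrapper around benchmark.py
    write_json(result, output_dir)  # hypothetical; lands under output/<gfx_target>/
    time.sleep(COOLDOWN_SECONDS)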
Currently, these tests are Strix Halo only, but the infrastructure can easily support other platforms such as Strix Point.
Test cases can be marked as "skip" to avoid running the benchmarks, or as "intermittent" to flag tests that work but have unstable performance. Intermittent cases' performance is only compared to the golden reference when --attn-bench-intermittent is passed.
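Hypothetically, marking a case could look like this in the YAML (field names and placement are illustrative):

batch_specs:
  - {name: decode_only, batch_count: 32, input_tokens: 1, output_tokens: 128,
     intermittent: true}  # only validated when --attn-bench-intermittent is set
  - {name: mixed, batch_count: 8, input_tokens: 1024, output_tokens: 128,
     skip: true}          # benchmark not run at all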
For now, these tests are not run in any CI job (similar to @eble-amd, I saw far slower performance running on the CI machine compared to my local one; until this is understood, the CI job will not be useful).
Test command
pytest tests/kernels/attention/benchmark/test_benchmark_attention.py::test_benchmark_regression [--attn-bench-intermittent]

Test Result
Across 5 consecutive runs on my local machine, all 30 test cases (5 model configs * 3 batch specs * 2 backends) showed less than 10% variance in mean time-per-iteration compared to the goldens; 27 of them stayed under 1% variance in all 5 runs.
No tests currently require the skip or intermittent markers, but both paths were manually validated during development.
During test runs, I monitored my Strix Halo machine's GPU temperature at 5-second intervals and found that a 10-second cooldown was sufficient to keep the edge temperature below 65°C even across repeated runs of the test suite, well below typical thermal throttling thresholds. I haven't tried to push this interval any lower.