feat(ci): add test failure retry support to auto-rerun workflow #16446
Open
radical wants to merge 6 commits into microsoft:main
…pattern matching
Add eng/test-retry-patterns.json with initial transient failure patterns (ECONNRESET, DNS failures, SSL errors, timeouts, Windows 0xC0000142).
Add pattern matching functions to auto-rerun-transient-ci-failures.js:
- loadRetryPatternsConfig: reads and validates JSON config
- validateRetryPatternsConfig: schema validation + regex compilation
- extractFailedTestsFromTrx: regex-based TRX XML parsing
- matchesRetryPattern: string/regex matching with case-insensitive support
- matchTestFailurePatterns: AND-within/OR-across rule matching
- matchJobLogPattern: job name + log text pattern matching
Add 30 tests in Infrastructure.Tests covering:
- Config JSON structure and schema validation (C#)
- Regex compilation validation via Node.js harness (V8 engine)
- Pattern matching: substring, regex, AND/OR logic, disabled rules
- TRX parsing: failed test extraction, output cap, XML entity decoding
- Validation edge cases: unknown props, wrong version, missing reason
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
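For readers skimming the commit, the AND-within/OR-across semantics can be sketched roughly as follows. This is an illustration, not the code in this PR: the function names come from the commit message, but the bodies, the `enabled` flag used to disable a rule, and the choice to always match case-insensitively are assumptions.

```javascript
// Hypothetical sketch; names mirror the commit message, bodies are assumptions.

// A pattern is either a plain string (treated here as a case-insensitive
// substring) or an object of the form { regex: "..." }.
function matchesRetryPattern(text, pattern) {
  if (typeof text !== 'string') return false;
  if (typeof pattern === 'string') {
    return text.toLowerCase().includes(pattern.toLowerCase());
  }
  if (pattern && typeof pattern.regex === 'string') {
    return new RegExp(pattern.regex, 'i').test(text);
  }
  return false;
}

// AND within a rule: every field the rule specifies must match.
// OR across rules: the first enabled rule that matches is returned.
function matchTestFailurePatterns(failedTest, rules) {
  for (const rule of rules) {
    if (rule.enabled === false) continue; // assumed way of disabling a rule
    const checks = [
      ['testName', failedTest.testName],
      ['output', failedTest.output],
    ];
    const specified = checks.filter(([field]) => rule[field] !== undefined);
    if (specified.length === 0) continue; // a rule with no matchers never matches
    if (specified.every(([field, value]) => matchesRetryPattern(value, rule[field]))) {
      return rule;
    }
  }
  return null;
}
```

Returning the matched rule (rather than a boolean) keeps its `reason` available for the summary and PR comment.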
Extend the auto-rerun-transient-ci-failures workflow to detect transient test failures in addition to infrastructure failures.
Two new matching paths:
1. Job log pattern matching: analyzeFailedJobs now accepts an optional retryPatternsConfig and runs a 3rd classification pass using matchJobLogPattern for test-execution-failure jobs.
2. TRX-based pattern matching: After job classification, the YAML workflow downloads the All-TestResults artifact, extracts .trx files, and matches failed test output against testFailurePatterns from eng/test-retry-patterns.json. When matches are found, all skipped test-execution-failure jobs are promoted to retryable.
New exported JS functions:
- hasTestExecutionFailureStep: checks if a job has test execution steps
- analyzeTrxFiles: parses TRX contents, matches against patterns, dedupes
- promoteTestExecutionFailureJobs: pure function to move jobs to retryable
- selectTestResultsArtifact: picks newest non-expired artifact under cap
Safety rails:
- Existing maxRetryableJobs cap (default 5) applies to promoted jobs
- 3-attempt budget shared with infrastructure retries
- Artifact download failures are non-fatal
- 100MB artifact size cap, 200 TRX file limit, 50MB per-file limit
Updated summary and PR comment formatting to distinguish infrastructure retries from test-pattern retries.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
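As a rough illustration of the "newest non-expired artifact under cap" rule, here is a minimal sketch. It assumes the input is the `artifacts` array returned by GitHub's "list workflow run artifacts" REST endpoint (fields `name`, `expired`, `size_in_bytes`, `created_at`); the function body and the `MAX_ARTIFACT_BYTES` constant name are assumptions, not the implementation in this PR.

```javascript
// Hypothetical sketch of the artifact selection described above.
const MAX_ARTIFACT_BYTES = 100 * 1024 * 1024; // 100MB cap from the commit message

function selectTestResultsArtifact(artifacts, name = 'All-TestResults') {
  const candidates = artifacts
    // Keep only usable artifacts: right name, not expired, within the size cap.
    .filter((a) => a.name === name && !a.expired && a.size_in_bytes <= MAX_ARTIFACT_BYTES)
    // Newest first, based on the ISO 8601 created_at timestamp.
    .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at));
  return candidates[0] ?? null;
}
```

Returning `null` when nothing qualifies lets the caller treat a missing artifact as non-fatal, in line with the safety rails listed above.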
Restructure the documentation from a behavior contract into a user-facing guide that explains:
- How the rerun system works at a glance (flow diagram)
- The four analysis passes and what each does
- When it triggers (automatic vs manual)
- How to add/modify test failure retry patterns in eng/test-retry-patterns.json with worked examples
- Rule field reference tables for both pattern types
- Matching semantics (AND/OR, substring vs regex, dedup)
- Tips for writing good patterns
- How to verify with dry run
- Safety rails summarized in a table
- Architecture and file layout
- How to run the tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add patterns to detect transient MCR (mcr.microsoft.com) rate limiting
failures that cause both test and infrastructure CI failures:
testFailurePatterns:
- MCR 403 Forbidden (regex scoped to mcr.microsoft.com)
- MCR 'The request is blocked' HTML response (regex scoped)
- CONTAINER1016 (.NET SDK container publish failure)
- 'pull access denied for mcr.microsoft.com' (Docker pull denial)
jobFailurePatterns:
- MCR 403 Forbidden (regex scoped to mcr.microsoft.com)
- MCR 'The request is blocked' HTML response (regex scoped)
Regex patterns use [\s\S]{0,500} to require mcr.microsoft.com within
500 chars of the error text, preventing false matches on non-MCR 403s.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
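The scoping effect of `[\s\S]{0,500}` can be checked with a few lines of JavaScript. The regex below is the one shown in the PR description; the sample log strings are made up for illustration.

```javascript
// Regex taken from the PR description; the sample inputs are invented.
const mcr403 = new RegExp('mcr\\.microsoft\\.com[\\s\\S]{0,500}403 Forbidden');

// Host followed shortly by the error text, across a line break.
const nearby = 'GET https://mcr.microsoft.com/v2/dotnet/sdk/manifests/latest\nresponse: 403 Forbidden';
// A 403 from some other host.
const unrelated = 'GET https://example.test/api\nresponse: 403 Forbidden';
// Host present, but more than 500 characters before the error text.
const tooFar = 'mcr.microsoft.com' + '\n'.repeat(501) + '403 Forbidden';

console.log(mcr403.test(nearby));    // true
console.log(mcr403.test(unrelated)); // false
console.log(mcr403.test(tooFar));    // false
```

`[\s\S]` is used instead of `.` so the gap can span newlines. Note that this form only matches when the host appears before the error text, not after it.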
…mments
- Fix decodeXmlEntities replacement ordering: move & decode to last position to prevent double-decoding of &quot; and &apos;
- Add sanitizeMarkdown helper to escape backticks/pipes in test names rendered in PR comment markdown (defense-in-depth)
- Add test covering double-encoded XML entity decoding
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
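To see why the ordering matters, compare decoding `&amp;` last with decoding it first. The sketch below is a guess at the shape of the fix, not the PR's code; `decodeAmpFirst` is a hypothetical helper included only to show the bug.

```javascript
// Hypothetical sketch: decode &amp; last so that a double-encoded entity
// such as "&amp;quot;" becomes the literal text "&quot;", not a quote.
function decodeXmlEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // must run last
}

// Buggy ordering for comparison: the "&" produced by the first step
// combines with the following "quot;" and is decoded a second time.
function decodeAmpFirst(text) {
  return text
    .replace(/&amp;/g, '&')
    .replace(/&quot;/g, '"');
}

console.log(decodeXmlEntities('&amp;quot;')); // &quot;
console.log(decodeAmpFirst('&amp;quot;'));    // "
```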
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16446
Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16446"
Pull request overview
Adds configurable retry-safe matching for test execution failures in the “auto-rerun transient CI failures” workflow by introducing a versioned retry-patterns JSON config, TRX artifact analysis, and expanded test coverage/documentation.
Changes:
- Introduces `eng/test-retry-patterns.json` to define retry-safe test/job failure patterns (substring or regex + reasons).
- Extends the workflow to load the config, match job logs for test failures, download/unpack `All-TestResults`, parse TRX files, and promote test-execution-failure jobs when patterns match.
- Adds extensive Infrastructure.Tests coverage plus updated documentation describing the new matching passes and configuration format.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js | Expands the Node harness surface area to invoke new JS helper functions from C# tests. |
| tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs | Adds many xUnit tests for config validation, matching semantics, TRX parsing, artifact selection, and promotion logic. |
| eng/test-retry-patterns.json | Adds the initial curated set of retry-safe patterns for test output and job logs. |
| docs/ci/auto-rerun-transient-ci-failures.md | Documents the new multi-pass analysis and how to configure retry patterns. |
| .github/workflows/auto-rerun-transient-ci-failures.yml | Loads retry-pattern config, performs job-log + TRX-based analysis, and forwards matched test info to the rerun/comment step. |
| .github/workflows/auto-rerun-transient-ci-failures.js | Implements config loading/validation and the new pattern matching + TRX parsing/promotion utilities. |
- Precompile regex patterns during config load instead of on every match; invalid patterns log warnings and disable the rule (comments #2-3)
- Add note to doc examples clarifying they are snippets, not standalone configs (comment #1)
- Dispose JsonDocument with 'using' in 3 test methods (comments #4-6)
- Guard artifact extraction against zip-slip and symlink attacks by skipping symlinks and verifying resolved paths stay within trxDir (comment #7)
- Cap test_pattern_matched_tests output to 50 entries and drop unused testProject field to avoid GH Actions size limits (comment #8)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
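The first item (precompile once, disable the rule on an invalid pattern) might look something like the sketch below. The `precompileRules` name, the `disabled` and `matchers` properties, and the warning text are all hypothetical; only the behavior is taken from the commit message.

```javascript
// Hypothetical sketch: compile every pattern once at config load time.
// An invalid regex logs a warning and disables the whole rule instead of
// throwing during matching.
function precompileRules(rules, warn = console.warn) {
  return rules.map((rule, index) => {
    const compiled = { ...rule, disabled: false, matchers: {} };
    for (const field of ['output', 'testName', 'testProject', 'jobName']) {
      const pattern = rule[field];
      if (pattern === undefined) continue;
      if (typeof pattern === 'string') {
        // Plain strings are treated as case-insensitive substrings (assumption).
        const needle = pattern.toLowerCase();
        compiled.matchers[field] = (text) => text.toLowerCase().includes(needle);
        continue;
      }
      if (!pattern || typeof pattern.regex !== 'string') {
        warn(`Rule ${index}: "${field}" must be a string or { regex }; rule disabled`);
        compiled.disabled = true;
        continue;
      }
      try {
        // No "g" flag, so repeated test() calls carry no lastIndex state.
        const regex = new RegExp(pattern.regex, 'i');
        compiled.matchers[field] = (text) => regex.test(text);
      } catch (error) {
        warn(`Rule ${index}: invalid regex in "${field}" (${error.message}); rule disabled`);
        compiled.disabled = true;
      }
    }
    return compiled;
  });
}
```

Matching code would then skip any rule with `disabled === true` and call the stored matcher functions directly.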
Description
Adds test failure retry support to the auto-rerun transient CI failures workflow. When a CI job fails due to test execution failures, the workflow now downloads TRX test result artifacts, parses them, and matches individual test failures against configurable retry-safe patterns (e.g., network errors, MCR rate limiting).
The design of the retry patterns configuration was inspired by arcade's test configuration JSON schema.
Retry Patterns Config (`eng/test-retry-patterns.json`)
The new `eng/test-retry-patterns.json` file defines which test and job failures are safe to automatically retry. It uses a declarative JSON format with two pattern arrays:
Structure
`{ "version": 1, "testFailurePatterns": [ ... ], "jobFailurePatterns": [ ... ] }`
Each pattern has:
- `output`: a substring or `{ "regex": "..." }` matched against test output or job logs
- `reason`: human-readable explanation shown in PR comments and workflow summaries
- `testName` (optional, test patterns only): filter by test name (substring or regex)
- `testProject` (optional, test patterns only): filter by test project/assembly name
- `jobName` (optional, job patterns only): filter by job name (substring or regex)
Example: Simple substring match
Match any test whose output contains `ECONNRESET`:
`{ "output": "ECONNRESET", "reason": "Transient network connection reset" }`
Example: Regex pattern
Match MCR rate-limiting responses (403 within ~500 chars of the MCR URL):
`{ "output": { "regex": "mcr\\.microsoft\\.com[\\s\\S]{0,500}403 Forbidden" }, "reason": "MCR registry rate limiting (HTTP 403)" }`
Example: Job-scoped pattern with job name filter
Match the Windows process init failure only on Windows jobs:
`{ "jobName": { "regex": ".*windows.*" }, "output": "0xC0000142", "reason": "Windows process initialization failure (0xC0000142)" }`
Example: Test-scoped pattern with test name filter
Match a specific flaky test by name and output:
`{ "testName": { "regex": ".*MyFlakyTest.*" }, "output": "Connection refused", "reason": "Known transient failure in MyFlakyTest" }`
Adding new patterns
- Edit `eng/test-retry-patterns.json`
- Add an entry to `testFailurePatterns` or `jobFailurePatterns`
- … `reason`, or invalid regex will produce clear error messages
- Run `dotnet test --project tests/Infrastructure.Tests/Infrastructure.Tests.csproj`
Checklist