feat(ci): add test failure retry support to auto-rerun workflow#16446

Open
radical wants to merge 6 commits into microsoft:main from radical:test-retry

Member

@radical radical commented Apr 24, 2026

Description

Adds test failure retry support to the auto-rerun transient CI failures workflow. When a CI job fails due to test execution failures, the workflow now downloads TRX test result artifacts, parses them, and matches individual test failures against configurable retry-safe patterns (e.g., network errors, MCR rate limiting).

The design of the retry patterns configuration was inspired by arcade's test configuration JSON schema.

Retry Patterns Config (eng/test-retry-patterns.json)

The new eng/test-retry-patterns.json file defines which test and job failures are safe to automatically retry. It uses a declarative JSON format with two pattern arrays:

Structure

{
  "version": 1,
  "testFailurePatterns": [ ... ],
  "jobFailurePatterns": [ ... ]
}

Each pattern has:

  • output — a substring or { "regex": "..." } matched against test output or job logs
  • reason — human-readable explanation shown in PR comments and workflow summaries
  • testName (optional, test patterns only) — filter by test name (substring or regex)
  • testProject (optional, test patterns only) — filter by test project/assembly name
  • jobName (optional, job patterns only) — filter by job name (substring or regex)
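
As a rough illustration of the two accepted shapes for `output`, `testName`, and `jobName`, here is a minimal sketch of a matcher. The function name `matchesPattern` is hypothetical, and the case-sensitive substring comparison is an assumption; the PR's own `matchesRetryPattern` may behave differently.

```javascript
// Hypothetical sketch; not the PR's actual matchesRetryPattern.
function matchesPattern(pattern, text) {
  if (typeof pattern === "string") {
    // Plain string form: substring match (case sensitivity is an assumption).
    return text.includes(pattern);
  }
  if (pattern && typeof pattern.regex === "string") {
    // Object form: compile the regex source and test it against the text.
    return new RegExp(pattern.regex).test(text);
  }
  // Anything else is treated as "no match" in this sketch.
  return false;
}

console.log(matchesPattern("ECONNRESET", "read ECONNRESET at TLSWrap"));
console.log(matchesPattern({ regex: "exit code \\d+" }, "exit code 137"));
console.log(matchesPattern("ECONNRESET", "all tests passed"));
```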

Example: Simple substring match

Match any test whose output contains ECONNRESET:

{
  "output": "ECONNRESET",
  "reason": "Transient network connection reset"
}

Example: Regex pattern

Match MCR rate-limiting responses (403 Forbidden appearing no more than 500 characters after mcr.microsoft.com):

{
  "output": { "regex": "mcr\\.microsoft\\.com[\\s\\S]{0,500}403 Forbidden" },
  "reason": "MCR registry rate limiting (HTTP 403)"
}
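
To see what the bounded `[\s\S]{0,500}` window accepts and rejects, the regex from the example above can be tried against a few strings. The sample log lines below are invented for illustration and are not taken from real CI output.

```javascript
// Same regex source as in the JSON example above.
const mcr403 = new RegExp("mcr\\.microsoft\\.com[\\s\\S]{0,500}403 Forbidden");

// Invented sample strings, for illustration only.
const near = "GET https://mcr.microsoft.com/v2/sample/manifests/latest\nresponse: 403 Forbidden";
const far = "mcr.microsoft.com" + "x".repeat(600) + "403 Forbidden";
const reversed = "403 Forbidden while contacting mcr.microsoft.com";
const unrelated = "GET https://example.com/api -> 403 Forbidden";

console.log(mcr403.test(near));      // hostname followed closely by the error
console.log(mcr403.test(far));       // more than 500 characters apart
console.log(mcr403.test(reversed));  // error text appears before the hostname
console.log(mcr403.test(unrelated)); // a 403 from some other host
```

Only the first string matches. The pattern is order-sensitive: the hostname has to come first, with the error text following inside the window.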

Example: Job-scoped pattern with job name filter

Match the Windows process init failure only on Windows jobs:

{
  "jobName": { "regex": ".*windows.*" },
  "output": "0xC0000142",
  "reason": "Windows process initialization failure (0xC0000142)"
}

Example: Test-scoped pattern with test name filter

Match a specific flaky test by name and output:

{
  "testName": { "regex": ".*MyFlakyTest.*" },
  "output": "Connection refused",
  "reason": "Known transient failure in MyFlakyTest"
}
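
The commit notes describe rule matching as "AND-within/OR-across": every field present on a rule must match, and any single matching rule is enough. A minimal sketch of that logic follows. The helper names and the `{ name, project, output }` shape of a failed test are assumptions made for this example, not the PR's actual data model.

```javascript
// Hypothetical helper: a field is either a substring or { regex: "..." }.
function fieldMatches(field, text) {
  const value = String(text ?? "");
  return typeof field === "string"
    ? value.includes(field)
    : new RegExp(field.regex).test(value);
}

// AND within a rule: every field the rule specifies has to match.
function ruleMatches(rule, test) {
  if (rule.testName !== undefined && !fieldMatches(rule.testName, test.name)) return false;
  if (rule.testProject !== undefined && !fieldMatches(rule.testProject, test.project)) return false;
  return fieldMatches(rule.output, test.output);
}

// OR across rules: the first matching rule is returned, or null if none match.
function findMatchingRule(rules, test) {
  return rules.find((rule) => ruleMatches(rule, test)) ?? null;
}

const rules = [
  {
    testName: { regex: ".*MyFlakyTest.*" },
    output: "Connection refused",
    reason: "Known transient failure in MyFlakyTest",
  },
  { output: "ECONNRESET", reason: "Transient network connection reset" },
];

// Assumed shape of a failed test, for illustration only.
const failed = { name: "Suite.MyFlakyTest", project: "Sample.Tests", output: "Connection refused (127.0.0.1)" };
console.log(findMatchingRule(rules, failed)?.reason);
```

A test named `Suite.OtherTest` with the same "Connection refused" output would not match the first rule, because the `testName` filter fails even though `output` matches.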

Adding new patterns

  1. Edit eng/test-retry-patterns.json
  2. Add a new entry to testFailurePatterns or jobFailurePatterns
  3. The config is validated at runtime — unknown properties, missing reason, or invalid regex will produce clear error messages
  4. Run the Infrastructure.Tests to verify: dotnet test --project tests/Infrastructure.Tests/Infrastructure.Tests.csproj
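
For a sense of what the runtime validation in step 3 might involve, here is a minimal sketch covering the three checks named there: unknown properties, a missing `reason`, and an invalid regex. The function name, the error wording, and the allowed-key set (taken from the field list in this description) are assumptions; the real schema may accept more fields.

```javascript
// Keys taken from the field list in the PR description; an assumption, not the real schema.
const ALLOWED_TEST_RULE_KEYS = new Set(["output", "reason", "testName", "testProject"]);

// Hypothetical validator for one entry of testFailurePatterns.
function validateTestRule(rule, index) {
  const where = `testFailurePatterns[${index}]`;
  const errors = [];

  // Unknown properties usually indicate a typo in a field name.
  for (const key of Object.keys(rule)) {
    if (!ALLOWED_TEST_RULE_KEYS.has(key)) {
      errors.push(`${where}: unknown property "${key}"`);
    }
  }

  if (typeof rule.reason !== "string" || rule.reason.trim() === "") {
    errors.push(`${where}: "reason" is required`);
  }

  // Try compiling every { regex: "..." } field so bad patterns fail early.
  for (const key of ["output", "testName", "testProject"]) {
    const source = rule[key]?.regex;
    if (typeof source !== "string") continue; // plain substring or absent
    try {
      new RegExp(source);
    } catch (err) {
      errors.push(`${where}: invalid regex in "${key}": ${err.message}`);
    }
  }
  return errors;
}

console.log(validateTestRule({ output: "ECONNRESET", reason: "Transient network connection reset" }, 0));
console.log(validateTestRule({ output: { regex: "(" }, reasn: "typo" }, 1));
```

The second call reports three problems: the misspelled `reasn` key, the missing `reason`, and the unterminated group in the regex.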

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
  • Did you add public API?
    • No
  • Does the change make any security assumptions or guarantees?
    • No
  • Does the change require an update in our Aspire docs?
    • No

radical and others added 5 commits April 24, 2026 16:58
…pattern matching

Add eng/test-retry-patterns.json with initial transient failure patterns
(ECONNRESET, DNS failures, SSL errors, timeouts, Windows 0xC0000142).

Add pattern matching functions to auto-rerun-transient-ci-failures.js:
- loadRetryPatternsConfig: reads and validates JSON config
- validateRetryPatternsConfig: schema validation + regex compilation
- extractFailedTestsFromTrx: regex-based TRX XML parsing
- matchesRetryPattern: string/regex matching with case-insensitive support
- matchTestFailurePatterns: AND-within/OR-across rule matching
- matchJobLogPattern: job name + log text pattern matching

Add 30 tests in Infrastructure.Tests covering:
- Config JSON structure and schema validation (C#)
- Regex compilation validation via Node.js harness (V8 engine)
- Pattern matching: substring, regex, AND/OR logic, disabled rules
- TRX parsing: failed test extraction, output cap, XML entity decoding
- Validation edge cases: unknown props, wrong version, missing reason

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
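
The regex-based TRX parsing mentioned in this commit could look roughly like the sketch below. It assumes the usual TRX layout (`UnitTestResult` elements carrying `testName` and `outcome` attributes, with the failure text in a nested `Message` element). The function name and the default output cap are hypothetical, and XML entity decoding is left out for brevity.

```javascript
// Hypothetical sketch of regex-based TRX parsing; not the PR's extractFailedTestsFromTrx.
function extractFailedTests(trxXml, maxOutputChars = 4000) {
  const failed = [];
  // Matches a self-closing result, or one with a body up to its closing tag.
  const resultRe = /<UnitTestResult\b([^>]*?)(?:\/>|>([\s\S]*?)<\/UnitTestResult>)/g;

  for (const match of trxXml.matchAll(resultRe)) {
    const attrs = match[1];
    const body = match[2] ?? "";
    if (!/\boutcome="Failed"/.test(attrs)) continue;

    const name = /\btestName="([^"]*)"/.exec(attrs)?.[1] ?? "(unknown)";
    const message = /<Message>([\s\S]*?)<\/Message>/.exec(body)?.[1] ?? "";
    // Cap the captured output so one huge failure cannot dominate later matching.
    failed.push({ name, output: message.slice(0, maxOutputChars) });
  }
  return failed;
}

// Invented sample document, for illustration only.
const sampleTrx = `
<Results>
  <UnitTestResult testName="Sample.Passes" outcome="Passed" />
  <UnitTestResult testName="Sample.Fails" outcome="Failed">
    <Output><ErrorInfo><Message>read ECONNRESET</Message></ErrorInfo></Output>
  </UnitTestResult>
</Results>`;

console.log(extractFailedTests(sampleTrx));
```

Regex parsing avoids an XML dependency in the workflow script, at the cost of being stricter about attribute quoting than a real parser would be.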
Extend the auto-rerun-transient-ci-failures workflow to detect transient
test failures in addition to infrastructure failures. Two new matching
paths:

1. Job log pattern matching: analyzeFailedJobs now accepts an optional
   retryPatternsConfig and runs a 3rd classification pass using
   matchJobLogPattern for test-execution-failure jobs.

2. TRX-based pattern matching: After job classification, the YAML
   workflow downloads the All-TestResults artifact, extracts .trx files,
   and matches failed test output against testFailurePatterns from
   eng/test-retry-patterns.json. When matches are found, all skipped
   test-execution-failure jobs are promoted to retryable.

New exported JS functions:
- hasTestExecutionFailureStep: checks if a job has test execution steps
- analyzeTrxFiles: parses TRX contents, matches against patterns, dedupes
- promoteTestExecutionFailureJobs: pure function to move jobs to retryable
- selectTestResultsArtifact: picks newest non-expired artifact under cap

Safety rails:
- Existing maxRetryableJobs cap (default 5) applies to promoted jobs
- 3-attempt budget shared with infrastructure retries
- Artifact download failures are non-fatal
- 100MB artifact size cap, 200 TRX file limit, 50MB per-file limit

Updated summary and PR comment formatting to distinguish infrastructure
retries from test-pattern retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
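
As an illustration of "picks newest non-expired artifact under cap", here is a minimal sketch over artifact objects shaped like the GitHub REST API's (`name`, `expired`, `size_in_bytes`, `created_at`). The function name, the exact-name comparison, and reading 100MB as 100 * 1024 * 1024 bytes are assumptions for this example.

```javascript
// Assumption: "100MB" is interpreted as 100 MiB here.
const MAX_ARTIFACT_BYTES = 100 * 1024 * 1024;

// Hypothetical sketch; not the PR's selectTestResultsArtifact.
function selectArtifact(artifacts, name = "All-TestResults", maxBytes = MAX_ARTIFACT_BYTES) {
  const candidates = artifacts
    .filter((a) => a.name === name && !a.expired && a.size_in_bytes <= maxBytes)
    // Newest first, based on the creation timestamp.
    .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at));
  return candidates[0] ?? null;
}

// Invented artifact list, for illustration only.
const artifacts = [
  { id: 1, name: "All-TestResults", expired: false, size_in_bytes: 5_000_000, created_at: "2026-04-24T10:00:00Z" },
  { id: 2, name: "All-TestResults", expired: true, size_in_bytes: 5_000_000, created_at: "2026-04-24T12:00:00Z" },
  { id: 3, name: "All-TestResults", expired: false, size_in_bytes: 500_000_000, created_at: "2026-04-24T13:00:00Z" },
  { id: 4, name: "All-TestResults", expired: false, size_in_bytes: 6_000_000, created_at: "2026-04-24T11:00:00Z" },
  { id: 5, name: "logs", expired: false, size_in_bytes: 1_000, created_at: "2026-04-24T14:00:00Z" },
];

console.log(selectArtifact(artifacts)?.id);
```

Returning `null` when nothing qualifies fits the commit's note that artifact download failures are non-fatal: the caller can skip TRX analysis rather than abort.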
Restructure the documentation from a behavior contract into a
user-facing guide that explains:

- How the rerun system works at a glance (flow diagram)
- The four analysis passes and what each does
- When it triggers (automatic vs manual)
- How to add/modify test failure retry patterns in
  eng/test-retry-patterns.json with worked examples
- Rule field reference tables for both pattern types
- Matching semantics (AND/OR, substring vs regex, dedup)
- Tips for writing good patterns
- How to verify with dry run
- Safety rails summarized in a table
- Architecture and file layout
- How to run the tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add patterns to detect transient MCR (mcr.microsoft.com) rate limiting
failures that cause both test and infrastructure CI failures:

testFailurePatterns:
- MCR 403 Forbidden (regex scoped to mcr.microsoft.com)
- MCR 'The request is blocked' HTML response (regex scoped)
- CONTAINER1016 (.NET SDK container publish failure)
- 'pull access denied for mcr.microsoft.com' (Docker pull denial)

jobFailurePatterns:
- MCR 403 Forbidden (regex scoped to mcr.microsoft.com)
- MCR 'The request is blocked' HTML response (regex scoped)

Regex patterns use [\s\S]{0,500} to require mcr.microsoft.com within
500 chars of the error text, preventing false matches on non-MCR 403s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mments

- Fix decodeXmlEntities replacement ordering: move &amp; decode to last
  position to prevent double-decoding of &amp;quot; and &amp;apos;
- Add sanitizeMarkdown helper to escape backticks/pipes in test names
  rendered in PR comment markdown (defense-in-depth)
- Add test covering double-encoded XML entity decoding

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
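
The ordering fix described in this commit can be shown with a small sketch. Decoding `&amp;` before the other entities turns `&amp;quot;` into `&quot;` and then into `"`, which is one decoding level too many. Both function bodies below are illustrative reconstructions rather than the PR's code, and the exact escaping rules in `sanitizeMarkdown` are an assumption.

```javascript
// Buggy ordering: &amp; is decoded first, so its result gets decoded again.
function decodeAmpFirst(text) {
  return text
    .replace(/&amp;/g, "&")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">");
}

// Fixed ordering: &amp; is decoded last, giving exactly one level of decoding.
function decodeXmlEntities(text) {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Assumed behavior: backslash-escape backticks and pipes for markdown tables.
function sanitizeMarkdown(text) {
  return text.replace(/[`|]/g, "\\$&");
}

console.log(decodeAmpFirst("&amp;quot;"));    // double-decoded to a bare quote
console.log(decodeXmlEntities("&amp;quot;")); // stays as the literal text &quot;
console.log(sanitizeMarkdown("Foo|Bar`Baz"));
```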
Contributor

github-actions Bot commented Apr 24, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16446

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16446"

@radical radical marked this pull request as ready for review April 24, 2026 22:04
Copilot AI review requested due to automatic review settings April 24, 2026 22:04
Contributor

Copilot AI left a comment

Pull request overview

Adds configurable retry-safe matching for test execution failures in the “auto-rerun transient CI failures” workflow by introducing a versioned retry-patterns JSON config, TRX artifact analysis, and expanded test coverage/documentation.

Changes:

  • Introduces eng/test-retry-patterns.json to define retry-safe test/job failure patterns (substring or regex + reasons).
  • Extends the workflow to load the config, match job logs for test failures, download/unpack All-TestResults, parse TRX files, and promote test-execution-failure jobs when patterns match.
  • Adds extensive Infrastructure.Tests coverage plus updated documentation describing the new matching passes and configuration format.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js | Expands the Node harness surface area to invoke new JS helper functions from C# tests. |
| tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs | Adds many xUnit tests for config validation, matching semantics, TRX parsing, artifact selection, and promotion logic. |
| eng/test-retry-patterns.json | Adds the initial curated set of retry-safe patterns for test output and job logs. |
| docs/ci/auto-rerun-transient-ci-failures.md | Documents the new multi-pass analysis and how to configure retry patterns. |
| .github/workflows/auto-rerun-transient-ci-failures.yml | Loads retry-pattern config, performs job-log + TRX-based analysis, and forwards matched test info to the rerun/comment step. |
| .github/workflows/auto-rerun-transient-ci-failures.js | Implements config loading/validation and the new pattern matching + TRX parsing/promotion utilities. |

Comment thread docs/ci/auto-rerun-transient-ci-failures.md
Comment thread docs/ci/auto-rerun-transient-ci-failures.md Outdated
Comment thread .github/workflows/auto-rerun-transient-ci-failures.js
Comment thread tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs Outdated
Comment thread tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs Outdated
Comment thread tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs Outdated
Comment thread .github/workflows/auto-rerun-transient-ci-failures.yml
Comment thread .github/workflows/auto-rerun-transient-ci-failures.yml Outdated
@radical radical requested a review from sebastienros April 24, 2026 22:22
@radical radical added the area-engineering-systems label ("infrastructure helix infra engineering repo stuff") Apr 24, 2026
- Precompile regex patterns during config load instead of on every match;
  invalid patterns log warnings and disable the rule (comments #2-3)
- Add note to doc examples clarifying they are snippets, not standalone
  configs (comment #1)
- Dispose JsonDocument with 'using' in 3 test methods (comments #4-6)
- Guard artifact extraction against zip-slip and symlink attacks by
  skipping symlinks and verifying resolved paths stay within trxDir
  (comment #7)
- Cap test_pattern_matched_tests output to 50 entries and drop unused
  testProject field to avoid GH Actions size limits (comment #8)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
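
The zip-slip guard mentioned in this commit ("verifying resolved paths stay within trxDir") can be sketched with Node's `path` module. The function name is hypothetical, and the symlink check (which would need something like `fs.lstatSync`) is not shown.

```javascript
const path = require("node:path");

// Hypothetical sketch: true only if entryPath resolves to a location inside baseDir.
function isInsideDir(baseDir, entryPath) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, entryPath);
  const rel = path.relative(base, target);
  // Reject the directory itself, anything that climbs out with "..",
  // and results on a different root (possible on Windows).
  if (rel === "" || path.isAbsolute(rel)) return false;
  return rel !== ".." && !rel.startsWith(".." + path.sep);
}

console.log(isInsideDir("/tmp/trx", "results/a.trx"));
console.log(isInsideDir("/tmp/trx", "../../etc/passwd"));
console.log(isInsideDir("/tmp/trx", "/etc/passwd"));
```

Comparing against `".." + path.sep` rather than a bare `".."` prefix keeps legitimately named entries such as `..notes/a.trx` from being rejected.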