Add retry to gRPC calls that failed due to transient errors #714

Open
sophiatev wants to merge 36 commits into main from stevosyan/add-retry-to-complete-calls

Conversation

@sophiatev
Contributor

Summary

What changed?

As the PR title says, this PR adds retry logic to gRPC calls in the worker process that fail due to transient errors (i.e. StatusCode.Unavailable).

Why is this change needed?

Previously, the call would simply fail, so the work item would be abandoned and picked up again only after a relatively long delay. For these transient errors, we want to retry almost immediately.

Copilot AI review requested due to automatic review settings April 27, 2026 16:58
@sophiatev
Contributor Author

@copilot add tests for the new retry logic in this PR

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds retry logic around worker gRPC calls when they fail with transient transport errors, with accompanying structured logging for each retry attempt.

Changes:

  • Wrap multiple gRPC client calls (abandon/complete operations) in a shared retry helper with exponential backoff + jitter.
  • Add a new warning log event to record transient gRPC retry attempts and backoff duration.
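The shared helper described in these bullets can be pictured roughly as follows. This is a minimal sketch written in Python for illustration, not the PR's C# code and not the actual ExecuteWithRetryAsync implementation. RpcError, TRANSIENT, execute_with_retry, and every parameter name and default are hypothetical, and treating only UNAVAILABLE as transient is an assumption taken from the PR summary.

```python
import random
import time


class RpcError(Exception):
    """Hypothetical stand-in for a gRPC failure that carries a status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


# Assumption: which status codes count as transient is a policy choice.
TRANSIENT = {"UNAVAILABLE"}


def execute_with_retry(action, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random, on_retry=None):
    """Run action(), retrying transient failures with exponential backoff + jitter."""
    attempt = 0
    while True:
        try:
            return action()
        except RpcError as err:
            # Non-transient errors and exhausted retries propagate to the caller.
            if err.status not in TRANSIENT or attempt >= max_attempts:
                raise
            attempt += 1
            # The ceiling doubles with each retry, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait a random fraction of the ceiling.
            delay = rng() * cap
            if on_retry is not None:
                # Hook standing in for the warning log emitted per retry.
                on_retry(attempt, err.status, delay)
            sleep(delay)
```

Injecting sleep and rng keeps the sketch deterministic under test; a real worker would also need to honor cancellation while waiting.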

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/Worker/Grpc/Logs.cs | Adds a new structured log message for retry attempts. |
| src/Worker/Grpc/GrpcDurableTaskWorker.Processor.cs | Introduces ExecuteWithRetryAsync and applies it to multiple gRPC calls. |

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:10
Contributor

Copilot AI commented Apr 27, 2026

@copilot add tests for the new retry logic in this PR

Added tests for ExecuteWithRetryAsync in test/Worker/Grpc.Tests/ExecuteWithRetryTests.cs (commit 61f8a21). The tests cover:

  • Success on first attempt – verifies no retry occurs when the action succeeds
  • Retry on all transient status codes (Unavailable, Unknown, DeadlineExceeded, Internal) – verifies the action is retried and eventually succeeds
  • No retry on non-transient errors (InvalidArgument, AlreadyExists, PermissionDenied) – verifies the exception propagates immediately
  • Cancellation during retry delay – verifies OperationCanceledException is thrown when the token is cancelled while waiting between retries
  • Log messages on retry – verifies TransientGrpcRetry log entries are emitted with the correct operation name and attempt count
  • Multiple consecutive retries – verifies each retry attempt is independently logged
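The cancellation case in the list above depends on the wait between retries being interruptible. A minimal sketch of that idea, in Python for illustration and not taken from the PR: OperationCancelled and cancellable_delay are hypothetical names, and a threading.Event stands in for a .NET cancellation token.

```python
import threading


class OperationCancelled(Exception):
    """Hypothetical analogue of OperationCanceledException."""


def cancellable_delay(seconds, cancel_event):
    """Wait up to `seconds`, but stop early and raise if cancellation is signalled."""
    # Event.wait returns True as soon as the event is set, False on timeout.
    if cancel_event.wait(timeout=seconds):
        raise OperationCancelled()
```

A test for this behaviour would signal the event while the delay is pending (or before it starts) and assert that the exception surfaces instead of the next retry running.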

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:32
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Sophia Tevosyan and others added 3 commits April 27, 2026 10:48
…member'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings April 27, 2026 19:46
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/Worker/Grpc/GrpcBackoff.cs:37

  • The XML doc summary for GrpcBackoff.Compute still describes it as computing a “full-jitter exponential backoff delay”, but the method now supports both full jitter and biased jitter via the fullJitter parameter. Consider updating the summary/remarks to describe both modes so the documentation matches the implementation.
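To make the two modes concrete, here is a rough sketch in Python; it is not the code in GrpcBackoff.cs. The function name, parameters, and zero-based attempt numbering are hypothetical. Full jitter is shown as a uniform draw over the whole exponential ceiling. What "biased jitter" means in the PR is not stated in this thread, so the second branch is only an assumption: it keeps half of the ceiling fixed and randomizes the other half.

```python
import random


def compute_backoff(attempt, base_delay, max_delay, full_jitter=True,
                    rng=random.random):
    """Return a delay in seconds for a zero-based retry attempt (hypothetical sketch)."""
    # Exponential ceiling, capped at max_delay; the exponent is clamped so very
    # large attempt numbers cannot overflow the float conversion.
    cap = min(max_delay, base_delay * 2 ** min(attempt, 32))
    if full_jitter:
        # Full jitter: uniform over [0, cap).
        return rng() * cap
    # Assumed "biased" mode: uniform over [cap / 2, cap), so the delay never
    # collapses to near zero.
    return cap / 2 + rng() * cap / 2
```

Full jitter spreads retries from many workers most widely, while a mode with a floor guarantees a minimum pause before the next attempt; documenting both, as the comment suggests, tells callers which trade-off each flag value selects.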

…ve from call chain

Agent-Logs-Url: https://github.com/microsoft/durabletask-dotnet/sessions/16ec67ea-9c12-4fad-8ef9-325eb2985d5d

Co-authored-by: sophiatev <38052607+sophiatev@users.noreply.github.com>
Contributor

Copilot AI commented Apr 27, 2026

@copilot add tests for the new retry logic in this PR

Tests were already added in earlier commits (test/Worker/Grpc.Tests/ExecuteWithRetryTests.cs). The suite covers: success on first attempt, retry on all 4 transient status codes, no retry on non-transient errors, cancellation during retry delay, log emission with status code assertion, multiple consecutive retries with per-attempt log assertion, and max-attempts exhaustion.

Copilot AI review requested due to automatic review settings April 27, 2026 20:06
@sophiatev
Contributor Author

@copilot now add integration tests for this retry functionality

Sophia Tevosyan and others added 3 commits April 27, 2026 13:50
Contributor

Copilot AI commented Apr 27, 2026

@copilot now add integration tests for this retry functionality

Added integration-level retry tests in commit 1379dd4 in RunBackgroundTaskLoggingTests.cs. These tests exercise the full RunBackgroundTask path with a mocked gRPC client:

  • Retries_Abandon_Orchestrator_On_Transient_Error_Eventually_Succeeds — verifies that a transient Unavailable error on the first abandon call is retried, eventually succeeds, and a TransientGrpcRetry warning is logged.
  • Retries_Abandon_Activity_On_Transient_Error_Eventually_Succeeds — same scenario for activity work items.
  • Retries_Abandon_Orchestrator_Until_MaxAttempts_Then_Fails — verifies that after all retry attempts are exhausted, the abandon exception is surfaced, the Abandoned log is not emitted, and maxAttempts retry warnings were logged.

The commit also fixes the unit test ExecuteWithRetryAsync_TransientErrorExceedsMaxAttempts_ThrowsLastRpcException, which expected the wrong call count (the loop makes maxAttempts + 1 total calls: maxAttempts calls that are retried, plus 1 final call whose exception propagates), and adds fast-option support to CreateProcessor to avoid long delays in that test.
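The call-count arithmetic can be checked with a small sketch. It is written in Python for illustration and is not the PR's loop: TransientError and run_with_retries are hypothetical names, and backoff delays are omitted because only the counting matters here.

```python
class TransientError(Exception):
    """Hypothetical stand-in for an RpcException with a transient status."""


def run_with_retries(action, max_attempts, on_retry):
    """Retry a failing action up to max_attempts times, then let the error propagate."""
    attempt = 0
    while True:
        try:
            return action()
        except TransientError:
            if attempt >= max_attempts:
                raise  # the final call's exception reaches the caller
            attempt += 1
            on_retry(attempt)  # one retry warning per retried call


calls, retries = [], []


def always_fails():
    calls.append(1)
    raise TransientError()


try:
    run_with_retries(always_fails, max_attempts=3, on_retry=retries.append)
except TransientError:
    pass

# With max_attempts = 3: three retried calls plus one final call.
assert len(calls) == 4
assert retries == [1, 2, 3]
```

Under this counting, a test that exhausts the retries should expect maxAttempts retry warnings and maxAttempts + 1 invocations of the underlying call.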

3 participants