Add per-file download retry with exponential backoff #63

Open
acolomba wants to merge 14 commits into main from claude/download-resiliency
Conversation

@acolomba
Owner

Summary

  • Adds --retry-count CLI option (default 3) controlling download attempts per file
  • Wraps HTTP download in retry loop with exponential backoff (1s, then 2s between the default 3 attempts) for transient network errors (URLError, socket.timeout, http.client.HTTPException)
  • Permanent errors (HTTPError) still immediately create .failed markers with no retry, preserving existing --retry-failed-after semantics
  • socket.timeout during file download no longer kills the entire sync run
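
The retry behavior described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fetch_with_retry`, `fetch`, `base_delay`, and the injectable `sleep` are hypothetical names. With three attempts there are only two waits (1s, then 2s), since no sleep follows the final failure.

```python
import http.client
import socket
import time
import urllib.error


def fetch_with_retry(fetch, retry_count=3, base_delay=1.0, sleep=time.sleep):
    """Calls fetch() up to retry_count times, backing off between attempts.

    Hypothetical helper: the names and signature are assumptions for illustration.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    for attempt in range(1, retry_count + 1):
        try:
            return fetch()
        except urllib.error.HTTPError:
            # Permanent error: propagates immediately, no retry.
            # This clause must come before URLError, which is its base class.
            raise
        except (urllib.error.URLError, socket.timeout, http.client.HTTPException):
            if attempt == retry_count:
                # Retries exhausted: lets the last transient error propagate.
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at `time.sleep`.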

Test plan

  • 6 new unit tests covering retry success, exhaustion, HTTPError bypass, socket.timeout handling, and retry-count=1
  • 1 new integration scenario verifying transient errors recover within a single sync run
  • All 118 unit tests pass
  • All 22 integration scenarios pass
  • All pre-commit hooks pass
  • Verify Docker image builds and RETRY_COUNT env var works

🤖 Generated with Claude Code

acolomba and others added 14 commits February 28, 2026 11:37
Defines per-file retry with exponential backoff for transient
network errors, while preserving existing .failed marker
semantics for permanent HTTP errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seven tasks covering CLI option, retry loop, unit tests,
mock server transient errors, integration tests, docs,
and final verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also catch http.client.HTTPException in the retry loop so
that IncompleteRead (partial response / connection drop) is
treated as a transient error and retried.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers http.client.IncompleteRead as a retryable transient
error and verifies retry_count < 1 is rejected by main().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TransientDownloadError raised when download retries exhaust
- Circuit breaker in sync() aborts after 3 consecutive transient
  recording failures (raises UserWarning for clean exit)
- Widen transient error catch to include OSError (covers
  ConnectionResetError, BrokenPipeError, ConnectionAbortedError)
- Clean up partial temp file after exhausted retries
- Replace unreachable return with AssertionError
- Fix comment style to third-person per project guidelines
- Fix backoff sequence in design doc (1s, 2s not 1s, 2s, 4s)
- Add unit tests for OSError retry, circuit breaker, temp cleanup,
  backoff values
- Add integration test for circuit breaker abort scenario

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
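
A rough sketch of the circuit breaker described in this commit, under stated assumptions: `sync_all`, `download`, and `max_consecutive_failures` are hypothetical names, and the real `sync()` may count and reset failures differently. Only `TransientDownloadError`, the threshold of 3, and the use of `UserWarning` come from the commit message.

```python
class TransientDownloadError(Exception):
    """Raised when all retries for one recording have been exhausted."""


def sync_all(recordings, download, max_consecutive_failures=3):
    """Downloads each recording, aborting after repeated transient failures.

    Hypothetical stand-in for sync(); returns the recordings that succeeded.
    """
    consecutive = 0
    downloaded = []
    for recording in recordings:
        try:
            download(recording)
        except TransientDownloadError:
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                # UserWarning is an Exception subclass, so it can be raised
                # and handled by the caller as a clean-exit condition.
                raise UserWarning(
                    f"aborting after {consecutive} consecutive transient failures"
                )
            continue
        # Any success resets the counter (an assumption of this sketch).
        consecutive = 0
        downloaded.append(recording)
    return downloaded
```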
Remove redundant URLError and socket.timeout from transient
exception catch (both are OSError subclasses). Extract _fetch_file()
helper to reduce download_file cognitive complexity below 15.
Fix entrypoint.sh to propagate blackvuesync.sh exit code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
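
The redundancy this commit removes can be checked directly against the standard library's exception hierarchy. In the snippet below, `TRANSIENT_ERRORS` is a hypothetical name for the resulting catch tuple, not necessarily what the PR uses. Note that `HTTPError` is itself an `OSError` subclass, so a handler for permanent HTTP errors has to come before the transient one.

```python
import http.client
import socket
import urllib.error

# Already covered by OSError, so listing them separately is redundant.
assert issubclass(urllib.error.URLError, OSError)
assert issubclass(socket.timeout, OSError)
assert issubclass(ConnectionResetError, OSError)
assert issubclass(BrokenPipeError, OSError)
assert issubclass(ConnectionAbortedError, OSError)

# Not an OSError, so it still has to be named explicitly.
assert not issubclass(http.client.HTTPException, OSError)
assert issubclass(http.client.IncompleteRead, http.client.HTTPException)

# HTTPError derives from URLError, hence from OSError as well:
# catch it first if it should bypass the retry path.
assert issubclass(urllib.error.HTTPError, urllib.error.URLError)

# Hypothetical name for the minimal transient-error tuple.
TRANSIENT_ERRORS = (OSError, http.client.HTTPException)
```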
The Dockerfile defaults CRON=1, so Docker integration tests were
always running with --cron even when not intended. This caused
the circuit breaker test to expect exit code 1 but get 0, since
UserWarning returns 0 in cron mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
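
The exit-code mismatch described in this commit can be illustrated with a small sketch. `exit_code_for` is a hypothetical function, not the project's actual `main()`; the only behavior taken from the commit message is that a `UserWarning` yields exit code 0 in cron mode, while the circuit breaker test expected 1. The code returned for other exceptions is an assumption.

```python
def exit_code_for(error, cron):
    """Maps the outcome of a sync run to a process exit code (hypothetical)."""
    if error is None:
        return 0
    if isinstance(error, UserWarning):
        # Per the commit message: a UserWarning returns 0 in cron mode.
        return 0 if cron else 1
    # Assumption of this sketch: any other error is a plain failure.
    return 1


abort = UserWarning("aborting after 3 consecutive transient failures")
print(exit_code_for(abort, cron=True))   # what the Docker tests got with CRON=1
print(exit_code_for(abort, cron=False))  # what the circuit breaker test expected
```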
Add venv binary usage guidance to CLAUDE.md from Codex
template. Move pytest permission to shared settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sonarqubecloud
sonarqubecloud bot commented Mar 3, 2026
