Skip to content

fix(test): remove time.Sleep from tests for deterministic synchronization#1968

Open
xmtp-coder-agent wants to merge 3 commits intoxmtp:mainfrom
xmtp-coder-agent:fix/issue-1963-sleeps
Open

fix(test): remove time.Sleep from tests for deterministic synchronization#1968
xmtp-coder-agent wants to merge 3 commits intoxmtp:mainfrom
xmtp-coder-agent:fix/issue-1963-sleeps

Conversation

@xmtp-coder-agent
Copy link
Copy Markdown
Collaborator

@xmtp-coder-agent xmtp-coder-agent commented Apr 10, 2026

Resolves #1963

Follow-up to #1966 that addresses mkysel's directive: "Make a new PR and get rid of the sleeps in tests." Replaces wall-clock time.Sleep calls in tests (and the production-test-support code they drive) with deterministic synchronization patterns so the suite no longer depends on timing for correctness.

Patterns applied

  • Injectable clock (now func() time.Time) in:
    • pkg/ratelimiter/redis_limiter.go — Lua script takes the client-provided timestamp, so overriding now fully controls refill math
    • pkg/ratelimiter/circuit_breaker.go
    • pkg/blockchain/oracle/oracle.go
    • pkg/tracing/context_store.go
    • pkg/testutils/redis/redis.go
    • Exposed to _test packages via export_test.go SetNowForTest(...) helpers so the public API stays clean.
  • Context-cancellation instead of delay: pkg/api/payer/publish_test.go slowServer now blocks on <-ctx.Done() driven by the client deadline instead of sleeping for a configured duration.
  • Channel barriers / require.Eventually for concurrent coordination across pkg/api/message, pkg/api/metadata, pkg/payerreport, pkg/db/sequences_test.go, pkg/db/pgx_test.go.
  • pg_stat_activity polling in pkg/db/payer_test.go race-condition test: waits for wait_event_type = 'Lock' on a payers query to deterministically detect that the concurrent FindOrCreatePayerWithRetry goroutine is blocked on T1's unique-index lock before committing T1.
  • Semaphore-aware barrier: pkg/blockchain/noncemanager/manager_test.go TestSimultaneousAllocation now caps numGoroutines = BestGuessConcurrency so all goroutines can actually reach the barrier (SQL/Redis backends cap concurrent nonce holders at 32 via semaphore — 50 goroutines deadlocked the prior barrier).
  • Error-tolerant require.Never predicates: e.g. pkg/payerreport/workers/integration_test.go TestCanGenerateReport now inlines the QueryEnvelopes RPC inside the predicate and returns false on error instead of calling a helper that fails the whole test on transient context-cancel errors.
  • Removed cargo-cult sleeps in pkg/api/payer/selectors/node_selector_test.go — the first GetNode call performs TCP latency probing synchronously inside updateLatencyCache, so the pre-call sleeps were doing nothing.

Files touched

pkg/api/message/export_test.go
pkg/api/message/publish_test.go
pkg/api/message/subscribe_test.go
pkg/api/message/subscribe_topics_test.go
pkg/api/metadata/cursor_test.go
pkg/api/payer/publish_test.go
pkg/api/payer/selectors/node_selector_test.go
pkg/blockchain/noncemanager/manager_test.go
pkg/blockchain/noncemanager/redis/manager_test.go
pkg/blockchain/oracle/oracle.go
pkg/blockchain/oracle/oracle_test.go
pkg/db/payer_test.go
pkg/db/pgx_test.go
pkg/db/sequences_test.go
pkg/payerreport/store_test.go
pkg/payerreport/workers/integration_test.go
pkg/payerreport/workers/settlement_test.go
pkg/ratelimiter/circuit_breaker.go
pkg/ratelimiter/circuit_breaker_test.go
pkg/ratelimiter/export_test.go   (new)
pkg/ratelimiter/redis_limiter.go
pkg/ratelimiter/redis_limiter_test.go
pkg/testutils/redis/redis.go
pkg/tracing/context_store.go
pkg/tracing/tracing_test.go

Verification

  • grep -rn 'time\.Sleep' --include='*_test.go' pkg/ returns only a comment reference (no live calls).
  • dev/lint-fix — clean (0 issues).
  • go build ./... — clean.
  • Stress runs of previously-sleeping / previously-flaky tests (-count/-race):
    • TestSimultaneousAllocation (x3)
    • TestFindOrCreatePayerWithRetry/race_condition (x10)
    • TestCanGenerateReport (x5)
    • pkg/api/message (x3)
    • Affected ratelimiter / oracle / tracing suites

Test plan

  • CI green on all packages
  • Full suite race-run clean
  • No remaining time.Sleep in *_test.go except comments

🤖 Generated with Claude Code

Note

Remove time.Sleep calls from tests and replace with deterministic synchronization

  • Replaces time.Sleep calls across the test suite with barrier-based synchronization: WaitGroups, channels, polling loops, and require.Never assertions.
  • Adds dispatchedMet and advanceDispatched to subscribeWorker and new test helpers AwaitCursor/AwaitGlobalListeners to message.Service so subscribe tests can block until the worker has dispatched specific envelopes.
  • Injects a fake now function into CircuitBreaker, RedisLimiter, and TraceContextStore so timing-sensitive tests advance a fake clock instead of sleeping.
  • Adds WithMaxStaleDuration option to oracle.New so oracle tests can force cache staleness without wall-clock delays.
  • Changes registrant.SignStagedEnvelope to derive expiry from OriginatorTime rather than time.Now(), making signed envelopes deterministic. Behavioral Change: expiry timestamps in signed envelopes are now relative to the originator's time, not the signing time.

Macroscope summarized f415714.

xmtp-coder-agent and others added 3 commits April 10, 2026 18:06
…tion

Replaces wall-clock waits in test and production-test-support code with
deterministic patterns (injectable clocks, channel barriers, context
cancellation, require.Eventually/Never, pg_stat_activity polling) so the
suite does not depend on timing for correctness.

Notable patterns:
- Injectable now() func in ratelimiter.RedisLimiter, ratelimiter.CircuitBreaker,
  blockchain/oracle, tracing context store, testutils/redis — tests drive the
  clock forward without sleeping.
- export_test.go test-only setters expose the unexported clock hooks.
- publish_test slowServer blocks on <-ctx.Done() instead of delay-based sleep.
- db race-condition test polls pg_stat_activity for wait_event_type='Lock'
  to deterministically detect a blocked INSERT.
- noncemanager TestSimultaneousAllocation caps goroutines at
  BestGuessConcurrency to avoid deadlocking on the backend's semaphore.
- require.Never predicates tolerate transient errors by returning false
  instead of failing inside nested helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SignStagedEnvelope computed ExpiryUnixtime from time.Now() on every call,
but PublishPayerEnvelopes signs the same staged envelope twice: once
synchronously to return to the client, and again inside the publish
worker before storing it. Crossing a wall-clock second boundary between
those two calls produced a 1-second expiry drift, which in turn produced
different signed bytes and a different originator signature — so the
envelope the client received in the publish response did not match what
was later queried back from the DB.

Basing expiry on stagedEnv.OriginatorTime (set once by the DB at staging)
makes SignStagedEnvelope a pure function of its inputs, so both call
sites produce byte-identical output. This also removes the race that was
causing TestEnvelopesGenerator to flake on `-race` CI runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sOnlyNewMessages

Two bugs caused the test to flake after removing its time.Sleeps:

1. AwaitCursor checked each poller's LastSeen, which advances as soon as
   the DB is read — before the batch is dispatched to listeners. A stream
   registered between "polled" and "dispatched" would retroactively receive
   pre-seeded envelopes.

   Fix: track a dispatched VectorClock on subscribeWorker that only
   advances after dispatchToGlobals/Originators/Topics returns. AwaitCursor
   now waits on this cursor, guaranteeing any listener registered after it
   returns will not see the dispatched batch.

2. generateEnvelopes produces `perNode` envelopes per originator. With 2
   nodes and perNode=10, streamBatch = sourceEnvelopes[initialBatchSize:]
   was 15 envelopes instead of 5, so a single dispatch of 10 envelopes
   would blow past the expected count of 5.

   Fix: slice streamBatch to exact bounds
   [initialBatchSize : initialBatchSize+streamSize] and rename
   totalMessages → perNode to reflect the actual semantics.

Verified: TestSubscribeAll_StreamsOnlyNewMessages passes 20x under -race,
full pkg/api/message suite passes under -race, lint clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines -97 to +104
time.Now().UTC().
stagedEnv.OriginatorTime.UTC().
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the most meaningful change. Seems like a net positive (shouldn't expiry time be based on the envelope's time and not insertion time), but worth thinking through the consequences of changing this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into the blast radius. Short version: it's a strict improvement, and expiry was actually already inconsistent with OriginatorNs before this change — which is what let the stress test flake.

The bug this fixes. PublishPayerEnvelopes signs the same staged envelope twice: once synchronously in service.go:730 to return to the client, and once asynchronously in publish_worker.go:315 before storing/syncing. Pre-fix, ExpiryUnixtime was computed from time.Now(), so if those two calls crossed a second boundary the two signatures differed. TestEnvelopesGenerator in pkg/stress caught this: it publishes (client receives bytes A) then queries back (server returns bytes B), and require.Equal fails on a 1-byte expiry varint diff + a completely different OriginatorSignature. That is — the client was getting a signature that did not match what was stored or synced to other nodes.

Was actually already inconsistent. OriginatorNs (line 99) has always been derived from stagedEnv.OriginatorTime.UnixNano() — the DB-set staging timestamp. Only ExpiryUnixtime was reaching for time.Now(). So within a single envelope you had "originator_ns = T0, retention_days = 30, expiry_unixtime = T0 + publish_queue_delay + 30d". This change aligns both fields on the same time source.

Consequences I walked through:

  1. Expiry is nominally earlier by the publish-queue delay — the gap between when the DB staged the row and when the publish worker signs it. In practice this is microseconds to a few seconds. Against 30-day retention that's 10⁻⁵ % earlier. pkg/prune/query.go compares expiry < EXTRACT(EPOCH FROM now()) so pruning just runs a few seconds earlier on the affected row — no change in semantics.

  2. No cross-node consensus concern. pkg/sync/envelope_sink.go ingests fully-signed envelopes from other originators and inserts them whole (Expiry: int64(expiry) from the incoming envelope). Receiving nodes never re-sign, so this is purely a per-originator semantic for envelopes that node signs. No consensus/quorum field is affected.

  3. No migration concern. Envelopes already in the DB keep their old expiry — this only changes how new envelopes get their expiry computed.

  4. Strictly improves clock-skew resilience. Pre-fix, OriginatorNs was DB now() and ExpiryUnixtime was node wall clock — if those drifted, the envelope was internally inconsistent. Post-fix both come from the same DB timestamp.

  5. SignStagedEnvelope is now a pure function of its inputs, which is what lets the double-sign in PublishPayerEnvelopes work correctly (and makes the function unit-testable without time mocking).

  6. Payer reports / fees unaffected. Fee calculation in prepareSingleEnvelope already uses stagedEnv.OriginatorTime for both CalculateBaseFee and CalculateCongestionFee — so the time source for fees, originator_ns, and now expiry is uniformly the staging timestamp.

No further code changes needed from my side; flagging so you can sanity-check the reasoning. If you'd rather have ExpiryUnixtime track wall-clock at signing (and instead fix the double-sign race by having the sync path reuse the signed bytes from the critical path), I'm happy to refactor — but that's a bigger change and I think the current approach is the better contract (deterministic, self-consistent envelopes).

@neekolas neekolas marked this pull request as ready for review April 13, 2026 17:42
@neekolas neekolas requested a review from a team as a code owner April 13, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test suite flakiness: 22 issues across race conditions, timing, and resource management

2 participants