fix(test): remove time.Sleep from tests for deterministic synchronization#1968
fix(test): remove time.Sleep from tests for deterministic synchronization#1968xmtp-coder-agent wants to merge 3 commits intoxmtp:mainfrom
Conversation
…tion Replaces wall-clock waits in test and production-test-support code with deterministic patterns (injectable clocks, channel barriers, context cancellation, require.Eventually/Never, pg_stat_activity polling) so the suite does not depend on timing for correctness. Notable patterns: - Injectable now() func in ratelimiter.RedisLimiter, ratelimiter.CircuitBreaker, blockchain/oracle, tracing context store, testutils/redis — tests drive the clock forward without sleeping. - export_test.go test-only setters expose the unexported clock hooks. - publish_test slowServer blocks on <-ctx.Done() instead of delay-based sleep. - db race-condition test polls pg_stat_activity for wait_event_type='Lock' to deterministically detect a blocked INSERT. - noncemanager TestSimultaneousAllocation caps goroutines at BestGuessConcurrency to avoid deadlocking on the backend's semaphore. - require.Never predicates tolerate transient errors by returning false instead of failing inside nested helpers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SignStagedEnvelope computed ExpiryUnixtime from time.Now() on every call, but PublishPayerEnvelopes signs the same staged envelope twice: once synchronously to return to the client, and again inside the publish worker before storing it. Crossing a wall-clock second boundary between those two calls produced a 1-second expiry drift, which in turn produced different signed bytes and a different originator signature — so the envelope the client received in the publish response did not match what was later queried back from the DB. Basing expiry on stagedEnv.OriginatorTime (set once by the DB at staging) makes SignStagedEnvelope a pure function of its inputs, so both call sites produce byte-identical output. This also removes the race that was causing TestEnvelopesGenerator to flake on `-race` CI runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sOnlyNewMessages Two bugs caused the test to flake after removing its time.Sleeps: 1. AwaitCursor checked each poller's LastSeen, which advances as soon as the DB is read — before the batch is dispatched to listeners. A stream registered between "polled" and "dispatched" would retroactively receive pre-seeded envelopes. Fix: track a dispatched VectorClock on subscribeWorker that only advances after dispatchToGlobals/Originators/Topics returns. AwaitCursor now waits on this cursor, guaranteeing any listener registered after it returns will not see the dispatched batch. 2. generateEnvelopes produces `perNode` envelopes per originator. With 2 nodes and perNode=10, streamBatch = sourceEnvelopes[initialBatchSize:] was 15 envelopes instead of 5, so a single dispatch of 10 envelopes would blow past the expected count of 5. Fix: slice streamBatch to exact bounds [initialBatchSize : initialBatchSize+streamSize] and rename totalMessages → perNode to reflect the actual semantics. Verified: TestSubscribeAll_StreamsOnlyNewMessages passes 20x under -race, full pkg/api/message suite passes under -race, lint clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| time.Now().UTC(). | ||
| stagedEnv.OriginatorTime.UTC(). |
There was a problem hiding this comment.
This seems like the most meaningful change. Seems like a net positive (shouldn't expiry time be based on the envelope's time and not insertion time), but worth thinking through the consequences of changing this.
There was a problem hiding this comment.
I dug into the blast radius. Short version: it's a strict improvement, and expiry was actually already inconsistent with OriginatorNs before this change — which is what let the stress test flake.
The bug this fixes. PublishPayerEnvelopes signs the same staged envelope twice: once synchronously in service.go:730 to return to the client, and once asynchronously in publish_worker.go:315 before storing/syncing. Pre-fix, ExpiryUnixtime was computed from time.Now(), so if those two calls crossed a second boundary the two signatures differed. TestEnvelopesGenerator in pkg/stress caught this: it publishes (client receives bytes A) then queries back (server returns bytes B), and require.Equal fails on a 1-byte expiry varint diff + a completely different OriginatorSignature. That is — the client was getting a signature that did not match what was stored or synced to other nodes.
Was actually already inconsistent. OriginatorNs (line 99) has always been derived from stagedEnv.OriginatorTime.UnixNano() — the DB-set staging timestamp. Only ExpiryUnixtime was reaching for time.Now(). So within a single envelope you had "originator_ns = T0, retention_days = 30, expiry_unixtime = T0 + publish_queue_delay + 30d". This change aligns both fields on the same time source.
Consequences I walked through:
-
Expiry is nominally earlier by the publish-queue delay — the gap between when the DB staged the row and when the publish worker signs it. In practice this is microseconds to a few seconds. Against 30-day retention that's 10⁻⁵ % earlier.
pkg/prune/query.gocomparesexpiry < EXTRACT(EPOCH FROM now())so pruning just runs a few seconds earlier on the affected row — no change in semantics. -
No cross-node consensus concern.
pkg/sync/envelope_sink.goingests fully-signed envelopes from other originators and inserts them whole (Expiry: int64(expiry)from the incoming envelope). Receiving nodes never re-sign, so this is purely a per-originator semantic for envelopes that node signs. No consensus/quorum field is affected. -
No migration concern. Envelopes already in the DB keep their old expiry — this only changes how new envelopes get their expiry computed.
-
Strictly improves clock-skew resilience. Pre-fix,
OriginatorNswas DBnow()andExpiryUnixtimewas node wall clock — if those drifted, the envelope was internally inconsistent. Post-fix both come from the same DB timestamp. -
SignStagedEnvelope is now a pure function of its inputs, which is what lets the double-sign in
PublishPayerEnvelopeswork correctly (and makes the function unit-testable without time mocking). -
Payer reports / fees unaffected. Fee calculation in
prepareSingleEnvelopealready usesstagedEnv.OriginatorTimefor bothCalculateBaseFeeandCalculateCongestionFee— so the time source for fees, originator_ns, and now expiry is uniformly the staging timestamp.
No further code changes needed from my side; flagging so you can sanity-check the reasoning. If you'd rather have ExpiryUnixtime track wall-clock at signing (and instead fix the double-sign race by having the sync path reuse the signed bytes from the critical path), I'm happy to refactor — but that's a bigger change and I think the current approach is the better contract (deterministic, self-consistent envelopes).
Resolves #1963
Follow-up to #1966 that addresses mkysel's directive: "Make a new PR and get rid of the sleeps in tests." Replaces wall-clock
time.Sleepcalls in tests (and the production-test-support code they drive) with deterministic synchronization patterns so the suite no longer depends on timing for correctness.Patterns applied
now func() time.Time) in:pkg/ratelimiter/redis_limiter.go— Lua script takes the client-provided timestamp, so overridingnowfully controls refill mathpkg/ratelimiter/circuit_breaker.gopkg/blockchain/oracle/oracle.gopkg/tracing/context_store.gopkg/testutils/redis/redis.go_testpackages viaexport_test.goSetNowForTest(...)helpers so the public API stays clean.pkg/api/payer/publish_test.goslowServernow blocks on<-ctx.Done()driven by the client deadline instead of sleeping for a configured duration.require.Eventuallyfor concurrent coordination acrosspkg/api/message,pkg/api/metadata,pkg/payerreport,pkg/db/sequences_test.go,pkg/db/pgx_test.go.pg_stat_activitypolling inpkg/db/payer_test.gorace-condition test: waits forwait_event_type = 'Lock'on apayersquery to deterministically detect that the concurrentFindOrCreatePayerWithRetrygoroutine is blocked on T1's unique-index lock before committing T1.pkg/blockchain/noncemanager/manager_test.goTestSimultaneousAllocationnow capsnumGoroutines = BestGuessConcurrencyso all goroutines can actually reach the barrier (SQL/Redis backends cap concurrent nonce holders at 32 via semaphore — 50 goroutines deadlocked the prior barrier).require.Neverpredicates: e.g.pkg/payerreport/workers/integration_test.goTestCanGenerateReportnow inlines the QueryEnvelopes RPC inside the predicate and returnsfalseon error instead of calling a helper that fails the whole test on transient context-cancel errors.pkg/api/payer/selectors/node_selector_test.go— the firstGetNodecall performs TCP latency probing synchronously insideupdateLatencyCache, so the pre-call sleeps were doing nothing.Files touched
Verification
grep -rn 'time\.Sleep' --include='*_test.go' pkg/returns only a comment reference (no live calls).dev/lint-fix— clean (0 issues).go build ./...— clean.-count/-race):TestSimultaneousAllocation(x3)TestFindOrCreatePayerWithRetry/race_condition(x10)TestCanGenerateReport(x5)pkg/api/message(x3)Test plan
time.Sleepin*_test.goexcept comments🤖 Generated with Claude Code
Note
Remove
time.Sleepcalls from tests and replace with deterministic synchronizationtime.Sleepcalls across the test suite with barrier-based synchronization: WaitGroups, channels, polling loops, andrequire.Neverassertions.dispatchedMetandadvanceDispatchedtosubscribeWorkerand new test helpersAwaitCursor/AwaitGlobalListenerstomessage.Serviceso subscribe tests can block until the worker has dispatched specific envelopes.nowfunction intoCircuitBreaker,RedisLimiter, andTraceContextStoreso timing-sensitive tests advance a fake clock instead of sleeping.WithMaxStaleDurationoption tooracle.Newso oracle tests can force cache staleness without wall-clock delays.registrant.SignStagedEnvelopeto derive expiry fromOriginatorTimerather thantime.Now(), making signed envelopes deterministic. Behavioral Change: expiry timestamps in signed envelopes are now relative to the originator's time, not the signing time.Macroscope summarized f415714.