What we observed
After adcp#3346 added per-route `InMemoryReplayStore` instances, the conformance grader's vector `neg/016-replayed-nonce` continued to fail intermittently against `https://agenticadvertising.org/api/training-agent/mcp-strict`. The diagnostic was:
```
neg/016-replayed-nonce FAIL expected 401 with error="request_signature_replayed",
got 200 (error="(none)").
Vector: Second submission of a previously-accepted (keyid, nonce) within the replay window
```
Accurate, but not actionable. We chased in-process state-sharing bugs first (adcp#3338) before recognizing the actual cause: the agent runs min_machines_running = 2 web instances behind Fly's load balancer, and `InMemoryReplayStore` is per-process. The grader's two probes hit different machines on different LB rolls; the second machine had no record of the first nonce.
The fix was adcp#3351 — swap to PostgresReplayStore so all instances share one cache. But the architectural cause was only diagnosable by knowing the deployment topology, not from the grader output.
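To make the failure mode concrete, here is a minimal sketch of the contract a replay store has to satisfy. The interface and names are illustrative assumptions, not the actual `@adcp/client` API; the point is that `checkAndRecord` must be atomic and shared across every process serving the route, and a per-process map is correct on one instance but silently wrong behind a load balancer:

```typescript
// Hypothetical replay-store contract; names are illustrative, not the
// real @adcp/client API.
interface ReplayStore {
  // Returns true if (keyId, nonce) was unseen and is now recorded;
  // false if it was already used within the replay window.
  checkAndRecord(keyId: string, nonce: string, windowMs: number): Promise<boolean>;
}

// Per-process store: each process has its own `seen` map, so a nonce
// accepted by one instance is unknown to its siblings.
class InMemoryReplayStore implements ReplayStore {
  private seen = new Map<string, number>();

  async checkAndRecord(keyId: string, nonce: string, windowMs: number): Promise<boolean> {
    const key = `${keyId}:${nonce}`;
    const now = Date.now();
    const prev = this.seen.get(key);
    if (prev !== undefined && now - prev < windowMs) {
      return false; // replay within the window: reject
    }
    this.seen.set(key, now);
    return true; // first use: accept and record
  }
}
```

A Postgres- or Redis-backed implementation of the same interface moves the `seen` state out of the process, which is exactly what the adcp#3351 swap accomplished.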
Why this matters
Adopters running >1 instance behind any load balancer (Cloudflare → multiple origins, Fly anycast, AWS ALB → autoscaled target group, k8s deployment with replicas > 1) will hit this. Today they:
- See vector 016 fail intermittently (when both probes land on the same instance the vector passes; when they don't, it fails).
- Get a "got 200, expected 401" diagnostic that points at their verifier code, not their replay-store topology.
- Spend hours chasing in-process bugs.
The grader has all the information needed to surface the right hypothesis cheaply.
Two suggested enhancements
1. Run K pairs with forced connection rotation
Instead of one (probe1, probe2) pair, run K pairs over separate keep-alive connections (or short-lived sockets). On a single-instance verifier all K pairs reject. On a multi-instance verifier a pair passes only when both probes land on the same instance, which happens with probability roughly 1/N where N is the instance count, so K=10 exposes the bug with near certainty against any realistic deployment.
Numerically: 2 instances + K=10 → ~999/1000 chance the bug is detected; 4 instances + K=10 → miss probability (1/4)^10, roughly one in a million. The runtime cost is acceptable (10x the current single pair), and the user gets a count of how often the second probe was accepted rather than a single coin flip.
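The arithmetic behind those numbers is easy to check. Assuming probes land uniformly at random across N instances (an assumption; real load balancers may be stickier), a pair misses the bug only when its second probe lands on the same instance as the first, so all K pairs miss with probability (1/N)^K:

```typescript
// Probability that K independent probe pairs ALL fail to expose a
// per-process replay store on an N-instance pool, assuming uniform
// load balancing.
function missProbability(instances: number, pairs: number): number {
  // Each pair misses only when its second probe hits the SAME
  // instance as the first: probability 1/N per pair.
  return Math.pow(1 / instances, pairs);
}

console.log(missProbability(2, 10)); // 0.0009765625, i.e. detection ~999/1000
console.log(missProbability(4, 10)); // ≈ 9.54e-7, detection effectively certain
```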
2. Smarter diagnostic on (200, 200) outcomes
Augment the FAIL message with:
```
Both submissions accepted. If your verifier runs more than one process or
machine instance, this likely means the replay-store state is per-process
and not shared across the load-balanced pool. The default
`InMemoryReplayStore` is per-process; for distributed deployments use
`PostgresReplayStore` from `@adcp/client/signing/server` (or a Redis-
backed equivalent). See #1018
```
The grader doesn't need to verify the topology — pointing at the most-common architectural cause makes the diagnostic self-routing.
Combined: the K-pair rejection rate becomes the smoking gun
- 0/K pairs rejected: the verifier is broken (no replay protection at all).
- 1/K to (K-1)/K rejected: the classic multi-instance signal. Surface the count and the cross-instance hypothesis.
- K/K rejected: the vector PASSes.
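The triage above is cheap to implement grader-side. A sketch, with function and message wording invented here rather than taken from the grader's codebase:

```typescript
// Sketch of grader-side triage on K replay pairs; names and message
// text are illustrative, not the grader's actual output format.
type ReplayVerdict =
  | { verdict: "PASS" }
  | { verdict: "FAIL"; hint: string };

function triageReplayPairs(rejected: number, total: number): ReplayVerdict {
  if (rejected === total) {
    // Every replayed nonce was rejected: replay protection works.
    return { verdict: "PASS" };
  }
  if (rejected === 0) {
    return {
      verdict: "FAIL",
      hint: "No replayed nonce was rejected: the verifier appears to have no replay protection at all.",
    };
  }
  // Partial rejection: some pairs happened to hit the same instance twice.
  return {
    verdict: "FAIL",
    hint:
      `${rejected}/${total} pairs rejected: partial rejection is the classic multi-instance signal. ` +
      "Check whether the replay store is per-process (e.g. in-memory) behind a load-balanced pool.",
  };
}
```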
Adopter context
Filed by the AAO training agent's first GCP KMS adopter (#3283 / #3308). Caught and worked around in adcp#3351, but the next adopter shouldn't have to invent the diagnosis. Sister issues from this rollout: #1020 (closed), #1022 (closed), #1025 (closed), #1031 (open — capability-gating skip).