bug(grader): neg/016 fails non-deterministically against multi-instance verifiers — diagnostic should surface cross-instance replay-state as the likely cause #1032

@bokelley

Description

What we observed

After adcp#3346 added per-route InMemoryReplayStore instances, the conformance grader's vector `neg/016-replayed-nonce` continued to fail intermittently against `https://agenticadvertising.org/api/training-agent/mcp-strict`. The diagnostic was:

```
neg/016-replayed-nonce FAIL expected 401 with error="request_signature_replayed",
got 200 (error="(none)").
Vector: Second submission of a previously-accepted (keyid, nonce) within the replay window
```

Accurate, but not actionable. We chased in-process state-sharing bugs first (adcp#3338) before recognizing the actual cause: the agent runs min_machines_running = 2 web instances behind Fly's load balancer, and `InMemoryReplayStore` is per-process. The grader's two probes hit different machines on different LB rolls; the second machine had no record of the first nonce.

The fix was adcp#3351 — swap to PostgresReplayStore so all instances share one cache. But the architectural cause was only diagnosable by knowing the deployment topology, not from the grader output.
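The root cause is easy to reproduce in miniature: two per-process stores behind a round-robin router never see each other's nonces. A minimal sketch (the store class and the round-robin `submit` helper here are illustrative stand-ins, not the library's actual API):

```typescript
// Stand-in for a per-process replay store: each instance only remembers
// the (keyid, nonce) pairs it has personally accepted.
class PerProcessReplayStore {
  private seen = new Set<string>();
  /** Returns true if accepted (first sighting), false if replayed. */
  checkAndRecord(keyid: string, nonce: string): boolean {
    const key = `${keyid}:${nonce}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Two "machines" behind a round-robin load balancer.
const instances = [new PerProcessReplayStore(), new PerProcessReplayStore()];
let next = 0;
const submit = (keyid: string, nonce: string) =>
  instances[next++ % instances.length].checkAndRecord(keyid, nonce);

// The grader's two probes land on different instances:
const first = submit("key-1", "nonce-abc");  // accepted by instance 0
const second = submit("key-1", "nonce-abc"); // also accepted, by instance 1
console.log(first, second); // true true — i.e. neg/016 sees "got 200"
```

A shared store (Postgres, Redis) collapses `instances` back into one set, which is exactly what adcp#3351 did.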

Why this matters

Adopters running >1 instance behind any load balancer (Cloudflare → multiple origins, Fly anycast, AWS ALB → autoscaled target group, k8s deployment with replicas > 1) will hit this. Today they:

  • See vector 016 fail intermittently (sometimes both probes land on the same instance, vector passes; sometimes they don't, vector fails).
  • Get a "got 200, expected 401" diagnostic that points at their verifier code, not their replay-store topology.
  • Spend hours chasing in-process bugs.

The grader has all the information needed to surface the right hypothesis cheaply.

Two suggested enhancements

1. Run N pairs with forced connection rotation

Instead of one (probe1, probe2) pair, run K pairs, each over its own keep-alive connection (or a short-lived socket) so the load balancer is free to route each probe independently. On a single-instance verifier all K pairs reject the replay. On a multi-instance verifier with N instances and uniform routing, a pair only passes when both probes land on the same instance (probability roughly 1/N), so K=10 makes the bug detectable with near certainty against any realistic deployment.
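The runner itself is small. A sketch, with the transport injected so it can be simulated here (`runReplayPairs` and `submitNonce` are hypothetical names; in the real grader each call would use a fresh connection, no keep-alive, to encourage the load balancer to rotate instances):

```typescript
// Run K (probe1, probe2) pairs and count how many pairs correctly
// rejected the replayed nonce. `submitNonce` stands in for one HTTP
// probe and resolves to the HTTP status code.
async function runReplayPairs(
  k: number,
  submitNonce: (keyid: string, nonce: string) => Promise<number>,
): Promise<{ rejected: number; accepted: number }> {
  let rejected = 0;
  for (let i = 0; i < k; i++) {
    const nonce = `vec016-${i}-${Date.now()}`;
    await submitNonce("grader-key", nonce);                // first use: expect 200
    const replay = await submitNonce("grader-key", nonce); // replay: want 401
    if (replay === 401) rejected++;
  }
  return { rejected, accepted: k - rejected };
}
```

Against a simulated two-instance round-robin pool, every pair's two probes land on different instances, so `rejected` comes back 0 — the worst case this enhancement is designed to make visible.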

Numerically: with 2 instances and K=10, the chance of missing the bug is (1/2)^10 ≈ 1/1000; with 4 instances it drops to (1/4)^10 ≈ 10^-6. The runtime cost is acceptable (10x the current single pair), and the user gets a count of how often the second probe was accepted rather than a single coin flip.
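The arithmetic behind those numbers, under the assumption that each probe is routed independently and uniformly:

```typescript
// A pair only *fails* to expose the bug when both probes hit the same
// instance (probability 1/N), so the miss probability over K independent
// pairs is (1/N)^K and the detection probability is its complement.
function detectionProbability(nInstances: number, kPairs: number): number {
  return 1 - Math.pow(1 / nInstances, kPairs);
}

console.log(detectionProbability(2, 10)); // ≈ 0.9990 (miss ≈ 1/1024)
console.log(detectionProbability(4, 10)); // ≈ 0.999999 (miss ≈ 1e-6)
```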

2. Smarter diagnostic on (200, 200) outcomes

Augment the FAIL message with:

```
Both submissions accepted. If your verifier runs more than one process or
machine instance, this likely means the replay-store state is per-process
and not shared across the load-balanced pool. The default
`InMemoryReplayStore` is per-process; for distributed deployments use
`PostgresReplayStore` from `@adcp/client/signing/server` (or a Redis-
backed equivalent). See #1018
```

The grader doesn't need to verify the topology — pointing at the most-common architectural cause makes the diagnostic self-routing.

Combined: the K-pair rejection rate becomes the smoking gun

  • 0/K pairs rejected: the verifier is broken (no replay protection at all).
  • 1/K to (K-1)/K rejected: classic multi-instance signal. Surface the count and the cross-instance hypothesis.
  • K/K rejected: the vector PASSes.
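That decision table is small enough to state directly (function name, verdict shape, and hint wording are illustrative, not the grader's actual output format):

```typescript
type ReplayVerdict =
  | { result: "PASS" }
  | { result: "FAIL"; hint: string };

// Map the K-pair rejection count onto the three outcomes above.
function classifyReplayVector(rejected: number, k: number): ReplayVerdict {
  if (rejected === k) return { result: "PASS" };
  if (rejected === 0) {
    return {
      result: "FAIL",
      hint: "0/" + k + " replays rejected: the verifier appears to have no replay protection at all.",
    };
  }
  return {
    result: "FAIL",
    hint: `${rejected}/${k} replays rejected: partial rejection is the classic ` +
      "multi-instance signal. Check whether the replay store is per-process " +
      "(e.g. the default InMemoryReplayStore) rather than shared across the pool.",
  };
}
```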

Adopter context

Filed by the AAO training agent's first GCP KMS adopter (#3283 / #3308). Caught and worked around in adcp#3351, but the next adopter shouldn't have to invent the diagnosis. Sister issues from this rollout: #1020 (closed), #1022 (closed), #1025 (closed), #1031 (open — capability-gating skip).
