bug(grader): neg/016 fails non-deterministically against multi-instance verifiers — diagnostic should surface cross-instance replay-state as the likely cause #1032

@bokelley

Description

What we observed

After adcp#3346 added per-route InMemoryReplayStore instances, the conformance grader's vector `neg/016-replayed-nonce` continued to fail intermittently against `https://agenticadvertising.org/api/training-agent/mcp-strict`. The diagnostic was:

```
neg/016-replayed-nonce FAIL expected 401 with error="request_signature_replayed",
got 200 (error="(none)").
Vector: Second submission of a previously-accepted (keyid, nonce) within the replay window
```

Accurate, but not actionable. We chased in-process state-sharing bugs first (adcp#3338) before recognizing the actual cause: the agent runs min_machines_running = 2 web instances behind Fly's load balancer, and `InMemoryReplayStore` is per-process. The grader's two probes hit different machines on different LB rolls; the second machine had no record of the first nonce.

The fix was adcp#3351 — swap to PostgresReplayStore so all instances share one cache. But the architectural cause was only diagnosable by knowing the deployment topology, not from the grader output.
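The root cause is easy to reproduce in miniature: two per-process stores behind a round-robin router never see each other's nonces. A minimal sketch (the store class and the round-robin `submit` helper here are illustrative stand-ins, not the library's actual API):

```typescript
// Stand-in for a per-process replay store: each instance only remembers
// the (keyid, nonce) pairs it has personally accepted.
class PerProcessReplayStore {
  private seen = new Set<string>();
  /** Returns true if accepted (first sighting), false if replayed. */
  checkAndRecord(keyid: string, nonce: string): boolean {
    const key = `${keyid}:${nonce}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Two "machines" behind a round-robin load balancer.
const instances = [new PerProcessReplayStore(), new PerProcessReplayStore()];
let next = 0;
const submit = (keyid: string, nonce: string) =>
  instances[next++ % instances.length].checkAndRecord(keyid, nonce);

// The grader's two probes land on different instances:
const first = submit("key-1", "nonce-abc");  // accepted by instance 0
const second = submit("key-1", "nonce-abc"); // also accepted, by instance 1
console.log(first, second); // true true — i.e. neg/016 sees "got 200"
```

A shared store (Postgres, Redis) collapses `instances` back into one set, which is exactly what adcp#3351 did.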

Why this matters

Adopters running >1 instance behind any load balancer (Cloudflare → multiple origins, Fly anycast, AWS ALB → autoscaled target group, k8s deployment with replicas > 1) will hit this. Today they:

  • See vector 016 fail intermittently (sometimes both probes land on the same instance, vector passes; sometimes they don't, vector fails).
  • Get a "got 200, expected 401" diagnostic that points at their verifier code, not their replay-store topology.
  • Spend hours chasing in-process bugs.

The grader has all the information needed to surface the right hypothesis cheaply.

Two suggested enhancements

1. Run N pairs with forced connection rotation

Instead of one (probe1, probe2) pair, run K pairs, each over its own keep-alive connection (or a short-lived socket) so the load balancer is free to route each probe independently. On a single-instance verifier all K pairs reject the replay. On a multi-instance verifier with N instances and uniform routing, a pair only passes when both probes land on the same instance (probability roughly 1/N), so K=10 makes the bug detectable with near certainty against any realistic deployment.
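The runner itself is small. A sketch, with the transport injected so it can be simulated here (`runReplayPairs` and `submitNonce` are hypothetical names; in the real grader each call would use a fresh connection, no keep-alive, to encourage the load balancer to rotate instances):

```typescript
// Run K (probe1, probe2) pairs and count how many pairs correctly
// rejected the replayed nonce. `submitNonce` stands in for one HTTP
// probe and resolves to the HTTP status code.
async function runReplayPairs(
  k: number,
  submitNonce: (keyid: string, nonce: string) => Promise<number>,
): Promise<{ rejected: number; accepted: number }> {
  let rejected = 0;
  for (let i = 0; i < k; i++) {
    const nonce = `vec016-${i}-${Date.now()}`;
    await submitNonce("grader-key", nonce);                // first use: expect 200
    const replay = await submitNonce("grader-key", nonce); // replay: want 401
    if (replay === 401) rejected++;
  }
  return { rejected, accepted: k - rejected };
}
```

Against a simulated two-instance round-robin pool, every pair's two probes land on different instances, so `rejected` comes back 0 — the worst case this enhancement is designed to make visible.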

Numerically: with 2 instances and K=10, the chance of missing the bug is (1/2)^10 ≈ 1/1000; with 4 instances it drops to (1/4)^10 ≈ 10^-6. The runtime cost is acceptable (10x the current single pair), and the user gets a count of how often the second probe was accepted rather than a single coin flip.
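The arithmetic behind those numbers, under the assumption that each probe is routed independently and uniformly:

```typescript
// A pair only *fails* to expose the bug when both probes hit the same
// instance (probability 1/N), so the miss probability over K independent
// pairs is (1/N)^K and the detection probability is its complement.
function detectionProbability(nInstances: number, kPairs: number): number {
  return 1 - Math.pow(1 / nInstances, kPairs);
}

console.log(detectionProbability(2, 10)); // ≈ 0.9990 (miss ≈ 1/1024)
console.log(detectionProbability(4, 10)); // ≈ 0.999999 (miss ≈ 1e-6)
```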

2. Smarter diagnostic on (200, 200) outcomes

Augment the FAIL message with:

```
Both submissions accepted. If your verifier runs more than one process or
machine instance, this likely means the replay-store state is per-process
and not shared across the load-balanced pool. The default
`InMemoryReplayStore` is per-process; for distributed deployments use
`PostgresReplayStore` from `@adcp/client/signing/server` (or a Redis-
backed equivalent). See #1018
```

The grader doesn't need to verify the topology — pointing at the most-common architectural cause makes the diagnostic self-routing.

Combined: the K-pair rejection rate becomes the smoking gun

  • 0/K pairs rejected: the verifier is broken (no replay protection at all).
  • 1/K to (K-1)/K rejected: classic multi-instance signal. Surface the count and the cross-instance hypothesis.
  • K/K rejected: the vector PASSes.
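That decision table is small enough to state directly (function name, verdict shape, and hint wording are illustrative, not the grader's actual output format):

```typescript
type ReplayVerdict =
  | { result: "PASS" }
  | { result: "FAIL"; hint: string };

// Map the K-pair rejection count onto the three outcomes above.
function classifyReplayVector(rejected: number, k: number): ReplayVerdict {
  if (rejected === k) return { result: "PASS" };
  if (rejected === 0) {
    return {
      result: "FAIL",
      hint: "0/" + k + " replays rejected: the verifier appears to have no replay protection at all.",
    };
  }
  return {
    result: "FAIL",
    hint: `${rejected}/${k} replays rejected: partial rejection is the classic ` +
      "multi-instance signal. Check whether the replay store is per-process " +
      "(e.g. the default InMemoryReplayStore) rather than shared across the pool.",
  };
}
```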

Adopter context

Filed by the AAO training agent's first GCP KMS adopter (#3283 / #3308). Caught and worked around in adcp#3351, but the next adopter shouldn't have to invent the diagnosis. Sister issues from this rollout: #1020 (closed), #1022 (closed), #1025 (closed), #1031 (open — capability-gating skip).
