fix: heartbeat timing bugs — reset on sell, missing activation, ConfigMap race#281

Closed
bussyjd wants to merge 7 commits into main from fix/timing

bussyjd (Collaborator) commented Mar 19, 2026

Closes #280

Summary

Fixes 4 interrelated heartbeat timing bugs discovered by pi-autoresearch automated user-flow validation (44 → 90/90 steps passing).

Changes

Bug 1: SyncAgentBaseURL resets heartbeat on every obol sell http

obol sell http → EnsureTunnelForSell → SyncAgentBaseURL → helmfile sync
                                                              ↓
                                              ConfigMap re-rendered WITHOUT heartbeat
                                                              ↓
                                              Agent falls back to 30m default

Added patchHeartbeatAfterSync() to re-inject the heartbeat after each sync, and made the sync idempotent (it is skipped when the URL is unchanged).

Bug 2: Heartbeat not activated after obol agent init

Added ensureHeartbeatActive() — reads ConfigMap, patches every: "5m" if missing.

Bug 3: Chokidar misses ConfigMap symlink swaps

K8s updates ConfigMaps via symlink swap. Chokidar inotify doesn't detect this. Added rollout restart after patch for deterministic behavior.

Bug 4: Unnecessary pod restart removed

The heartbeat patch in patchHeartbeatConfig was restarting the pod even though OpenClaw's hot-reload picks up the change. The restart was removed.

Test plan

  • pi-autoresearch: 90/90 steps, 39 runs, 37 kept
  • flow-06: obol sell http → heartbeat fires within 5 min → ServiceOffer Ready
  • flow-09: full lifecycle (sell → stop → delete → cleanup verified)
  • Manual: obol stack up && obol sell http test --wallet ... --chain base-sepolia --per-request 0.001 --namespace llm --upstream ollama --port 11434 → check obol sell status test -n llm within 5 min

bussyjd added 7 commits March 19, 2026 02:56
- flow-01: remove unnecessary sudo check
- flow-02: poll for pods ready (60x5s) + longer frontend poll
- flow-03: exec into litellm pod for in-cluster test (not kubectl run)
           poll for port-forward ready before inference
- flow-04: poll for port-forward ready before agent inference
- flow-05: skip local node deploy; test eRPC RPC gateway instead
           correct URL format: /rpc/evm/<chainId>
- flow-06: wait for obol-agent heartbeat to reconcile (96x5s=8min)
           use poll_step_grep on 'obol sell list' for READY=True
- flow-07: 402 via local Traefik + tunnel, verifier metrics
- flow-08: buy flow with blockrun-llm + Foundry balance checks
- flow-09: lifecycle (list, status, stop, delete, verify cleanup)
- flow-10: Anvil + facilitator setup for paid flows
- lib.sh: add poll_step_grep helper
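
The polling pattern used by flow-02 and flow-06 (a fixed number of attempts at a fixed interval, succeeding once the command output matches) can be sketched in Go as below. The real poll_step_grep helper is a shell function in lib.sh and is not shown on this page, so this is a hypothetical equivalent in a different language. The name `pollGrep`, its signature, and the stand-in command output are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// pollGrep calls run up to attempts times, sleeping interval between tries,
// and returns the 1-based attempt on which the output matched pattern.
func pollGrep(attempts int, interval time.Duration, pattern string, run func() (string, error)) (int, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return 0, fmt.Errorf("bad pattern: %w", err)
	}
	for i := 1; i <= attempts; i++ {
		// A failing command is treated like a non-match and retried.
		if out, err := run(); err == nil && re.MatchString(out) {
			return i, nil
		}
		if i < attempts {
			time.Sleep(interval)
		}
	}
	return 0, fmt.Errorf("no match for %q after %d attempts", pattern, attempts)
}

func main() {
	calls := 0
	// Stand-in for a status command: the offer reports ready on the third call.
	run := func() (string, error) {
		calls++
		if calls < 3 {
			return "READY=False", nil
		}
		return "READY=True", nil
	}
	// flow-06 uses 96 attempts x 5s; a short interval keeps the demo fast.
	n, err := pollGrep(96, 10*time.Millisecond, "READY=True", run)
	fmt.Println(n, err)
}
```
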
…LiteLLM inference timeout (flow-03), ServiceOffer not reconciled (flow-06), 404 on /services (flow-07/08), x402 metrics missing (flow-07), false passes on cast balance checks (flow-10/08).

Result: {"status":"keep","steps_passed":44,"total_steps":57}
patchHeartbeatConfig in doSync patches the openclaw-config ConfigMap
AFTER helmfile sync starts the pod, so the pod loads config without
heartbeat and cron/jobs.json stays empty.  Two fixes:

1. doSync/patchHeartbeatConfig: rollout-restart openclaw deployment
   after patching the ConfigMap, so the pod re-reads it on startup.

2. agent.Init/ensureHeartbeatActive: new idempotent helper that
   - reads the live ConfigMap and checks for agents.defaults.heartbeat
   - patches it if missing (every: 5m, target: none)
   - rollout-restarts the deployment + waits for rollout

   This covers the 'already running' case where flow-02 skips stack
   init, doSync is never called, and the pod was never restarted with
   heartbeat config.  obol agent init is called every run (flow-04),
   so ensureHeartbeatActive fires on every iteration.
…file sync

SyncAgentBaseURL (called on every tunnel start/obol sell http) runs a raw
helmfile sync that renders openclaw-config from the chart template, which
does not include agents.defaults.heartbeat.  This silently resets the
interval back to the 30m default, preventing the heartbeat from firing
within the 8-minute flow-06 poll window.

Add patchHeartbeatAfterSync() which mirrors openclaw.patchHeartbeatConfig()
but lives in the tunnel package to avoid a circular import.  It reads
values-obol.yaml for the heartbeat every/target, reads the live ConfigMap,
injects agents.defaults.heartbeat, and applies via server-side kubectl.
OpenClaw hot-reloads the change (~30-60s) — no pod restart needed.
…nchanged

Every obol sell pricing and obol sell http invocation triggers EnsureTunnelForSell,
which calls SyncAgentBaseURL unconditionally. This resets the ConfigMap
(removing the heartbeat config) on EVERY sell command, even when the URL is unchanged.

Add readCurrentAgentBaseURL() to read the current value from the overlay.
Skip the sync if the URL matches. This avoids unnecessary ConfigMap resets and prevents
the heartbeat interval from reverting to the 30m default on each sell command.
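
The skip-when-unchanged guard can be sketched in Go as below. It is a simplified stand-in, not the PR's implementation. `syncIfChanged` is a hypothetical name, the current URL is passed in directly rather than read from the overlay, and the sync step is an injected function instead of a real helmfile sync.

```go
package main

import "fmt"

// syncIfChanged runs sync only when desired differs from current.
// It reports whether a sync completed.
func syncIfChanged(current, desired string, sync func(url string) error) (bool, error) {
	if current == desired {
		return false, nil // nothing to do: leave the ConfigMap alone
	}
	if err := sync(desired); err != nil {
		return false, fmt.Errorf("sync agent base URL: %w", err)
	}
	return true, nil
}

func main() {
	calls := 0
	sync := func(string) error { calls++; return nil }

	current := "https://a.example"
	for _, desired := range []string{"https://a.example", "https://a.example", "https://b.example"} {
		ran, err := syncIfChanged(current, desired, sync)
		fmt.Println(desired, ran, err)
	}
	fmt.Println("sync calls:", calls)
}
```

An empty current value (for example, when the overlay has no URL yet) never equals a real URL, so the first sync still runs.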
…hot reload

openclaw.go patchHeartbeatConfig: remove pod rollout-restart that was
incorrectly added. OpenClaw hot-reloads ConfigMap file changes within
~30-60s via its built-in file watcher, no restart needed.

agent.go ensureHeartbeatActive: new idempotent helper that patches the
openclaw-config ConfigMap if agents.defaults.heartbeat is missing.
Called by obol agent init to handle 'already running' clusters where
doSync was never called this session. Only patches if missing, then
lets OpenClaw hot reload handle the interval change.

bussyjd (Collaborator, Author) commented Mar 19, 2026

Superseded by #282 which includes these fixes in a clean single commit alongside the autoresearch flow scripts.

bussyjd closed this Mar 19, 2026