fix: heartbeat timing bugs — reset on sell, missing activation, ConfigMap race#281
Closed
fix: heartbeat timing bugs — reset on sell, missing activation, ConfigMap race#281
Conversation
- flow-01: remove unnecessary sudo check
- flow-02: poll for pods ready (60x5s) + longer frontend poll
- flow-03: exec into litellm pod for in-cluster test (not kubectl run)
poll for port-forward ready before inference
- flow-04: poll for port-forward ready before agent inference
- flow-05: skip local node deploy; test eRPC RPC gateway instead
correct URL format: /rpc/evm/<chainId>
- flow-06: wait for obol-agent heartbeat to reconcile (96x5s=8min)
use poll_step_grep on 'obol sell list' for READY=True
- flow-07: 402 via local Traefik + tunnel, verifier metrics
- flow-08: buy flow with blockrun-llm + Foundry balance checks
- flow-09: lifecycle (list, status, stop, delete, verify cleanup)
- flow-10: Anvil + facilitator setup for paid flows
- lib.sh: add poll_step_grep helper
…LiteLLM inference timeout (flow-03), ServiceOffer not reconciled (flow-06), 404 on /services (flow-07/08), x402 metrics missing (flow-07), false passes on cast balance checks (flow-10/08).
Result: {"status":"keep","steps_passed":44,"total_steps":57}
patchHeartbeatConfig in doSync patches the openclaw-config ConfigMap AFTER helmfile sync starts the pod, so the pod loads config without heartbeat and cron/jobs.json stays empty. Two fixes: 1. doSync/patchHeartbeatConfig: rollout-restart openclaw deployment after patching the ConfigMap, so the pod re-reads it on startup. 2. agent.Init/ensureHeartbeatActive: new idempotent helper that - reads the live ConfigMap and checks for agents.defaults.heartbeat - patches it if missing (every: 5m, target: none) - rollout-restarts the deployment + waits for rollout This covers the 'already running' case where flow-02 skips stack init, doSync is never called, and the pod was never restarted with heartbeat config. obol agent init is called every run (flow-04), so ensureHeartbeatActive fires on every iteration.
…file sync SyncAgentBaseURL (called on every tunnel start/obol sell http) runs a raw helmfile sync that renders openclaw-config from the chart template, which does not include agents.defaults.heartbeat. This silently resets the interval back to the 30m default, preventing the heartbeat from firing within the 8-minute flow-06 poll window. Add patchHeartbeatAfterSync() which mirrors openclaw.patchHeartbeatConfig() but lives in the tunnel package to avoid a circular import. It reads values-obol.yaml for the heartbeat every/target, reads the live ConfigMap, injects agents.defaults.heartbeat, and applies via server-side kubectl. OpenClaw hot-reloads the change (~30-60s) — no pod restart needed.
…nchanged Every obol sell pricing + obol sell http triggers EnsureTunnelForSell which calls SyncAgentBaseURL unconditionally, resetting the ConfigMap (removing heartbeat config) on EVERY sell command, even when URL unchanged. Add readCurrentAgentBaseURL() to read the current value from overlay. Skip sync if URL matches: avoids unnecessary ConfigMap resets and prevents the heartbeat interval from reverting to 30m default on each sell command.
…hot reload openclaw.go patchHeartbeatConfig: remove pod rollout-restart that was incorrectly added. OpenClaw hot-reloads ConfigMap file changes within ~30-60s via its built-in file watcher, no restart needed. agent.go ensureHeartbeatActive: new idempotent helper that patches the openclaw-config ConfigMap if agents.defaults.heartbeat is missing. Called by obol agent init to handle 'already running' clusters where doSync was never called this session. Only patches if missing, then lets OpenClaw hot reload handle the interval change.
Collaborator
Author
|
Superseded by #282 which includes these fixes in a clean single commit alongside the autoresearch flow scripts. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #280
Summary
Fixes 4 interrelated heartbeat timing bugs discovered by pi-autoresearch automated user-flow validation (44 → 90/90 steps passing).
Changes
Bug 1:
SyncAgentBaseURLresets heartbeat on everyobol sell httpAdded
patchHeartbeatAfterSync()+ idempotency (skip sync when URL unchanged).Bug 2: Heartbeat not activated after
obol agent initAdded
ensureHeartbeatActive()— reads ConfigMap, patchesevery: "5m"if missing.Bug 3: Chokidar misses ConfigMap symlink swaps
K8s updates ConfigMaps via symlink swap. Chokidar inotify doesn't detect this. Added
rollout restartafter patch for deterministic behavior.Bug 4: Unnecessary pod restart removed
Heartbeat patch was restarting the pod when hot-reload should handle it. Removed.
Test plan
obol sell http→ heartbeat fires within 5 min → ServiceOffer Readyobol stack up && obol sell http test --wallet ... --chain base-sepolia --per-request 0.001 --namespace llm --upstream ollama --port 11434→ checkobol sell status test -n llmwithin 5 min