ADR-023: pi-agent harness and wasmtime/c2w containers #100

Draft

wmeddie wants to merge 17 commits into main from spike/pi-harness

Conversation

wmeddie commented Apr 20, 2026

Summary

Draft PR for ADR-023, which sets the direction for a clean re-do of the container-isolation work after spike/wanix-agents (PR #98) drifted out of scope.

The ADR reframes Xpressclaw as an explicit meta-harness, locks in wasmtime + container2wasm as the sole container runtime (embedded like llama-cpp), removes Docker support entirely (supersedes ADR-003 in full), distributes harness images via GHCR, and makes the sidecar the single LLM endpoint for every harness so budget-aware transparent downgrade to local inference is architectural, not per-harness.

V1 ships only two harnesses — PiHarness (new) and a retrofitted built-in Claude-SDK harness — with seven concrete MVP exit criteria so the re-do has an anti-drift lever.

What this PR is not

Key decisions captured

  • WASM runtime: wasmtime (c2w's first-class host, Rust-native, WASIp2 stable, epoch interruption as the rollback primitive)
  • Container runtime: container2wasm (c2w), no Docker fallback
  • Image distribution: GHCR (ghcr.io/xpressai/harnesses/*)
  • LLM routing: all harnesses talk to the sidecar only; sidecar owns provider keys and budget routing
  • First-class harnesses in V1: pi + retrofitted built-in; codex/opencode deferred
  • Supersedes ADR-003 in full; touches ADR-002, ADR-005, ADR-010, ADR-013, ADR-017

Review focus

  • Do the MVP exit criteria (7 items) actually capture "direction validated"?
  • Is there anything I should have made a non-negotiable that's listed as an open question, or vice versa?
  • Pi's license posture and the shell-bridge streaming question are flagged as open — if you already know the answers, tell me and I'll fold them in before merging.

Test plan

  • Read ADR end-to-end
  • Confirm MVP criteria are the right proof points
  • Sanity-check "supersedes ADR-003 in full" against any Docker-dependent workflows we currently support

wmeddie added 2 commits April 20, 2026 18:44
Establishes the direction for the next spike attempt after
spike/wanix-agents drifted out of scope. Xpressclaw is reframed as an
explicit meta-harness, Docker is removed in full (supersedes ADR-003),
c2w on wasmtime becomes the sole container runtime embedded like
llama-cpp, GHCR is the image distribution channel, and the sidecar
becomes the single LLM endpoint for every harness so budget-aware
transparent downgrade to local inference is an architectural property.
V1 scope is two harnesses (pi + retrofitted built-in) and seven MVP
exit criteria, so we have something to hold the next attempt
accountable to.
wmeddie added 2 commits April 20, 2026 19:19
First slice of the ADR-023 meta-harness refactor. Defines the `Harness`
trait in `xpressclaw-core::harness`, implements it for the existing
`DockerManager` so the rest of the codebase gains a trait-object path
without any behavior change, and exposes an `AppState::harness()`
accessor alongside the existing `AppState::docker()`.

The trait is deliberately narrow — only the lifecycle + endpoint +
observability surface the reconciler, task dispatcher, and message
processor actually consume. Image management, snapshotting, and
tmux-attach ship in follow-up commits (tasks 2/3/8/9).

No callers migrated yet. This is scaffolding; subsequent commits in
this task swap `AppState::docker()` users over to
`AppState::harness()` one module at a time.
Replaces two direct DockerManager::connect() sites with trait-object
calls via state.harness() and endpoint_port(agent_id). Removes two
per-request fresh Docker connections — the harness() accessor reuses
the shared cached connection.

This also proves the Harness trait in hot paths (message streaming +
cancel) before landing the c2w implementation. Reconciler, task
dispatcher, and apps routes stay on DockerManager for now; they
migrate once C2wHarness exists and the switch has architectural
payoff instead of being a mechanical rename.
Embeds wasmtime 27 + wasmtime-wasi as workspace deps and adds a new
xpressclaw_core::c2w module with the low-level runtime primitive that
task 3 (C2wHarness) will compose with.

C2wRuntime wraps wasmtime Engine with:
- async_support for tokio-driven guests
- epoch_interruption enabled; a background Tokio task ticks the
  engine epoch counter every 50ms so guests can be deadline-aborted
  (this is the rollback primitive)
- backtrace capture for diagnostics
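
The tick-driver pattern above can be sketched with std primitives alone. This is a minimal stand-in, not the actual `C2wRuntime` API: an `AtomicU64` plays the role of wasmtime's engine epoch counter, and the `EpochTicker` name is illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Illustrative stand-in for wasmtime's engine epoch counter.
struct EpochTicker {
    epoch: Arc<AtomicU64>,
}

impl EpochTicker {
    /// Spawn a background thread that bumps the counter every `period`,
    /// mirroring the 50ms tick task described in the commit.
    fn start(period: Duration) -> Self {
        let epoch = Arc::new(AtomicU64::new(0));
        let e = Arc::clone(&epoch);
        thread::spawn(move || loop {
            thread::sleep(period);
            // Real code calls engine.increment_epoch() here.
            e.fetch_add(1, Ordering::SeqCst);
        });
        EpochTicker { epoch }
    }

    fn current(&self) -> u64 {
        self.epoch.load(Ordering::SeqCst)
    }
}

fn main() {
    let ticker = EpochTicker::start(Duration::from_millis(5));
    thread::sleep(Duration::from_millis(50));
    // A guest whose deadline is at or below current() would be interrupted by now.
    assert!(ticker.current() > 0);
    println!("epoch = {}", ticker.current());
}
```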

InstanceSpec + C2wInstance model the per-agent guest: env, preopened
dirs (for the host-guest filesystem bridge), WASI args, optional
epoch deadline. MVP exposes run_to_completion only — long-lived
instances with an HTTP endpoint land in task 3.

Target WASI version is preview 1 because c2w emits preview-1 modules
today. If c2w adopts preview 2 we switch; the module layering is
designed to isolate that choice.

Tests: unit test confirms the runtime constructs and rejects bogus
module bytes. End-to-end (running an actual c2w-compiled image)
requires having an image built, which lands in task 3.
Adds `C2wHarness` — a `Harness` implementation backed by the
`C2wRuntime` primitive from task 2 — and a `xpressclaw c2w-smoke`
subcommand that exercises the full lifecycle on a real machine.

C2wHarness scope:
- Per-agent tokio tasks drive each guest to completion (or forever,
  for long-lived harnesses).
- Lifecycle: launch/stop/stop_all/list/is_running/uptime_secs work.
- `ContainerSpec.image` is interpreted as a filesystem path to a
  prebuilt WASM module — GHCR pulling + OCI-to-WASM conversion ships
  in task 4 without changing call sites.
- Stop aborts the driver task directly (relying on the epoch deadline
  alone would add up to 50ms of latency).

Deferred to follow-up commits in this series:
- Real stdout/stderr capture into the returned Harness::logs buffer.
  Right now guests inherit the process stdio; `logs()` returns empty
  strings. Requires threading MemoryOutputPipe through C2wInstance;
  minor refactor of the primitive's API so lands cleanly as 3b.
- Host-side port exposure (endpoint_port returns None). Requires
  wasi-sockets plumbing that only becomes real with a HTTP-serving
  harness image — ships with task 4 alongside PiHarness.

The `xpressclaw c2w-smoke` subcommand builds a noop WASI guest from
embedded WAT and launches it through C2wHarness end-to-end. This is
the first thing a developer can actually run to verify the runtime
works on their machine. Subcommand is intentionally temporary — task 4
adds `xpressclaw harness add/run` that subsumes this with a real
workload.

Non-obvious bug fixed while writing this: wasmtime's epoch deadline
is absolute, not relative, and the background tick driver advances the
engine's counter from runtime construction. A store-default deadline
of 0 therefore traps the first time the counter ticks past 0 —
basically immediately. Fixed by setting a huge default deadline
(u64::MAX/2 ≈ 14 billion years at 50ms/tick); caller-specified
deadlines land as a follow-up when snapshot/rollback (task 8) needs
precise per-step budgets.
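
The "14 billion years" figure checks out arithmetically; a quick sketch, pure arithmetic with no wasmtime involved:

```rust
fn main() {
    // Default deadline chosen to dodge the absolute-deadline trap.
    let deadline_ticks: u64 = u64::MAX / 2;
    // The background driver advances the counter every 50ms.
    let seconds = deadline_ticks as f64 * 0.050;
    let years = seconds / (365.25 * 24.0 * 3600.0);
    // ~1.46e10 years, i.e. on the order of 14 billion.
    assert!(years > 1.0e10 && years < 2.0e10);
    println!("{:.2e} years", years);
}
```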

Tests: c2w unit test (runtime smoke) + harness unit test
(launch-and-run noop wasm) both pass. Manual verification via the
c2w-smoke CLI also passes end-to-end on macOS.
Adds the pi-agent harness layer on top of C2wHarness. Pi-specific
conventions baked into every launch:

- Per-agent workspace dir created on the host and preopened into the
  guest at /workspace.
- OPENAI_API_BASE env var defaulted to http://127.0.0.1:8935/v1 so
  pi-compiled guests hit the xpressclaw sidecar (task 6 makes this
  endpoint real).
- XCLAW_SOCKET env var defaulted to /run/xclaw.sock so pi's shell
  verbs know where to find the bridge (task 5 wires it).
- Caller-provided env/volumes win on conflict — PiHarness only fills
  in defaults.

HarnessImageResolver is scaffolded with a file-path-only resolve().
OCI refs (ghcr.io/xpressai/harnesses/pi:tag) return a clear error
that names task 4b. That task is in the tracker; resolver interface
is stable so the eventual fill-in doesn't churn call sites.

Two unit tests cover the pi launch path and the OCI-stub error. A
new `xpressclaw pi-smoke` CLI subcommand runs the whole layer
end-to-end with a noop WASM guest — creates the workspace, seeds
env, launches, waits for exit, cleans up. Companion to c2w-smoke;
both get removed when task 5/6 deliver the real pi flow.
Implements the xclaw shell bridge that lets non-MCP harnesses (pi and
friends) talk to xpressclaw as if they were using shell commands
instead of MCP tools (ADR-023 §7).

Wire format (xpressclaw_core::xclaw):
- Newline-delimited JSON over a Unix socket. One connection = one
  request, one response, close. Debuggable with socat/nc.
- Verbs are dot-separated (memory.add, memory.list, version); args
  are a flat JSON object; agent_id rides along for per-agent
  attribution.
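
One request line under that wire format could look like the following. This is a hedged, std-only sketch: the real crate presumably serializes with serde, and exact field ordering is an assumption; only the field names (`verb`, `args`, `agent_id`) come from the description above.

```rust
/// Build one xclaw request line (illustrative; the real client likely
/// serializes with serde rather than format!).
fn request_line(verb: &str, args_json: &str, agent_id: &str) -> String {
    format!(
        "{{\"verb\":\"{}\",\"args\":{},\"agent_id\":\"{}\"}}\n",
        verb, args_json, agent_id
    )
}

fn main() {
    let line = request_line("memory.add", "{\"content\":\"hi\"}", "agent-1");
    // One request per line; the connection closes after the response.
    assert!(line.ends_with('\n'));
    assert!(line.contains("\"verb\":\"memory.add\""));
    print!("{line}");
}
```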

Server listener (xpressclaw_server::xclaw_bridge):
- Bound at <data_dir>/run/xclaw.sock during server startup.
- Current verbs: version, memory.add, memory.list. More verbs
  (task.*, budget, log, ask) follow the same shape and drop in
  without protocol changes.
- Unix-only; on Windows the start() fn logs a warning and no-ops
  (WASM guests don't care about host OS, only the host socket does).

Client (xpressclaw-cli produces a second binary `xclaw`):
- argv → verb+args parser: `xclaw memory add --content "hi" --tags a,b`
  becomes {verb: "memory.add", args: {content, tags}}.
- Reads XCLAW_SOCKET + XCLAW_AGENT_ID env vars (set by PiHarness
  from task 4's constants).
- Exit codes: 0=success, 1=server-side verb failed, 2=transport,
  3=usage. Scripts in guests can branch on these.
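
The argv-to-verb mapping and the usage exit code could be approximated as below. A sketch only: function and error shapes are hypothetical, not the real xpressclaw-cli parser.

```rust
/// Map `xclaw memory add --content hi` style argv onto (verb, flag args).
/// Illustrative only -- the real parser lives in xpressclaw-cli.
fn parse_argv(argv: &[&str]) -> Result<(String, Vec<(String, String)>), i32> {
    let mut words = Vec::new();
    let mut args = Vec::new();
    let mut i = 0;
    while i < argv.len() {
        if let Some(key) = argv[i].strip_prefix("--") {
            // Flag value is the next token; a missing value is a usage error (exit 3).
            let val = argv.get(i + 1).ok_or(3)?;
            args.push((key.to_string(), val.to_string()));
            i += 2;
        } else {
            words.push(argv[i]);
            i += 1;
        }
    }
    if words.is_empty() {
        return Err(3); // usage error
    }
    // Dot-separated verb: ["memory", "add"] -> "memory.add"
    Ok((words.join("."), args))
}

fn main() {
    let (verb, args) =
        parse_argv(&["memory", "add", "--content", "hi", "--tags", "a,b"]).unwrap();
    assert_eq!(verb, "memory.add");
    assert_eq!(args[0].0, "content");
    assert_eq!(args[0].1, "hi");
    assert_eq!(parse_argv(&[]).unwrap_err(), 3);
}
```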

Tests: protocol roundtrip in core, transport roundtrip in server
(real UnixStream + UnixListener), argv parser in the client. No
end-to-end smoke CLI yet — manually runnable today by starting
`xpressclaw up` and invoking `XCLAW_SOCKET=<data>/run/xclaw.sock
xclaw version` from another terminal; add an automated smoke once
PiHarness (task 4) mounts the socket into a live guest.

Verb coverage is deliberately narrow for this commit. Remaining
verbs in ADR-023 §7 (task.create/update/status/list, budget, log,
ask) follow the same dispatch pattern and land in follow-up commits
as each verb's backing API is needed.
…023 task 6)

Extends the existing OpenAI-compatible endpoint at /v1/chat/completions
into the single LLM entry point every harness talks to per ADR-023 §6.
The endpoint already had agent_id extraction and degraded-model
override; this commit fills the remaining gaps:

- Hard-stop enforcement. If the agent is paused or `on_exceeded: stop`
  over limit, the request is refused with HTTP 429. `alert` mode is
  logged but lets the request through.
- Streaming support. `{"stream": true}` routes through chat_stream()
  and returns OpenAI-style SSE with chunks passed through unmodified
  and a terminal `data: [DONE]`.
- Token usage recording. On non-streaming completion, pulls the
  `usage` field off the provider response and writes a `usage_logs`
  row via `CostTracker`, then updates `BudgetManager` spend. On
  streaming, tokens are counted approximately (chars/4 for output,
  0 for prompt) since `ChatCompletionChunk` doesn't carry usage;
  proper streaming accounting is follow-up task 11.
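
The interim streaming accounting (chars/4 for output, 0 for prompt) is simple enough to sketch directly; the function name here is illustrative:

```rust
/// Rough token accounting for streamed output, per the commit: ~4 chars
/// per token for output, prompt tokens unknown and counted as 0.
fn approx_stream_usage(streamed_text: &str) -> (u64, u64) {
    let output_tokens = (streamed_text.chars().count() as u64) / 4;
    (0, output_tokens) // (prompt_tokens, completion_tokens)
}

fn main() {
    let (prompt, completion) = approx_stream_usage("hello world, streamed!");
    assert_eq!(prompt, 0);
    assert_eq!(completion, 22 / 4); // 22 chars -> 5 approximate tokens
    println!("~{completion} output tokens");
}
```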

Three new unit tests use a canned in-crate provider:
- records_usage_for_agent: row lands in usage_logs after a request
- works_without_auth: still 200, no usage recorded
- honors_degraded_model_override: seeded degraded_model in budget_state
  makes the provider see `"local"` instead of the caller's
  `"canned-model"` — the ADR-023 §6 "transparent downgrade" promise.
Completes ADR-023 decision 4. Docker is no longer a dependency; agent
workloads run on wasmtime + c2w exclusively.

## What was deleted

- `crates/xpressclaw-core/src/docker/` module (manager, images, bollard
  plumbing) — 984 lines gone.
- `crates/xpressclaw-core/src/runtime.rs` — dead-code orchestrator (no
  external callers).
- `bollard` workspace + core dependency.
- `impl Harness for DockerManager` scaffolding from harness/mod.rs.
- `AppState::docker()` accessor and the docker field on AppState.
- Docker check in `xpressclaw init` and `xpressclaw up`.
- Graceful-shutdown Docker pass in server.rs (replaced with
  `harness.stop_all()`).
- `impl From<bollard::errors::Error> for Error`.

## What stayed but got re-wired

- `ContainerSpec` / `ContainerInfo` / `VolumeMount` types moved from
  `docker/manager.rs` to `harness/types.rs`. They describe the generic
  launch contract and have no Docker specifics.
- `AppState::harness()` now returns the stored `Arc<dyn Harness>`
  directly instead of wrapping DockerManager.
- Server routes (agents, conversations) use `state.harness()` for
  live status / port lookup / stop / logs.

## What got stubbed pending follow-up

- `agents::reconciler` retains only its Ollama-model reconciliation.
  Agent container launching / orphan-task requeuing is paused until
  task 10 (GHCR pull) lands the real launch path.
- `tasks::dispatcher::load_task` early-returns Requeue — the remainder
  of the state machine needs an Arc<dyn Harness> handle which lands
  with task 10.
- `routes::apps` — app-container endpoints (launch / logs / proxy)
  return 503. Agent-app containers need their own ADR post-spike;
  out of scope here.
- `routes::setup::check_docker` and `start_docker` kept as
  compatibility stubs that report `removed: true`. Task 12 rips the
  frontend setup-wizard step out.

## ADR / docs

- ADR-003 marked "Superseded by ADR-023".
- CLAUDE.md updated: container runtime is now "wasmtime +
  container2wasm", not Docker.

## Scope boundaries (deliberately unchanged)

- Agent session / message-processor paths: already trait-object via
  `AppState::harness()` from task 1's migration; they gracefully
  handle `None` so the spike branch still compiles without real
  agents running.
- LLM sidecar (/v1/chat/completions): untouched — task 6 work stands.
- xclaw bridge: untouched — task 5 work stands.

## Diff summary

-2860 / +203 = net -2657 lines.
All 327 core + 53 server library tests pass. Workspace builds clean
(clippy: no errors; a few warnings about now-unused args on the
stubbed app/dispatcher paths — intentionally left with `_` prefixes
so they're easy to re-wire in task 10).
Adds the rollback-on-failure plumbing that backs MVP criterion 7 of
ADR-023 ("rogue `rm -rf /` → automatic rollback, host unaffected").

## Trait surface

`Harness` gains three new methods, all with sensible defaults:

- `snapshot(agent_id) -> SnapshotId` — capture the guest's persistent
  state so a future `restore` can roll it back.
- `restore(agent_id, &SnapshotId)` — revert persistent state to a
  prior snapshot.
- `delete_snapshot(&SnapshotId)` — free the snapshot's backing
  storage.

The default implementations return "not supported" for `snapshot` and
`restore` and no-op for `delete_snapshot`. Harnesses that can persist
guest state override as needed.
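
The default-method shape described above can be sketched as follows. Types are simplified (plain `String` errors, `SnapshotId` as a string alias); the real trait carries more methods and richer error types.

```rust
/// Simplified sketch of the snapshot surface; not the real trait.
type SnapshotId = String;

trait Harness {
    fn snapshot(&self, _agent_id: &str) -> Result<SnapshotId, String> {
        Err("snapshot not supported by this harness".into())
    }
    fn restore(&self, _agent_id: &str, _id: &SnapshotId) -> Result<(), String> {
        Err("restore not supported by this harness".into())
    }
    fn delete_snapshot(&self, _id: &SnapshotId) -> Result<(), String> {
        Ok(()) // default is a no-op
    }
}

/// A harness with no persistent guest state keeps all the defaults.
struct NoPersistence;
impl Harness for NoPersistence {}

fn main() {
    let h = NoPersistence;
    assert!(h.snapshot("agent-1").is_err());
    assert!(h.restore("agent-1", &"snap".to_string()).is_err());
    assert!(h.delete_snapshot(&"snap".to_string()).is_ok());
}
```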

## C2wHarness implementation

Tracks the preopen list per running agent; `snapshot` copies each
preopened directory to `<cache_dir>/snapshots/<uuid>/<index>/`.
`restore` stops-and-replaces the original directories from the
snapshot copy. `delete_snapshot` rm -rfs the backing dir.
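
The copy step can be sketched with a std-only recursive directory copy. Assumptions are labeled in comments: this is not the C2wHarness code, just a runnable model of the snapshot/restore roundtrip with symlinks skipped.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy a preopened directory into a snapshot slot.
/// Sketch only: regular files + directories, symlinks skipped.
fn copy_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let ty = entry.file_type()?;
        let to = dst.join(entry.file_name());
        if ty.is_dir() {
            copy_dir(&entry.path(), &to)?;
        } else if ty.is_file() {
            fs::copy(entry.path(), &to)?;
        }
        // Symlinks fall through: neither followed nor copied.
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let base = std::env::temp_dir().join("xclaw-snap-demo");
    let (src, snap) = (base.join("workspace"), base.join("snapshots/0"));
    let _ = fs::remove_dir_all(&base);
    fs::create_dir_all(src.join("sub"))?;
    fs::write(src.join("sub/file.txt"), b"state")?;
    copy_dir(&src, &snap)?; // snapshot
    fs::write(src.join("sub/file.txt"), b"clobbered")?; // rogue mutation
    // restore = stop-and-replace the original dir from the snapshot copy
    fs::remove_dir_all(&src)?;
    copy_dir(&snap, &src)?;
    assert_eq!(fs::read(src.join("sub/file.txt"))?, b"state");
    Ok(())
}
```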

Scope honestly reflected in the code:
- Snapshot covers filesystem (preopens), not in-flight WASM memory
  or tmux session pty. That matches the ADR's "drop the `Store` and
  restart" model — the guest is expected to be re-instantiated
  after restore.
- `restore` reverts filesystem but doesn't re-launch the guest —
  the task-dispatcher caller (task 10) decides whether to stop/start.
- Symlinks inside a preopen aren't followed during copy; snapshots
  contain regular files + directory structure.

## Test + CLI smoke

New unit test `snapshot_and_restore_roundtrip_workspace` seeds a file
in a preopened dir, snapshots, mutates, restores, and asserts the
mutations are reverted.

New `xpressclaw rollback-smoke` subcommand runs the same flow from
the CLI — launches a c2w guest, simulates a rogue tool call
rewriting the workspace, restores the snapshot, and verifies the
filesystem is reverted. Prints a step-by-step narration that ends
with `Smoke test passed.` on success.

This is the seventh MVP exit criterion made runnable. Wiring into
the real task dispatcher (pre-step snapshot, on-failure restore)
lands when task 10 gives the dispatcher an `Arc<dyn Harness>` handle.
…k 9)

Surfaces two ADR-023 features in the conversation page: the
transparent budget downgrade from task 6 becomes visible, and the
tmux-attach entry point from a future pi harness has its UI slot
reserved.

## Backend

- `Harness::attach_tmux(agent_id)` added with default `None`. Concrete
  harnesses (pi, future shell-native backends) override it to return
  their tmux session descriptor.
- New `TmuxAttach { session_name, socket_path }` type.
- `GET /api/agents/:id/tmux` returns `{ available, session? }`.
- `GET /api/budget/:id` now includes `degraded_model` and `is_paused`
  alongside the existing `BudgetSummary` fields so the UI can render
  the downgrade chip without a second call.

## Frontend

- New API types `AgentBudgetState` and `AgentTmuxStatus` in `$lib/api`.
- `budget.agent(id)` + `agentHarness.tmux(id)` helpers.
- Conversation page (`routes/conversations/[id]/+page.svelte`):
  - Fetches budget + tmux state for `primaryAgent` reactively on
    navigation; refreshes when the primary agent changes.
  - **Downgrade chip** (amber border, `🪫`-style icon, "running on
    <model> (budget)") renders in the agent-status row when the
    sidecar has swapped in a local fallback. Hidden otherwise.
  - **Tmux attach button** renders in the header's icon cluster only
    when the harness advertises `available: true`. Currently hidden
    (no harness exposes tmux yet); wired to a stub click handler
    pending xterm.js integration alongside the first real
    tmux-exposing harness.

## What's deliberately not in this commit

- xterm.js integration + the WebSocket terminal stream. That lands
  with the first tmux-exposing harness (pi, via task 10's real
  agent flow) — doing it now would be speculative plumbing against
  a missing backend.
- The `attach_tmux` override for `PiHarness`. Pi images running
  under c2w don't have a host-visible tmux socket until task 10
  wires the socket preopen through; adding the override now without
  the socket path would be fiction.

## Tests

- `svelte-check`: 0 errors, 115 warnings (all pre-existing).
- `cargo test -p xpressclaw-server --features metal --lib`:
  53 pass (no regressions).
- clippy + rustfmt clean.

Covers MVP criterion 4 UX surface + criterion 6 UX surface from
ADR-023. The signals are wired end-to-end: a user watching a
conversation sees the downgrade the instant the sidecar triggers it.
…ADR-023 task 10 phase 1)

Makes the desktop app runnable end-to-end on a machine with zero pi
harness image available, so the spike's whole stack can be smoke-tested
in the real UI. Real GHCR OCI pull is task 10 phase 2 (owed once a pi
WASM is published); until then, agents launch against a bundled noop
harness that ships in the binary.

## Changes

**`HarnessImageResolver::with_fallback`** — new constructor. When the
image ref doesn't resolve to a local file, writes the bundled noop
WASM into the cache dir and returns that path. Bundled WAT is compiled
to WASM on first use via `wat::parse_str` (moved from dev-dep to
regular dep). Old `::new()` constructor keeps the strict "file-only,
else error" behavior for tests + the `pi-smoke` CLI.
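
The fallback semantics could look roughly like this. Everything here is hypothetical shape, not the real resolver: `BUNDLED_NOOP` stands in for the WAT-compiled noop module, and the cache filename is invented for illustration.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Stand-in for the embedded noop module; just the 8-byte wasm
/// magic + version header, for illustration only.
const BUNDLED_NOOP: &[u8] = b"\0asm\x01\0\0\0";

/// Sketch of the with_fallback resolution: local file wins,
/// otherwise the bundled noop is materialized into the cache dir.
fn resolve_with_fallback(image_ref: &str, cache_dir: &Path) -> std::io::Result<PathBuf> {
    let candidate = Path::new(image_ref);
    if candidate.is_file() {
        return Ok(candidate.to_path_buf());
    }
    fs::create_dir_all(cache_dir)?;
    let fallback = cache_dir.join("bundled-noop.wasm"); // hypothetical name
    if !fallback.exists() {
        fs::write(&fallback, BUNDLED_NOOP)?;
    }
    Ok(fallback)
}

fn main() -> std::io::Result<()> {
    let cache = std::env::temp_dir().join("xclaw-harness-cache-demo");
    let _ = fs::remove_dir_all(&cache);
    let path = resolve_with_fallback("ghcr.io/xpressai/harnesses/pi:dev", &cache)?;
    assert!(path.ends_with("bundled-noop.wasm"));
    assert!(path.is_file());
    Ok(())
}
```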

**`AgentConfig::image: Option<String>`** — new optional field on the
config struct. Users can set a local `.wasm` path for development or
an OCI ref for production (once OCI lands). Missing / None → bundled
fallback. Serde default keeps existing configs compatible.

**`AppState::set_harness`** — setter for installing the harness at
runtime. Called once from `server::serve()` with
`PiHarness { C2wHarness { C2wRuntime }, HarnessImageResolver::with_fallback, <data>/workspaces }`.
Harness directory tree is `<data>/harness-cache/` + `<data>/workspaces/`;
both are created on startup.

**`agents::reconciler::start`** restored with a real
`reconcile_agents` that calls `harness.launch()` for agents with
`desired_status=running` and `harness.stop()` for stopped agents.
Errors land in the agent record's `AgentStatus::Error` — visible in
the UI. Ollama model reconciliation (unchanged since task 7) still
runs alongside.

**Reconciler signature gains `harness: Option<Arc<dyn Harness>>`** —
passed in from `server::serve()` via `state.harness().await` so the
reconciler and routes share one harness. When wasmtime init fails, the
harness arg is `None` and the agent loop logs once and skips; the
server stays up.

## What a user can now do in the desktop app

1. `xpressclaw init && xpressclaw up` (or launch the Tauri bundle).
2. Go through setup.
3. Create an agent — leave the image blank or set it to any path; anything
   that doesn't resolve falls back to the bundled noop harness.
4. Agent enters `starting` → `running` via the reconciler within ≤10s.
5. Send a chat message. The conversation page falls through to the
   LLM router (no endpoint port on the noop guest), so if a provider
   is configured (OpenAI/Anthropic/local/ollama) the response streams
   from there.
6. Budget tracking, transparent downgrade, xclaw verbs, memory, tasks
   — all work since they're server-side.

## Known limitations deliberately unfixed

- Agents don't self-respond; the noop guest has no HarnessClient
  endpoint. Fills in with a real pi image via task 10 phase 2.
- Task dispatcher still early-returns Requeue because it doesn't have
  `Arc<dyn Harness>` in scope; threading that through the dispatcher's
  state machine lands with the same commit that makes pi actually
  respond (task 10 phase 2).
- Agent edit/create UI in the frontend doesn't expose the `image`
  field — users configuring non-default images edit the YAML. UI
  surface is task-12-phase-C polish.

## Tests

- 4 harness unit tests pass, including the new
  `resolver_with_fallback_materializes_bundled_wasm`.
- 327 core + 53 server library tests pass (no regressions).
- clippy + rustfmt clean.
Adds a harness that can actually *respond* when the user sends a chat
message, so the desktop app demonstrates the agent → harness → LLM →
response loop honestly. Installed at server startup in place of the
previous bundled-WASM-noop path.

## What it does

`EchoHarness` implements the `Harness` trait. On `launch(agent_id)`:

1. Binds `127.0.0.1:0` — the OS picks an unused port per agent.
2. Spawns a Tokio task serving axum's `/v1/chat/completions` on that
   port (both streaming and non-streaming).
3. `endpoint_port(agent_id)` returns the bound port, so the
   conversations processor connects the real HarnessClient instead of
   falling back to the LlmRouter.
4. Each request prepends a pinned system-prompt banner identifying
   the harness and forwards through `LlmRouter::chat_stream` —
   responses come from whatever provider is configured (cloud or
   local) but visibly route through the harness first.

`stop(agent_id)` aborts the per-agent task; `stop_all` iterates; list
/ is_running / uptime_secs report live state.
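
Step 1's OS-assigned-port trick is standard and easy to demonstrate with std alone:

```rust
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Binding port 0 asks the OS for any unused port; the harness then
    // reports the assigned port via endpoint_port(agent_id).
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    assert!(port != 0);
    println!("per-agent endpoint on 127.0.0.1:{port}");
    Ok(())
}
```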

## Why in-process and not WASM

A WASM guest listening on a host-reachable TCP port requires one of:
- wasmtime-wasi preview 2 + `wasi:sockets` wired in `C2wInstance`;
  forces every c2w guest (and the future real pi image) to be a
  preview-2 module.
- wasmtime-wasi preview 1 + a host shim that backs a preopen FD with
  a real socket; works with today's c2w but is custom plumbing.

Both are real work and only pay off once a real pi-as-c2w image
exists on GHCR (task 10 phase 2). Until then EchoHarness lives behind
the same `Harness` trait + `AppState::harness()` surface, so the
swap to a WASM harness is a one-line change in `server::serve()` the
moment the images are ready.

## What users see now

1. Launch the desktop app.
2. Configure an agent.
3. Reconciler picks up `desired_status=running` and calls
   `EchoHarness::launch`; the agent gets a real host port and status
   goes to `running`.
4. Send a chat message. The conversation flow reaches the agent's
   per-agent HTTP server, which forwards through the LLM router.
5. The agent's reply starts with the harness banner — proof that the
   response flowed through the harness, not directly.
6. Budget tracking, transparent downgrade, xclaw bridge, memory,
   tasks: all continue working.

## Tests

- 3 new unit tests (banner prepend with + without existing system
  message, lifecycle roundtrip).
- 56 server library tests total (53 prior + 3 new).
- 327 core tests still pass. clippy + rustfmt clean.

## Known limitations

- EchoHarness is in-process and has full host access — *not* the
  sandboxed story ADR-023 promises. Clearly marked in the module
  doc + commit as a demo path; replaced by the c2w+PiHarness path
  once real images exist.
- logs() returns empty (harness logs via `tracing` into the server's
  log stream; no separate capture).
- Snapshot/restore aren't meaningful for an in-process harness;
  falls back to trait default (unsupported).
… task 10 phase 2)

Closes the "fake fallback only" limitation of the image resolver so
you can test the WASM-sandboxed harness against a real OCI registry
— GHCR in production, local podman during dev — before committing
to any merge.

## OCI pull

`HarnessImageResolver::resolve` now dispatches on ref shape:

- Filesystem path → use directly (dev path).
- `host[:port]/path[:tag]` → OCI artifact pull via `oci-client`.
- Else, with fallback enabled → materialize the bundled noop WASM.
- Else → error explaining what was expected.

The OCI path pulls the manifest, fetches the first layer's blob, and
caches it on disk keyed by manifest digest so retags/repulls are free.
Plain HTTP is auto-enabled when the registry is localhost /
127.0.0.1 / ::1, so `podman run -p 5000:5000 registry:2` works out of
the box without TLS ceremony.

Auth: reads `XPRESSCLAW_REGISTRY_TOKEN` from the environment and
sends it as `Bearer`; anonymous otherwise. `gh auth token` piped into
this env var is the simplest GHCR path once real images ship there.

New `is_local_registry` + `looks_like_oci_ref` helpers isolate the
heuristics, each with unit coverage.
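
Plausible shapes for those two helpers, sketched from the behavior described above; the real implementations may differ in detail (e.g. IPv6 handling):

```rust
/// Registries at loopback get plain HTTP instead of TLS.
fn is_local_registry(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "::1")
        || host.starts_with("localhost:")
        || host.starts_with("127.0.0.1:")
}

/// `host[:port]/path[:tag]` -- the first segment must look like a
/// registry host (a dot or a port colon) so plain file paths don't match.
fn looks_like_oci_ref(image_ref: &str) -> bool {
    match image_ref.split_once('/') {
        Some((host, path)) => !path.is_empty() && (host.contains('.') || host.contains(':')),
        None => false,
    }
}

fn main() {
    assert!(looks_like_oci_ref("ghcr.io/xpressai/harnesses/pi:tag"));
    assert!(looks_like_oci_ref("localhost:5000/pi:dev"));
    assert!(!looks_like_oci_ref("target/debug/pi.wasm"));
    assert!(is_local_registry("localhost:5000"));
    assert!(!is_local_registry("ghcr.io"));
}
```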

## Harness backend switch

`server::serve()` reads `XPRESSCLAW_HARNESS`:

- `echo` (default) — the in-process EchoHarness from the prior
  commit. No external dependencies, works out of the box.
- `pi` — installs `PiHarness` on `C2wRuntime`. Use with
  `XPRESSCLAW_HARNESS=pi xpressclaw up` when a WASM harness image
  is available. Falls back to echo if wasmtime init fails.

## Local podman test recipe

```
# Run a registry locally
podman run -d -p 5000:5000 --name registry registry:2

# Push a WASM blob (use the bundled noop for now, or your own pi build)
oras push localhost:5000/pi:dev pi.wasm

# Configure an agent with image: localhost:5000/pi:dev in xpressclaw.yaml

# Start xpressclaw with the pi harness
XPRESSCLAW_HARNESS=pi xpressclaw up
```

`PiHarness::launch` → resolver pulls from the local registry, caches
under `<data>/harness-cache/sha256-<digest>.wasm`, `C2wHarness`
instantiates it on wasmtime. Agent status goes to `running`.

## Known limitations

Today the only WASM you can realistically push is something that
exits immediately (like the bundled noop). A real "serves HTTP on a
host-reachable port" harness still needs wasi-sockets wiring in
`C2wInstance` (wasmtime-wasi preview 2 switch) — that's separate
work. This commit proves the image-delivery pipeline works; the
content-that-actually-responds pipeline is the next piece.

## Tests

- 7 harness unit tests pass (3 new: `oci_ref_heuristic`,
  `local_registry_detection`, plus the two renamed resolver tests).
- 56 server library tests pass.
- 327 core tests pass.
- clippy + fmt clean.
Extends build.sh with an opt-out push step matching the rest of the
script's convention (like --skip-docker / --skip-test). Runs by
default; pass --skip-push to skip.

- `build.sh --pi-image=<ref>` or `XCLAW_PI_IMAGE=...` overrides the
  target. Default is `localhost:5000/pi:dev` so `podman run -d -p
  5000:5000 --name xclaw-registry registry:2` is enough setup.
- Skips gracefully with a one-line message when `oras` isn't
  installed or no registry responds at the ref's host — matches how
  --skip-docker behaves when `docker` is absent. Doesn't hard-fail.
- Pushes as an OCI artifact with media type
  `application/vnd.xpressclaw.harness.wasm+v1` so future harness
  types (codex, opencode) can share a repo and differentiate by
  media type.

Adds a `xpressclaw write-bundled-wasm <path>` CLI subcommand that
materializes the bundled noop WASM to a given file, so the push
step doesn't need `wat2wasm`/wabt on the host — it just asks the
already-built CLI to dump its bundled wasm.

Once a real pi image is being compiled via c2w, swap
write-bundled-wasm for a real build command and this same push
step handles it.
Before this, EchoHarness was a dumb proxy — it forwarded chat
completion requests to the LLM router without injecting available MCP
tools or handling tool_calls in the response. The model noticed the
absence and started *narrating* tool calls in prose
(`search_memory("user preferences")` as code-block text), which the
user flagged: tools weren't actually being executed.

Root cause: the agent loop lived in the Docker-era harness container
(claude-agent-sdk did tool dispatch internally). EchoHarness replaced
that with nothing. Fix: do the loop in EchoHarness.

## What the handler does now

1. Prepend the harness banner to the system prompt (unchanged).
2. If the caller didn't set `tools`, inject all MCP tool schemas from
   the shared `McpManager` so the LLM knows what's callable.
3. Call the LLM non-streaming. Up to `MAX_TOOL_TURNS` (20) times:
   - If the response has `tool_calls`, append the assistant message
     to history, execute each call via `McpManager::call_tool`, then
     append `tool`-role messages with the flattened text of each
     result. Loop.
   - If no `tool_calls`, this is the terminal turn.
4. For the terminal turn: if the caller asked for streaming,
   re-invoke the LLM in streaming mode with the full accumulated
   history and stream the output. Otherwise return the JSON as-is.
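
The bounded loop above can be sketched as follows. `Turn` and the two closures are stand-ins for the LLM response shape and `McpManager::call_tool`; only `MAX_TOOL_TURNS = 20` and the overall control flow come from the commit.

```rust
/// Maximum tool-using turns before the handler gives up, per the commit.
const MAX_TOOL_TURNS: usize = 20;

/// Stand-in for a chat completion: either tool calls or a final answer.
enum Turn {
    ToolCalls(Vec<String>), // names of tools the model asked to run
    Final(String),          // terminal answer for the user
}

fn run_agent_loop(
    mut llm: impl FnMut(&[String]) -> Turn,
    mut call_tool: impl FnMut(&str) -> String,
) -> Result<String, &'static str> {
    let mut history: Vec<String> = vec!["system: harness banner".into()];
    for _ in 0..MAX_TOOL_TURNS {
        match llm(&history) {
            Turn::Final(answer) => return Ok(answer),
            Turn::ToolCalls(calls) => {
                for name in calls {
                    // Tool turns stay internal; only appended to history,
                    // never surfaced to the user as chat messages.
                    let result = call_tool(&name);
                    history.push(format!("tool {name}: {result}"));
                }
            }
        }
    }
    Err("tool-turn budget exceeded") // maps to the HTTP 507 described above
}

fn main() {
    let mut turns = 0;
    let answer = run_agent_loop(
        |_history: &[String]| {
            turns += 1;
            if turns == 1 {
                Turn::ToolCalls(vec!["search_memory".into()])
            } else {
                Turn::Final("done".into())
            }
        },
        |name| format!("{name} ok"),
    );
    assert_eq!(answer.unwrap(), "done");
}
```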

Tool-using turns never surface as chat messages to the user — they're
internal. The user sees the final answer with tools having been
invoked behind the scenes. Matches the claude-agent-sdk behavior
xpressclaw was originally designed around.

## Wire changes

- `EchoHarness::new(router, mcp_manager)` — added second arg.
- `EchoHandlerState` gains `mcp_manager: Arc<McpManager>`.
- Server startup: pass `state.mcp_manager.clone()` when constructing.
- New helper: `format_tool_result(&McpToolResult) -> String` flattens
  text/image/resource blocks for tool-role messages.

## Limits

- `MAX_TOOL_TURNS = 20` — generous for normal multi-step tasks,
  tight enough to fail fast on runaway loops. Exceeding it returns
  HTTP 507 with a diagnostic.
- Image results render as `[image: <mime>]` placeholder text; binary
  data is dropped. LLMs that want to *receive* images back need a
  vision-capable endpoint and per-message image support we haven't
  wired. Acceptable for text-first tools.
- Streaming re-runs the final turn in streaming mode. Costs one
  extra (cheap) LLM call per conversation. Fair trade for
  token-by-token UI output.

Tests: 3 existing `echo_harness` tests pass. Full behavioral coverage
of the agent loop requires a mock LLM provider + MCP manager — that's
larger plumbing than fits here; the code is exercised end-to-end
whenever the desktop app issues a tool-using request.
@wmeddie wmeddie deployed to integration April 21, 2026 00:44 — with GitHub Actions Active