Resumed sandbox can report idle while exec/fs are broken and driver reconnect nudges stay queued #1350

@rblalock

Summary

This appears to be a different failure mode than #1343.

For codesess_3541ed2048cc, the sandbox resumed far enough that the platform now reports it as idle, but the resumed instance is broken:

  • the Hub was able to reconnect to the sandbox instance
  • the Hub attempted the post-resume USR1 reconnect nudge twice
  • both nudge executions stayed stuck in the queued state
  • a direct execute() against the sandbox returns 500 Internal Server Error
  • a direct filesystem listing returns 404 with cow-merged: no such file or directory
  • the long-running driver job still reports running

So the sandbox looks alive in metadata/status, but control and filesystem operations are not actually usable.

Affected IDs

  • Hub session: codesess_3541ed2048cc
  • Sandbox: sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959
  • Driver job: job_30e61ba73aad740180b7e179
  • Queued reconnect-nudge executions:
    • exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94
    • exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa

Hub-side behavior

Server logs during the failed wake path included:

[WARN]  [hub] Controller rejected for unknown session codesess_3541ed2048cc
[INFO] Resuming sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959 for session codesess_3541ed2048cc
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (immediate post-resume): Internal Server Error
[INFO] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: waiting for the CRIU-restored driver to reconnect via WebSocket after the USR1 reconnect nudge
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: CRIU-restored driver still has not reconnected after 5000ms; sending a follow-up reconnect nudge
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (follow-up): Internal Server Error

The Hub session state then moved from creating to error after the 120s driver-connect timeout:

  • final Hub error: Sandbox driver did not connect within 120000ms. Check sandbox background jobs for details.
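The retry shape the logs describe (immediate nudge, 5s follow-up wait, overall connect deadline) can be sketched as a generic wait loop. `nudge` and `connected` below are hypothetical stand-ins for the real Hub operations, and timings are scaled down to poll ticks for illustration:

```shell
# Sketch of the wake path in the logs above: nudge the driver, poll for
# reconnection, send one follow-up nudge after a grace period, and give
# up after an overall deadline. `nudge` and `connected` are stand-ins
# for the real Hub operations; times are poll ticks, not milliseconds.
wait_for_driver() {
  deadline="$1"; renudge_at="$2"
  elapsed=0; renudged=0
  nudge || echo "nudge failed (immediate post-resume)" >&2
  while [ "$elapsed" -lt "$deadline" ]; do
    if connected; then
      echo "driver reconnected"
      return 0
    fi
    if [ "$renudged" -eq 0 ] && [ "$elapsed" -ge "$renudge_at" ]; then
      nudge || echo "nudge failed (follow-up)" >&2
      renudged=1
    fi
    sleep 0.05
    elapsed=$((elapsed + 1))
  done
  echo "driver did not connect within deadline" >&2
  return 1
}
```

In this incident both the immediate and follow-up nudges failed (500) and the driver never reconnected, so the loop's deadline branch is the one that fired.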

Platform-side state

Sandbox metadata

Current platform sandbox state is:

  • idle

The original tracked driver job still shows:

  • status: running
  • no completion
  • no replacement job

Reconnect nudge executions

Hub sends the reconnect nudge by calling execute() inside the resumed sandbox with a command that sends USR1 to the rpc-driver process.

Those two executions exist on the platform and are still queued, not failed, not completed:

  • exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94
  • exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa

Their command is the expected Hub nudge command:

set -eu
if command -v pkill >/dev/null 2>&1; then
  if pkill -USR1 -f '[b]un /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
  if pkill -USR1 -f '[b]un run /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
fi
PID="$(ps -eo pid=,args= | awk '/[b]un (run )?\/home\/agentuity\/client\/driver\/rpc-driver\.ts/ { print $1; exit }')"
if [ -z "$PID" ]; then
  echo "rpc-driver process not found" >&2
  exit 1
fi
kill -USR1 "$PID"
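The command above relies on the driver treating SIGUSR1 as a reconnect request. A minimal shell stand-in for that signal contract (the real driver is a Bun/TypeScript process, so this only illustrates the mechanism, not the driver itself):

```shell
# Stand-in for the driver side of the nudge: a background process traps
# USR1 and records a reconnect request. The real driver is a Bun process;
# this only illustrates the signal contract the Hub command relies on.
flag="$(mktemp)"
(
  trap "echo reconnect-requested > $flag" USR1
  # simulate the driver staying alive long enough to be signaled
  i=0
  while [ "$i" -lt 10 ]; do sleep 0.1; i=$((i+1)); done
) &
driver_pid=$!
sleep 0.3
kill -USR1 "$driver_pid"   # what the Hub's pkill -USR1 command does
wait "$driver_pid"
cat "$flag"
```

Because both queued executions never ran, the signal in this incident was never actually delivered to the driver process.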

Direct low-level probes against the resumed sandbox

  1. A direct filesystem list via SDK listFiles("/home/agentuity") failed with:
not found: error listing files in /home/agentuity: readdirent /working/sandbox/sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959/cow-merged: no such file or directory
  2. A direct SDK execute() failed with:
500 Internal Server Error

This is the same underlying symptom the Hub hit when trying to send the USR1 reconnect nudge.
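The readdirent error is consistent with the merged copy-on-write overlay root simply being absent after the restore. In isolation the symptom looks like this (paths here are illustrative, not the platform's real layout):

```shell
# Illustration of the filesystem symptom: listing a directory whose
# backing "cow-merged" mount point is gone fails at the readdir level
# with "no such file or directory". Paths are illustrative only.
root="$(mktemp -d)"
mkdir -p "$root/cow-merged"
ls "$root/cow-merged" >/dev/null 2>&1 && echo "fs ok"
rmdir "$root/cow-merged"   # simulates the missing merged overlay root
if ! ls "$root/cow-merged" >/dev/null 2>&1; then
  echo "fs broken: cow-merged missing"
fi
```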

Lifecycle timeline

Relevant recent platform events for the affected sandbox:

  • 2026-04-03T15:21:46Z suspend with checkpoint id ckpt_20c38cee01645cde
  • 2026-04-03T15:21:55Z lifecycle:resumed (deferred: true)
  • 2026-04-03T15:24:03Z another evacuation/suspend sequence
  • 2026-04-04T01:01:44Z lifecycle:reconcile(previous_status=suspended)

After the latest Hub wake attempt, the sandbox now reports idle, but control/filesystem behavior is still broken as described above.

Control experiment

I also created a fresh empty control sandbox in the same org/runtime and tested:

  1. create sandbox
  2. copy a file in after creation
  3. verify file via exec and fs listing
  4. pause sandbox
  5. resume sandbox manually via CLI
  6. verify file persistence and fresh exec/fs access after resume

That control path succeeded end-to-end. So this does not look like a general pause/resume outage; it looks specific to this resumed sandbox entering a bad state.

Expected behavior

If the sandbox reports idle, the following should also work consistently:

  • execute()
  • filesystem list/read operations
  • queued reconnect-nudge executions, which should start and then either complete or fail with a concrete process-level error

The platform should not report an idle sandbox whose filesystem mount path is missing and whose exec/control operations are stuck or returning 500.
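One way to state that invariant: status reporting could gate on the same probes the Hub depends on. A minimal sketch, in which the function name, argument, and report values are hypothetical rather than the platform's actual API:

```shell
# Sketch of the invariant argued above: only report "idle" when both a
# filesystem probe and an exec probe succeed against the sandbox root.
# Function name, argument, and output values are hypothetical.
check_sandbox_health() {
  root="$1"
  # fs probe: the merged overlay root must exist and be listable
  ls "$root" >/dev/null 2>&1 || { echo degraded; return 1; }
  # exec probe: a trivial command must complete in the environment
  sh -c true >/dev/null 2>&1 || { echo degraded; return 1; }
  echo idle
}
```

Under this check, the sandbox in this report would surface as degraded (the fs probe fails on the missing cow-merged path) instead of idle.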
