Resumed sandbox can report idle while exec/fs are broken and driver reconnect nudges stay queued #1350

@rblalock

Summary

This appears to be a different failure mode than #1343.

For codesess_3541ed2048cc, the sandbox resumed far enough that the platform now reports it as idle, but the resumed instance is broken:

  • the Hub was able to reconnect to the sandbox instance
  • the Hub attempted the post-resume USR1 reconnect nudge twice
  • both nudge executions stayed stuck in the queued state
  • a direct execute() against the sandbox returns 500 Internal Server Error
  • a direct filesystem listing returns 404 with cow-merged: no such file or directory
  • the long-running driver job still reports running

So the sandbox looks alive in metadata/status, but control and filesystem operations are not actually usable.

Affected IDs

  • Hub session: codesess_3541ed2048cc
  • Sandbox: sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959
  • Driver job: job_30e61ba73aad740180b7e179
  • Queued reconnect-nudge executions:
    • exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94
    • exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa

Hub-side behavior

Server logs during the failed wake path included:

[WARN]  [hub] Controller rejected for unknown session codesess_3541ed2048cc
[INFO] Resuming sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959 for session codesess_3541ed2048cc
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (immediate post-resume): Internal Server Error
[INFO] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: waiting for the CRIU-restored driver to reconnect via WebSocket after the USR1 reconnect nudge
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: CRIU-restored driver still has not reconnected after 5000ms; sending a follow-up reconnect nudge
[WARN]  Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (follow-up): Internal Server Error

The Hub session state then moved from creating to error after the 120s driver-connect timeout:

  • final Hub error: Sandbox driver did not connect within 120000ms. Check sandbox background jobs for details.
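The retry shape the logs describe (immediate nudge, 5s follow-up wait, overall connect deadline) can be sketched as a generic wait loop. `nudge` and `connected` below are hypothetical stand-ins for the real Hub operations, and timings are scaled down to poll ticks for illustration:

```shell
# Sketch of the wake path in the logs above: nudge the driver, poll for
# reconnection, send one follow-up nudge after a grace period, and give
# up after an overall deadline. `nudge` and `connected` are stand-ins
# for the real Hub operations; times are poll ticks, not milliseconds.
wait_for_driver() {
  deadline="$1"; renudge_at="$2"
  elapsed=0; renudged=0
  nudge || echo "nudge failed (immediate post-resume)" >&2
  while [ "$elapsed" -lt "$deadline" ]; do
    if connected; then
      echo "driver reconnected"
      return 0
    fi
    if [ "$renudged" -eq 0 ] && [ "$elapsed" -ge "$renudge_at" ]; then
      nudge || echo "nudge failed (follow-up)" >&2
      renudged=1
    fi
    sleep 0.05
    elapsed=$((elapsed + 1))
  done
  echo "driver did not connect within deadline" >&2
  return 1
}
```

In this incident both the immediate and follow-up nudges failed (500) and the driver never reconnected, so the loop's deadline branch is the one that fired.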

Platform-side state

Sandbox metadata

Current platform sandbox state is:

  • idle

The original tracked driver job still shows:

  • status: running
  • no completion
  • no replacement job

Reconnect nudge executions

Hub sends the reconnect nudge by calling execute() inside the resumed sandbox with a command that sends USR1 to the rpc-driver process.

Those two executions exist on the platform and are still queued, not failed, not completed:

  • exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94
  • exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa

Their command is the expected Hub nudge command:

set -eu
if command -v pkill >/dev/null 2>&1; then
  if pkill -USR1 -f '[b]un /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
  if pkill -USR1 -f '[b]un run /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
fi
PID="$(ps -eo pid=,args= | awk '/[b]un (run )?\/home\/agentuity\/client\/driver\/rpc-driver\.ts/ { print $1; exit }')"
if [ -z "$PID" ]; then
  echo "rpc-driver process not found" >&2
  exit 1
fi
kill -USR1 "$PID"
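The command above relies on the driver treating SIGUSR1 as a reconnect request. A minimal shell stand-in for that signal contract (the real driver is a Bun/TypeScript process, so this only illustrates the mechanism, not the driver itself):

```shell
# Stand-in for the driver side of the nudge: a background process traps
# USR1 and records a reconnect request. The real driver is a Bun process;
# this only illustrates the signal contract the Hub command relies on.
flag="$(mktemp)"
(
  trap "echo reconnect-requested > $flag" USR1
  # simulate the driver staying alive long enough to be signaled
  i=0
  while [ "$i" -lt 10 ]; do sleep 0.1; i=$((i+1)); done
) &
driver_pid=$!
sleep 0.3
kill -USR1 "$driver_pid"   # what the Hub's pkill -USR1 command does
wait "$driver_pid"
cat "$flag"
```

Because both queued executions never ran, the signal in this incident was never actually delivered to the driver process.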

Direct low-level probes against the resumed sandbox

  1. A direct filesystem list via SDK listFiles("/home/agentuity") failed with:
not found: error listing files in /home/agentuity: readdirent /working/sandbox/sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959/cow-merged: no such file or directory
  2. A direct SDK execute() failed with:
500 Internal Server Error

This is the same underlying symptom the Hub hit when trying to send the USR1 reconnect nudge.
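The readdirent error is consistent with the merged copy-on-write overlay root simply being absent after the restore. In isolation the symptom looks like this (paths here are illustrative, not the platform's real layout):

```shell
# Illustration of the filesystem symptom: listing a directory whose
# backing "cow-merged" mount point is gone fails at the readdir level
# with "no such file or directory". Paths are illustrative only.
root="$(mktemp -d)"
mkdir -p "$root/cow-merged"
ls "$root/cow-merged" >/dev/null 2>&1 && echo "fs ok"
rmdir "$root/cow-merged"   # simulates the missing merged overlay root
if ! ls "$root/cow-merged" >/dev/null 2>&1; then
  echo "fs broken: cow-merged missing"
fi
```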

Lifecycle timeline

Relevant recent platform events for the affected sandbox:

  • 2026-04-03T15:21:46Z suspend with checkpoint id ckpt_20c38cee01645cde
  • 2026-04-03T15:21:55Z lifecycle:resumed (deferred: true)
  • 2026-04-03T15:24:03Z another evacuation/suspend sequence
  • 2026-04-04T01:01:44Z lifecycle:reconcile(previous_status=suspended)

After the latest Hub wake attempt, the sandbox now reports idle, but control/filesystem behavior is still broken as described above.

Control experiment

I also created a fresh empty control sandbox in the same org/runtime and tested:

  1. create sandbox
  2. copy a file in after creation
  3. verify file via exec and fs listing
  4. pause sandbox
  5. resume sandbox manually via CLI
  6. verify file persistence and fresh exec/fs access after resume

That control path succeeded end-to-end. So this does not look like a general pause/resume outage; it looks specific to this resumed sandbox entering a bad state.

Expected behavior

If the sandbox reports idle, the following should also work consistently:

  • execute()
  • filesystem list/read operations
  • queued reconnect-nudge executions, which should start and then either complete or fail with a concrete process-level error

The platform should not report an idle sandbox whose filesystem mount path is missing and whose exec/control operations are stuck or returning 500.
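One way to state that invariant: status reporting could gate on the same probes the Hub depends on. A minimal sketch, in which the function name, argument, and report values are hypothetical rather than the platform's actual API:

```shell
# Sketch of the invariant argued above: only report "idle" when both a
# filesystem probe and an exec probe succeed against the sandbox root.
# Function name, argument, and output values are hypothetical.
check_sandbox_health() {
  root="$1"
  # fs probe: the merged overlay root must exist and be listable
  ls "$root" >/dev/null 2>&1 || { echo degraded; return 1; }
  # exec probe: a trivial command must complete in the environment
  sh -c true >/dev/null 2>&1 || { echo degraded; return 1; }
  echo idle
}
```

Under this check, the sandbox in this report would surface as degraded (the fs probe fails on the missing cow-merged path) instead of idle.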
