Recover MCP tools after transport closes #13098

Closed
swordfish444 wants to merge 1 commit into openai:main from swordfish444:codex/mcp-transport-recovery

@swordfish444 (Contributor) commented Feb 28, 2026

Summary

This PR makes MCP tool calls self-heal when the underlying MCP transport has closed, instead of requiring a Codex app restart.

It addresses the user-facing failure pattern reported in #6649, where Playwright and Chrome DevTools MCP calls fail with "Transport closed" and the server stays wedged.

What changed

  • Added a restartable MCP client wrapper in McpConnectionManager.
  • When a tool call fails with an error containing "transport closed", Codex now:
    1. Rebuilds the MCP client for that server.
    2. Re-sends latest sandbox state to the restarted server.
    3. Retries the same tool call once.
  • Keeps generation-based stale guards so concurrent failures do not trigger duplicate restarts.
  • Added integration regression coverage:
    • New test where a stdio MCP server exits after the first tool call; second call must recover successfully.
  • Extended test_stdio_server with a test-only MCP_TEST_EXIT_AFTER_CALL mode used by the regression test.
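
For readers skimming the description, the restart-and-retry flow with a generation guard can be sketched roughly as follows. This is a minimal synchronous illustration, not the code in this PR: `ToolClient`, `RestartableClient`, `restart_if_current`, and `ExitAfterFirstCall` are hypothetical names, errors are plain strings, and the step that re-sends sandbox state is only marked with a comment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an MCP client (not the real rmcp client API).
pub trait ToolClient {
    fn call_tool(&self, tool: &str) -> Result<String, String>;
}

struct Slot<C> {
    client: Arc<C>,
    generation: u64, // bumped on every successful restart
}

/// Wrapper that rebuilds its inner client when the transport has closed.
pub struct RestartableClient<C> {
    slot: Mutex<Slot<C>>,
    connect: Box<dyn Fn() -> C + Send + Sync>,
}

impl<C: ToolClient> RestartableClient<C> {
    pub fn new(connect: impl Fn() -> C + Send + Sync + 'static) -> Self {
        let client = Arc::new(connect());
        Self {
            slot: Mutex::new(Slot { client, generation: 0 }),
            connect: Box::new(connect),
        }
    }

    pub fn generation(&self) -> u64 {
        self.slot.lock().unwrap().generation
    }

    /// Copy out the current client and the generation it belongs to.
    fn snapshot(&self) -> (Arc<C>, u64) {
        let slot = self.slot.lock().unwrap();
        (Arc::clone(&slot.client), slot.generation)
    }

    /// Restart only if nobody has restarted since `seen` was observed, so
    /// concurrent failures against the same dead client restart it once.
    pub fn restart_if_current(&self, seen: u64) {
        let mut slot = self.slot.lock().unwrap();
        if slot.generation == seen {
            slot.client = Arc::new((self.connect)());
            slot.generation += 1;
            // The PR re-sends the latest sandbox state here; omitted in this sketch.
        }
    }

    /// Call the tool; on a closed transport, restart and retry exactly once.
    pub fn call_tool(&self, tool: &str) -> Result<String, String> {
        let (client, seen) = self.snapshot();
        match client.call_tool(tool) {
            Err(e) if e.to_ascii_lowercase().contains("transport closed") => {
                self.restart_if_current(seen);
                let (fresh, _) = self.snapshot();
                fresh.call_tool(tool) // a second failure is returned as-is
            }
            other => other,
        }
    }
}

/// Test double that mimics a server exiting after its first tool call.
#[derive(Default)]
pub struct ExitAfterFirstCall {
    calls: AtomicUsize,
}

impl ToolClient for ExitAfterFirstCall {
    fn call_tool(&self, tool: &str) -> Result<String, String> {
        if self.calls.fetch_add(1, Ordering::SeqCst) == 0 {
            Ok(format!("ok: {}", tool))
        } else {
            Err("Transport closed".to_string())
        }
    }
}
```

With `RestartableClient::new(ExitAfterFirstCall::default)`, the second `call_tool` hits the dead client, rebuilds it, and succeeds on the retry. Comparing generations under the lock is what keeps a late-arriving failure from tearing down a client that another caller has already replaced.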

Why this helps browser MCPs

Playwright and Chrome DevTools MCP failures in long-lived sessions often surface as Transport closed and remain unrecoverable until app restart. This change converts that into in-session recovery for subsequent calls.

QA checklist

  • just fmt
  • just fix -p codex-core
  • just fix -p codex-rmcp-client
  • cargo build -p codex-rmcp-client --bin test_stdio_server
  • cargo test -p codex-core mcp_tool_call_recovers_from_transport_closed
  • cargo test -p codex-rmcp-client
  • cargo test --all-features (not run yet; can run before merge if requested)

cc @gpeal for visibility, since this overlaps the browser MCP transport thread history.

@swordfish444
Contributor Author

@gpeal sharing this since it directly targets the long-lived browser MCP Transport closed wedge reported in #6649. This adds in-session restart/retry so users should no longer need full app restarts.

@etraut-openai
Collaborator

We've updated our contribution guidelines to indicate that we're no longer accepting unsolicited code contributions. All code contributions are by invitation only. To read more about why we've taken this step, please refer to this announcement.

@swordfish444
Contributor Author

Follow-up update after true end-to-end validation on macOS with the patched app-server:

Additional fix

  • Expanded transport-recovery detection in is_transport_closed_error(...) to include:
    • Transport send error
    • Transport receive error
    • transport errors containing Broken pipe / Connection reset
  • Added unit tests for these cases.
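
As a rough illustration of the expanded detection, the matching could look like the sketch below. The real signature of `is_transport_closed_error(...)` is not shown in this thread, so the `&str` parameter, the case-insensitive comparison, and the requirement that OS-level messages also mention "transport" are all assumptions.

```rust
/// Hypothetical classifier: true when an error message looks like a dead
/// MCP transport. The `&str` input is an assumption made for this sketch.
fn is_transport_closed_error(message: &str) -> bool {
    let msg = message.to_ascii_lowercase();

    // Markers that identify a dead transport on their own.
    const DIRECT: [&str; 3] = [
        "transport closed",
        "transport send error",
        "transport receive error",
    ];
    if DIRECT.iter().any(|marker| msg.contains(marker)) {
        return true;
    }

    // OS-level failures count only when reported as a transport error.
    const OS_LEVEL: [&str; 2] = ["broken pipe", "connection reset"];
    msg.contains("transport") && OS_LEVEL.iter().any(|marker| msg.contains(marker))
}
```

Scoping "Broken pipe" and "Connection reset" to transport errors is meant to avoid restarting a healthy server when a tool merely reports such a failure in its own output.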

Automated checks run

  • just fmt
  • just fix -p codex-core
  • cargo test -p codex-core is_transport_closed_error
  • cargo test -p codex-core mcp_tool_call_recovers_from_transport_closed

E2E validation (isolated patched app-server)

Server:

  • ./target/debug/codex app-server --listen ws://127.0.0.1:4322

Chrome DevTools MCP

  1. First turn (tool call OK)
    • Thread: 019ca24b-08ab-7332-8c08-c3ae76afd564
    • Tool: mcp__chrome-devtools__new_page
    • Log: /tmp/e2e-chrome2-turn1.log
  2. Forced transport close
    • Killed child MCP process owned by the test app-server: kill -9 10162
  3. Second turn in same thread (no app restart)
    • Tool call completed successfully (Ok) in /tmp/e2e-chrome2-turn2.log
    • App-server log shows recovery path triggered:
      • MCP transport closed for server chrome-devtools; restarting

Playwright MCP

  1. First turn (tool call OK)
    • Thread: 019ca24c-1ce1-77b3-bdbb-67bb16fd69da
    • Tool: mcp__playwright__browser_tabs
    • Log: /tmp/e2e-playwright2-turn1.log
  2. Forced transport close
    • Killed child MCP processes owned by the test app-server: kill -9 10154 11782
  3. Second turn in same thread (no app restart)
    • Tool call completed successfully (Ok) in /tmp/e2e-playwright2-turn2.log
    • App-server log shows recovery path triggered:
      • MCP transport closed for server playwright; restarting

These runs reproduce the transport-drop scenario and verify in-session recovery for both affected browser MCPs without restarting Codex.
