feat(ce-work-beta): add beta Codex delegation mode by tmchow · Pull Request #476 · EveryInc/compound-engineering-plugin

tmchow · 2026-04-01T03:18:25Z

Summary

Adds Codex delegation as a beta-only execution path in ce:work-beta, validated through 6 iterations of evaluation comparing delegation vs standard mode across small, medium, large, and extra-large plans. The final architecture uses batched delegation, reference file extraction for token efficiency, and a <testing> prompt section that closes the test quality gap.

Credit to @mvanhorn for the original push on delegation and @huntharo for permissioning thought partnership. This branch supersedes the earlier direction from #364 and #365, now implemented as a beta rollout.

How to Test (Pre-Merge)

This is a beta skill — it requires manual invocation and does not affect the stable ce:work path.

Option A: Point at a local checkout (simplest)

git clone https://github.com/EveryInc/compound-engineering-plugin.git
cd compound-engineering-plugin
git checkout feat/codex-delegation-work

Then from any project directory:

claude --plugin-dir /path/to/compound-engineering-plugin/plugins/compound-engineering

Option B: Use the plugin-path CLI (keeps your checkout clean)

cd compound-engineering-plugin
bun install
bun run src/index.ts plugin-path compound-engineering --branch feat/codex-delegation-work

This outputs a --plugin-dir path. On re-run it pulls the latest from the remote branch.

claude --plugin-dir <path from above>

Try it out:

# Standard mode (no delegation) — should behave identically to ce:work
/ce:work-beta path/to/your-plan.md

# With Codex delegation (requires codex CLI installed)
/ce:work-beta delegate:codex path/to/your-plan.md

On first delegation, the skill prompts for one-time consent and sandbox mode selection.

Optional: --dangerously-skip-permissions reduces permission prompts during delegation (codex exec, git operations). Not required.

Prerequisites for delegation: Codex CLI installed and on PATH (npm install -g @openai/codex).

Key Findings from Evaluation

6 iterations of eval with real code implementation in isolated worktrees:

| Plan size | Units | Delegate tokens | Standard tokens | Overhead | Verdict |
| --- | --- | --- | --- | --- | --- |
| Small | 1-3 | 51-63k | 38-42k | +34-50% | Not worth it for token savings |
| Medium | 4 | 54k | 53k | +2% | Marginal |
| Large | 7 | 62k | 62k | +1% | Break-even |
| Extra-large | 10 | 54k | 62k | -13% | Delegation is cheaper |

Crossover point: ~5-7 units. Above that, delegation saves Claude tokens. Users may still choose delegation below the crossover for cost arbitrage (Codex tokens cheaper) or coding preference.
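The overhead column is the relative difference between delegate and standard token counts. A minimal sketch of that arithmetic, assuming the rounded "k" figures from the table as inputs (`overheadPct` is a hypothetical helper, not part of the plugin):

```typescript
// Relative overhead of delegate mode vs. standard mode, in whole percent.
// Inputs are Claude token counts in thousands, as rounded in the table.
function overheadPct(delegateK: number, standardK: number): number {
  return Math.round(((delegateK - standardK) / standardK) * 100);
}

console.log(overheadPct(54, 53)); // medium plan: 2
console.log(overheadPct(54, 62)); // extra-large plan: -13
```

The large-plan row (+1% on 62k vs 62k) cannot be reproduced from the rounded figures; it presumably comes from the unrounded counts.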

Architecture

Reference file extraction (the biggest optimization)

Delegation content (~250 lines) was extracted from the SKILL.md body into references/codex-delegation-workflow.md, which is loaded on demand only when delegation is active. SKILL.md shrank from 776 to 514 lines -- a 34% body reduction that saves ~15k Claude tokens per non-delegation run.

Batched execution model

Replaced per-unit delegation (N codex exec calls) with batched delegation (ceil(N/5) calls). Plans with <=5 units run as a single batch; larger plans are split at roughly 5 units per batch. This reduces orchestration from O(N) to O(batches).
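The batching rule can be sketched as plain chunking. This is an illustration only: `batchUnits` and `MAX_BATCH_SIZE` are hypothetical names, and the real workflow may split on other signals (such as file lists) rather than at a fixed size.

```typescript
// Assumed cap, taken from the "split at roughly 5" rule described above.
const MAX_BATCH_SIZE = 5;

// Split plan units into ordered batches of at most `maxSize` units each.
// For N units this yields ceil(N / maxSize) batches.
function batchUnits<T>(units: T[], maxSize: number = MAX_BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < units.length; i += maxSize) {
    batches.push(units.slice(i, i + maxSize));
  }
  return batches;
}

// A 7-unit plan becomes two batches: units 1-5, then units 6-7.
const sevenUnits = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"];
console.log(batchUnits(sevenUnits).map((batch) => batch.length));
```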

Codex owns verify

Codex runs tests and fixes failures within the delegation -- the orchestrator does not re-verify independently. Safety net: circuit breaker (3 consecutive failures -> standard mode fallback), plus Phase 3 full test suite before shipping.
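A minimal sketch of the circuit breaker, assuming that a successful batch resets the counter (which is what "consecutive" implies). The class and method names are hypothetical.

```typescript
// Counts consecutive delegation failures and reports when to fall back
// to standard mode. Threshold of 3 comes from the description above.
class DelegationCircuitBreaker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Assumption: any successful batch clears the failure streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Returns true when the breaker has tripped and standard mode should take over.
  recordFailure(): boolean {
    this.consecutiveFailures += 1;
    return this.isTripped();
  }

  isTripped(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }
}
```

Keeping the counter in one small object makes the fallback decision independent of how each batch result was classified.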

flowchart TD
  A[ce:work-beta invoked with plan] --> B{delegation_active?}
  B -- no --> C[Standard execution]
  B -- yes --> D[Read delegation workflow reference]

  D --> E[Pre-delegation checks: platform, env, CLI, consent]
  E -- any fail --> C
  E -- all pass --> F{work_delegate_decision}

  F -- auto --> G[State plan and proceed]
  F -- ask --> H[Present recommendation, wait for choice]
  H -- Claude Code --> C
  H -- Codex --> G

  G --> I{All units trivial?}
  I -- yes --> C
  I -- no --> J[Batch units]

  J --> K[Write prompt with testing + verify sections]
  K --> L[codex exec per batch]
  L --> M{Result classification}
  M -- success --> N[Commit batch]
  M -- partial --> O[Finish locally, commit]
  M -- failure --> P[Rollback, increment circuit breaker]
  P --> Q{3 failures?}
  Q -- yes --> C
  Q -- no --> K

  N --> R{More batches?}
  R -- yes --> S[Report progress, continue]
  S --> K
  R -- no --> T[Cleanup scratch, Phase 3]
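The routing half of the flowchart (everything before batching) can be read as a pure decision function. The sketch below is one way to express it; the type and field names are hypothetical, and treating a missing user choice in `ask` mode as "standard" is an assumption, since the real skill waits for the user.

```typescript
type Route = "standard" | "delegate";

interface RoutingInput {
  delegationActive: boolean;
  preChecks: { platform: boolean; env: boolean; cli: boolean; consent: boolean };
  decision: "auto" | "ask"; // work_delegate_decision
  userChoice?: "claude-code" | "codex"; // only consulted when decision is "ask"
  allUnitsTrivial: boolean;
}

function chooseRoute(input: RoutingInput): Route {
  if (!input.delegationActive) return "standard";

  // Any failed pre-delegation check falls back to standard execution.
  const { platform, env, cli, consent } = input.preChecks;
  if (!(platform && env && cli && consent)) return "standard";

  // In "ask" mode the user must explicitly pick Codex.
  if (input.decision === "ask" && input.userChoice !== "codex") return "standard";

  // Trivial plans are not worth a delegation round trip.
  if (input.allUnitsTrivial) return "standard";

  return "delegate";
}
```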

What Changed

plugins/compound-engineering/
  AGENTS.md                                  # refined shell-chaining rule
  compound-engineering.local.example.md      # NEW — settings template
  skills/
    ce-work-beta/
      SKILL.md                               # slimmed body (776 -> 514 lines)
      references/
        codex-delegation-workflow.md          # NEW — full delegation workflow
        swarm-mode.md                        # NEW — extracted from body
    ce-work/
      SKILL.md                               # stable, non-delegating
    ce-plan/
      SKILL.md                               # neutral, handoffs stay on ce:work

docs/
  brainstorms/
    2026-03-31-codex-delegation-requirements.md   # NEW — 19 requirements
  plans/
    2026-03-31-001-feat-codex-delegation-plan.md  # NEW — 6 implementation units
  solutions/
    best-practices/
      codex-delegation-best-practices-2026-04-01.md  # NEW — compounded learning
    skill-design/
      ce-work-beta-promotion-checklist-2026-03-31.md # NEW — promotion steps

tests/
  pipeline-review-contract.test.ts           # delegation contract coverage

Prompt Engineering for Delegation

The Codex prompt template includes:

<testing> section -- Test Scenario Completeness guidance (happy path, edge cases, error paths, integration). Improved Codex test output by ~35% on large plans.
<verify> section -- requires running ALL tests in a single command (not per-file) to catch cross-file contamination. Discovered in eval when mocked globalThis.fetch leaked between test files.
<constraints> -- no git commits, scoped changes, honest result reporting
<output_contract> -- structured result schema for classification
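To illustrate how such an XML-tagged prompt might be assembled, here is a small sketch. Only the four tag names listed above come from the PR; the `<task>` tag, the helper names, and the section ordering are assumptions for illustration.

```typescript
interface PromptSections {
  task: string; // hypothetical section carrying the batch's unit descriptions
  testing: string;
  verify: string;
  constraints: string;
  outputContract: string;
}

// Wrap a section body in an XML-style tag.
function tag(name: string, body: string): string {
  return `<${name}>\n${body.trim()}\n</${name}>`;
}

function buildDelegationPrompt(sections: PromptSections): string {
  return [
    tag("task", sections.task),
    tag("testing", sections.testing),
    tag("verify", sections.verify),
    tag("constraints", sections.constraints),
    tag("output_contract", sections.outputContract),
  ].join("\n\n");
}

// Example wording is paraphrased from the list above, not the real template.
const examplePrompt = buildDelegationPrompt({
  task: "Implement the units in this batch.",
  testing: "Cover happy path, edge cases, error paths, and integration.",
  verify: "Run ALL tests in a single command, not per file.",
  constraints: "No git commits. Keep changes scoped. Report results honestly.",
  outputContract: "Return a structured result for classification.",
});
console.log(examplePrompt);
```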

User Settings

# .claude/compound-engineering.local.md
---
work_delegate: codex              # codex | false (default: false)
work_delegate_consent: true       # true | false (default: false)
work_delegate_sandbox: yolo       # yolo | full-auto (default: yolo)
work_delegate_decision: auto      # auto | ask (default: auto)
---
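A later commit in this PR states that unrecognized setting values fall through to the hard defaults. A minimal sketch of that behavior, assuming the front matter has already been parsed into string values (`SETTING_SPECS` and `resolveSettings` are hypothetical names):

```typescript
interface SettingSpec {
  allowed: readonly string[];
  fallback: string;
}

// Recognized values and defaults, copied from the settings block above.
const SETTING_SPECS: Record<string, SettingSpec> = {
  work_delegate: { allowed: ["codex", "false"], fallback: "false" },
  work_delegate_consent: { allowed: ["true", "false"], fallback: "false" },
  work_delegate_sandbox: { allowed: ["yolo", "full-auto"], fallback: "yolo" },
  work_delegate_decision: { allowed: ["auto", "ask"], fallback: "auto" },
};

// Missing or unrecognized values resolve to the setting's default.
function resolveSettings(raw: Record<string, string | undefined>): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const [key, spec] of Object.entries(SETTING_SPECS)) {
    const value = raw[key];
    resolved[key] = value !== undefined && spec.allowed.includes(value) ? value : spec.fallback;
  }
  return resolved;
}

// "gemini" is not a recognized delegate, so work_delegate falls back to "false".
console.log(resolveSettings({ work_delegate: "gemini", work_delegate_decision: "ask" }));
```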

Promotion Follow-Up

When ready to promote beta -> stable:

Copy validated implementation from ce:work-beta into ce:work
Remove beta-only manual-invocation caveats
Update planner and workflow handoffs atomically
Move contract tests from beta surface to stable surface
Retire or redirect ce:work-beta

See docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.md for detailed steps.

Test plan

bun test tests/pipeline-review-contract.test.ts -- 24 pass, delegation contract verified
bun test -- 547 pass (1 pre-existing path sanitization failure)
6 iterations of eval: routing correctness, code correctness, token measurement, circuit breaker fallback
Codex delegation end-to-end: prompt generation -> codex exec -> result classification -> commit (verified with real codex CLI)

Post-Deploy Monitoring & Validation

No additional operational monitoring required. Changes are limited to skill contracts, reference files, planning docs, and test coverage in the plugin repo. The beta skill requires manual invocation (ce:work-beta) and does not affect the stable ce:work path.

🤖 Generated with Claude Opus 4.6 (1M context, extended thinking) via Claude Code

Adds optional `delegate:codex` mode to ce:work that delegates code implementation to the Codex CLI (`codex exec`) using concrete bash templates. Replaces ce-work-beta's prose-based delegation which caused non-deterministic CLI invocations.

Key additions:
- Argument parsing with `delegate:codex`/`delegate:local` tokens and resolution chain (argument > local.md > default off)
- Pre-delegation gates: environment guard, availability check, one-time consent flow with sandbox mode selection (yolo/full-auto)
- XML-tagged prompt template following gpt-5-4-prompting best practices
- Multi-signal result classification (CLI fail/task fail/partial/verify fail/success) with rollback-to-HEAD safety
- Circuit breaker: 3 consecutive failures -> standard mode fallback
- Serial execution enforced, swarm mode mutual exclusion
- Frontend Design Guidance ported from ce-work-beta
- ce-work-beta delegation section marked superseded
- `Execution target: external-delegate` removed from ce:plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
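The resolution chain (argument > local.md > default off) can be sketched as a single function. The function name is hypothetical, and letting the last token win when both tokens appear is an assumption the commit message does not address.

```typescript
type LocalDelegateSetting = "codex" | "false" | undefined;

// Returns true when Codex delegation should be active for this invocation.
function resolveDelegation(args: string[], localSetting: LocalDelegateSetting): boolean {
  // 1. Explicit argument tokens take priority.
  //    Assumption: if both tokens are present, the last one wins.
  for (let i = args.length - 1; i >= 0; i -= 1) {
    if (args[i] === "delegate:codex") return true;
    if (args[i] === "delegate:local") return false;
  }
  // 2. Otherwise use the local.md setting.
  // 3. Anything else resolves to the default: delegation off.
  return localSetting === "codex";
}

console.log(resolveDelegation(["delegate:local", "plan.md"], "codex")); // false
```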

1. Remove stale "external delegation" per-unit posture from ce:plan Execution note examples; ce:work reads delegation from the global resolution chain, not unit metadata.
2. Fix delegation fallback to re-enter standard strategy selection. Pre-delegation checks now run inside the routing gate before strategy choice, so disabling delegation falls through to the normal inline/serial/parallel table instead of silently defaulting to inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4304970c34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

plugins/compound-engineering/skills/ce-work/SKILL.md

Restore stable ce:work as the non-delegating execution path during the Codex beta rollout and move the active delegation contract back to ce:work-beta. Also add a promotion checklist doc covering the workflow and contract changes required when ce:work-beta is later promoted to stable.

tmchow · 2026-04-01T04:02:08Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3f7069e77


plugins/compound-engineering/skills/ce-work-beta/SKILL.md

…acked files Preflight now uses `git diff --quiet HEAD` instead of `git status --short` so untracked workspace dirs and .context/ scratch don't falsely block delegation. Rollback uses path-scoped `git clean -fd -- <unit files>` instead of bare `git clean -fd` which would nuke all untracked files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…antics The old `test -n || test -n` returned exit code 1 on the happy path (neither var set), which a literal agent could misread as a failed pre-check and disable delegation in eligible environments. Rewrote as an explicit if/else so pass/fail lives in the variable value, not the exit code. Also refined the AGENTS.md shell-chaining rule to distinguish action chaining (bad) from boolean conditions in if/while guards (fine). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tmchow · 2026-04-01T06:52:34Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3f7069e77


plugins/compound-engineering/skills/ce-work-beta/SKILL.md

Moves ~270 lines of delegation workflow (pre-checks, prompt template, execution loop, result classification) to references/codex-delegation-workflow.md and ~25 lines of swarm mode to references/swarm-mode.md. SKILL.md body drops from ~776 to ~514 lines, a 34% reduction in per-tool-call context cost for non-delegation runs.

New in the delegation reference:
- Batched execution model (all units in one batch, split at ~5)
- Codex owns VERIFY (test-fix loop inside delegation)
- Platform gate (Claude Code only)
- Run-ID namespaced scratch files for concurrent safety
- work_delegation_decision setting (auto/ask) with user-facing prompts
- Between-batch checkpoints (flow through by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provides a discoverable template with all available settings (work_delegate, work_codex_consent, work_codex_sandbox, work_delegation_decision) so users can copy to .claude/compound-engineering.local.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… prompt Adds a <testing> section to the delegation prompt template that carries the Test Scenario Completeness guidance to Codex (cover happy path, edge cases, error paths, integration). Closes the test quality gap observed in evals (Codex produced 57-85% as many tests without this guidance). Also updates <verify> to require running ALL test files in a single command rather than per-file — catches cross-file contamination like mocked globals leaking between test files in the same bun process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
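The cross-file contamination described here happens when a test replaces a global and never restores it, so later test files in the same process see the mock. One common guard is a helper that always restores the original. The sketch below is illustrative only; `withPatchedGlobal` is a hypothetical helper, not something the PR adds.

```typescript
// Temporarily replace a global (for example `fetch`) and always restore the
// original, so the mock cannot leak into other test files that run in the
// same process.
async function withPatchedGlobal<T>(
  name: string,
  replacement: unknown,
  body: () => T | Promise<T>,
): Promise<T> {
  const globals = globalThis as unknown as Record<string, unknown>;
  const hadOriginal = name in globals;
  const original = globals[name];
  globals[name] = replacement;
  try {
    // Await inside try so restoration waits for async test bodies.
    return await body();
  } finally {
    if (hadOriginal) {
      globals[name] = original;
    } else {
      delete globals[name];
    }
  }
}
```

A test would call `withPatchedGlobal("fetch", fakeFetch, async () => { ... })` around the code under test. Running all test files in one command, as the `<verify>` section requires, is what exposes a missing restore in the first place.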

Compounded learning from 6 iterations of delegation evals covering token crossover points, batching strategy, prompt engineering, skill body size as multiplicative cost driver, and user choice considerations. Key findings: delegation breaks even at ~5-7 units and becomes cheaper at 10+. Skill body size dominates cost (multiplicative across all tool calls). Extract conditional content >50 lines to reference files. Also fixes verify section line break in contract test assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Documents how plan structure directly enables delegation decisions — file lists enable batching rules, test scenarios feed Codex prompts, verification commands enable Codex's self-check loop. Delegation works with unstructured plans but makes conservative choices without signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… net Expands the "delegate verify to Codex" pattern with the reasoning: trust the delegate's self-report, protect against systematic failure with the circuit breaker (3 consecutive failures -> standard mode), and verify the whole at Phase 3 before shipping. Three layered catches replace the redundant per-batch orchestrator verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The doc covers economics, architecture, prompt engineering, plan quality, safety model, and user choice — not just economics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds actual token counts from all iterations (not just percentages), wall clock time comparison, test coverage cost, and the iteration evolution table showing the body-size regression and recovery. The economics section now tells the complete story with raw numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Renames for consistency across all delegation settings:
- work_codex_consent -> work_delegate_consent
- work_codex_sandbox -> work_delegate_sandbox
- work_delegation_decision -> work_delegate_decision

All four settings now share the work_delegate_* prefix: work_delegate, work_delegate_consent, work_delegate_sandbox, work_delegate_decision.

Updated across: SKILL.md, delegation workflow reference, example local.md, best practices doc, requirements, plan, and contract tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Unrecognized setting values (e.g., work_delegate: gemini) now fall through to hard defaults instead of producing undefined behavior. Each setting documents its recognized values inline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Shorter, more natural. Users think in terms of "codex mode" not "delegate to codex." The mode: prefix is generic enough for future delegates. Deactivation becomes mode:local. Fuzzy activation phrases unchanged (use codex, codex mode, etc.). Updated across: SKILL.md, contract tests, requirements, and plan docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This reverts commit 4c5da21.

Adds work_delegate_model (default: gpt-5.4) and work_delegate_effort (default: high) to local.md settings. Both are passed explicitly to codex exec via -m and -c 'model_reasoning_effort="..."' flags. Model is passthrough (any valid codex model name). Effort is validated against 5 values: minimal, low, medium, high, xhigh. Invalid values fall through to defaults. Also fixes --yolo to --dangerously-bypass-approvals-and-sandbox (the documented flag name) and adds quoting guidance for the -c flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
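As a rough illustration of how these settings could map onto a `codex exec` argument list, consider the sketch below. The helper names are hypothetical; only the `-m`, `-c 'model_reasoning_effort="..."'`, and `--dangerously-bypass-approvals-and-sandbox` flags are taken from the commit message. The full-auto sandbox mode and the prompt argument are left out because the message does not spell out how they are passed.

```typescript
const EFFORTS = ["minimal", "low", "medium", "high", "xhigh"] as const;
type Effort = (typeof EFFORTS)[number];

const DEFAULT_MODEL = "gpt-5.4";
const DEFAULT_EFFORT: Effort = "high";

// Invalid or missing effort values fall through to the default.
function resolveEffort(raw: string | undefined): Effort {
  const match = EFFORTS.find((effort) => effort === raw);
  return match ?? DEFAULT_EFFORT;
}

// Builds an argv array (no shell involved), so the inner double quotes are
// part of the value. In a shell command the same argument would be written
// as -c 'model_reasoning_effort="high"'.
function buildCodexExecArgs(opts: { model?: string; effort?: string; yolo?: boolean }): string[] {
  const args = [
    "exec",
    "-m",
    opts.model || DEFAULT_MODEL, // model is passthrough: any codex model name
    "-c",
    `model_reasoning_effort="${resolveEffort(opts.effort)}"`,
  ];
  if (opts.yolo) {
    args.push("--dangerously-bypass-approvals-and-sandbox");
  }
  return args;
}

console.log(buildCodexExecArgs({ effort: "turbo", yolo: true }).join(" "));
```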

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…odex exec Launches codex exec with run_in_background (no timeout ceiling) then polls every 10 seconds in a foreground bash loop to keep the agent's turn active. User sees "Waiting for Codex..." during execution and cannot interfere with the working tree. Fixes the 10-minute Bash timeout ceiling that would kill long-running batches where Codex is iterating on test fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
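The commit implements the wait as a foreground bash loop. The same launch-then-poll idea is sketched here in TypeScript for illustration only: `waitForCompletion`, its options, and the completion check are hypothetical, and the sleep function is injectable so the loop can be exercised without real delays.

```typescript
interface PollOptions {
  intervalMs?: number; // defaults to the 10-second interval described above
  sleep?: (ms: number) => Promise<void>;
  onTick?: () => void; // e.g. print "Waiting for Codex..."
}

// Polls `isDone` until it reports completion and returns the number of
// polls that found the job still running. There is deliberately no timeout,
// mirroring the "no timeout ceiling" behavior of the background launch.
async function waitForCompletion(
  isDone: () => boolean | Promise<boolean>,
  options: PollOptions = {},
): Promise<number> {
  const intervalMs = options.intervalMs ?? 10000;
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let pendingPolls = 0;
  while (!(await isDone())) {
    pendingPolls += 1;
    options.onTick?.();
    await sleep(intervalMs);
  }
  return pendingPolls;
}
```

In practice `isDone` would check whatever signal the background job leaves behind, such as a result file or the process exit status.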

tmchow and others added 2 commits March 31, 2026 19:57

chatgpt-codex-connector bot reviewed Apr 1, 2026


tmchow marked this pull request as draft April 1, 2026 03:24

tmchow changed the title ~~feat(ce-work): add Codex delegation mode~~ feat(ce-work-beta): add beta Codex delegation mode Apr 1, 2026

chatgpt-codex-connector bot reviewed Apr 1, 2026


tmchow and others added 2 commits March 31, 2026 23:21

chatgpt-codex-connector bot reviewed Apr 1, 2026


tmchow and others added 15 commits April 1, 2026 12:42

docs(ce-work-beta): rename to Codex Delegation Best Practices

6ba3036

The doc covers economics, architecture, prompt engineering, plan quality, safety model, and user choice — not just economics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Revert "refactor(ce-work-beta): rename delegate:codex to mode:codex"

c17d938

This reverts commit 4c5da21.

chore: gitignore .claude/worktrees/ and remove eval workspace

c497a99

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tmchow wants to merge 20 commits into main from feat/codex-delegation-work
