
feat(ce-work-beta): add beta Codex delegation mode #476

Draft
tmchow wants to merge 20 commits into main from feat/codex-delegation-work

tmchow (Collaborator) commented Apr 1, 2026

Summary

Adds Codex delegation as a beta-only execution path in ce:work-beta, validated through 6 iterations of evaluation comparing delegation vs standard mode across small, medium, large, and extra-large plans. The final architecture uses batched delegation, reference file extraction for token efficiency, and a <testing> prompt section that closes the test quality gap.

Credit to @mvanhorn for the original push on delegation and @huntharo for permissioning thought partnership. This branch supersedes the earlier direction from #364 and #365, now implemented as a beta rollout.

How to Test (Pre-Merge)

This is a beta skill — it requires manual invocation and does not affect the stable ce:work path.

Option A: Point at a local checkout (simplest)

git clone https://github.com/EveryInc/compound-engineering-plugin.git
cd compound-engineering-plugin
git checkout feat/codex-delegation-work

Then from any project directory:

claude --plugin-dir /path/to/compound-engineering-plugin/plugins/compound-engineering

Option B: Use the plugin-path CLI (keeps your checkout clean)

cd compound-engineering-plugin
bun install
bun run src/index.ts plugin-path compound-engineering --branch feat/codex-delegation-work

This outputs a --plugin-dir path. On re-run it pulls the latest from the remote branch.

claude --plugin-dir <path from above>

Try it out:

# Standard mode (no delegation) — should behave identically to ce:work
/ce:work-beta path/to/your-plan.md

# With Codex delegation (requires codex CLI installed)
/ce:work-beta delegate:codex path/to/your-plan.md

On first delegation, the skill prompts for one-time consent and sandbox mode selection.

Optional: launching claude with --dangerously-skip-permissions reduces permission prompts during delegation (codex exec, git operations). It is not required.

Prerequisites for delegation: Codex CLI installed and on PATH (npm install -g @openai/codex).

Key Findings from Evaluation

6 iterations of eval with real code implementation in isolated worktrees:

| Plan size | Units | Delegate tokens | Standard tokens | Overhead | Verdict |
|---|---|---|---|---|---|
| Small | 1-3 | 51-63k | 38-42k | +34-50% | Not worth it for token savings |
| Medium | 4 | 54k | 53k | +2% | Marginal |
| Large | 7 | 62k | 62k | +1% | Break-even |
| Extra-large | 10 | 54k | 62k | -13% | Delegation is cheaper |

Crossover point: ~5-7 units. Above that, delegation saves Claude tokens. Users may still choose delegation below the crossover for cost arbitrage (Codex tokens cheaper) or coding preference.
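
As a quick check on the Overhead column, the percentages can be reproduced from the rounded token counts. This is a minimal sketch; `overheadPercent` is a hypothetical helper, not something defined in the PR. The Large row shows +1% for 62k vs 62k, which presumably reflects the unrounded counts, so it is not reproduced here.

```typescript
// Relative Claude-token overhead of delegate mode vs standard mode, in percent.
// Positive means delegation used more tokens; negative means it used fewer.
// Hypothetical helper for illustration only.
function overheadPercent(delegateTokensK: number, standardTokensK: number): number {
  if (standardTokensK <= 0) {
    throw new Error("standard token count must be positive");
  }
  return Math.round(((delegateTokensK - standardTokensK) / standardTokensK) * 100);
}

// Rounded token counts (in thousands) taken from the table above.
const mediumOverhead = overheadPercent(54, 53);     // 2
const extraLargeOverhead = overheadPercent(54, 62); // -13
```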

Architecture

Reference file extraction (the biggest optimization)

Delegation content (~250 lines) was extracted from the SKILL.md body into references/codex-delegation-workflow.md, which is loaded on demand only when delegation is active. SKILL.md shrank from 776 to 514 lines -- a 34% body reduction saving ~15k Claude tokens per non-delegation run.

Batched execution model

Replaced per-unit delegation (N codex exec calls) with batched delegation (ceil(N/5) calls). All units in one batch for plans <=5 units; split at roughly 5 for larger plans. Reduced orchestration from O(N) to O(batches).
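
The batching rule can be sketched roughly as follows. The skill expresses this as prose instructions rather than code, so `planBatches` is a hypothetical helper, and splitting into near-equal batches is one assumed reading of "split at roughly 5".

```typescript
// Hypothetical helper: split plan units into ceil(N / maxPerBatch) batches
// of near-equal size, so no batch exceeds maxPerBatch.
function planBatches<T>(units: T[], maxPerBatch: number = 5): T[][] {
  if (!Number.isInteger(maxPerBatch) || maxPerBatch < 1) {
    throw new Error("maxPerBatch must be a positive integer");
  }
  if (units.length === 0) {
    return [];
  }
  const batchCount = Math.ceil(units.length / maxPerBatch);
  const baseSize = Math.floor(units.length / batchCount);
  const remainder = units.length % batchCount;

  const batches: T[][] = [];
  let start = 0;
  for (let i = 0; i < batchCount; i++) {
    // The first `remainder` batches take one extra unit.
    const size = baseSize + (i < remainder ? 1 : 0);
    batches.push(units.slice(start, start + size));
    start += size;
  }
  return batches;
}
```

Under this reading, a 7-unit plan becomes two batches (4 and 3 units) and a 10-unit plan becomes two batches of 5, so the number of codex exec calls grows with the batch count rather than the unit count.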

Codex owns verify

Codex runs tests and fixes failures within the delegation -- the orchestrator does not re-verify independently. Safety net: circuit breaker (3 consecutive failures -> standard mode fallback), plus Phase 3 full test suite before shipping.
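
The circuit breaker described above can be pictured as a small counter. This is an illustrative sketch, not the skill's implementation; the class name is hypothetical, and treating a partial result as resetting the counter is an assumption the PR does not state.

```typescript
type BatchResult = "success" | "partial" | "failure";

// Hypothetical sketch: trips after `threshold` consecutive failures.
class DelegationCircuitBreaker {
  private consecutiveFailures: number = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Record a batch result. Returns true when delegation should stop
  // and the run should fall back to standard mode.
  record(result: BatchResult): boolean {
    if (result === "failure") {
      this.consecutiveFailures += 1;
    } else {
      // Assumption: any non-failure result resets the streak.
      this.consecutiveFailures = 0;
    }
    return this.isOpen();
  }

  isOpen(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }
}
```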

flowchart TD
  A[ce:work-beta invoked with plan] --> B{delegation_active?}
  B -- no --> C[Standard execution]
  B -- yes --> D[Read delegation workflow reference]

  D --> E[Pre-delegation checks: platform, env, CLI, consent]
  E -- any fail --> C
  E -- all pass --> F{work_delegate_decision}

  F -- auto --> G[State plan and proceed]
  F -- ask --> H[Present recommendation, wait for choice]
  H -- Claude Code --> C
  H -- Codex --> G

  G --> I{All units trivial?}
  I -- yes --> C
  I -- no --> J[Batch units]

  J --> K[Write prompt with testing + verify sections]
  K --> L[codex exec per batch]
  L --> M{Result classification}
  M -- success --> N[Commit batch]
  M -- partial --> O[Finish locally, commit]
  M -- failure --> P[Rollback, increment circuit breaker]
  P --> Q{3 failures?}
  Q -- yes --> C
  Q -- no --> K

  N --> R{More batches?}
  R -- yes --> S[Report progress, continue]
  S --> K
  R -- no --> T[Cleanup scratch, Phase 3]
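
The routing half of the flowchart (everything before batching) reduces to a handful of guards. The sketch below is a hypothetical rendering of those gates; the type and field names are invented for illustration, and treating a missing user choice as "standard" is an assumption.

```typescript
type Route = "standard" | "delegate";

interface RoutingInput {
  delegationActive: boolean;
  // Pre-delegation checks named in the flowchart.
  preChecks: { platform: boolean; env: boolean; cli: boolean; consent: boolean };
  decision: "auto" | "ask";
  // Only consulted when decision is "ask".
  userChoice?: "claude-code" | "codex";
  allUnitsTrivial: boolean;
}

function routeExecution(input: RoutingInput): Route {
  if (!input.delegationActive) {
    return "standard";
  }
  const c = input.preChecks;
  if (!(c.platform && c.env && c.cli && c.consent)) {
    return "standard"; // any failed pre-check falls back
  }
  if (input.decision === "ask" && input.userChoice !== "codex") {
    return "standard"; // user picked Claude Code (or, by assumption, gave no answer)
  }
  if (input.allUnitsTrivial) {
    return "standard"; // nothing worth delegating
  }
  return "delegate";
}
```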

What Changed

plugins/compound-engineering/
  AGENTS.md                                  # refined shell-chaining rule
  compound-engineering.local.example.md      # NEW — settings template
  skills/
    ce-work-beta/
      SKILL.md                               # slimmed body (776 -> 514 lines)
      references/
        codex-delegation-workflow.md          # NEW — full delegation workflow
        swarm-mode.md                        # NEW — extracted from body
    ce-work/
      SKILL.md                               # stable, non-delegating
    ce-plan/
      SKILL.md                               # neutral, handoffs stay on ce:work

docs/
  brainstorms/
    2026-03-31-codex-delegation-requirements.md   # NEW — 19 requirements
  plans/
    2026-03-31-001-feat-codex-delegation-plan.md  # NEW — 6 implementation units
  solutions/
    best-practices/
      codex-delegation-best-practices-2026-04-01.md  # NEW — compounded learning
    skill-design/
      ce-work-beta-promotion-checklist-2026-03-31.md # NEW — promotion steps

tests/
  pipeline-review-contract.test.ts           # delegation contract coverage

Prompt Engineering for Delegation

The Codex prompt template includes:

  • <testing> section -- Test Scenario Completeness guidance (happy path, edge cases, error paths, integration). Improved Codex test output by ~35% on large plans.
  • <verify> section -- requires running ALL tests in a single command (not per-file) to catch cross-file contamination. Discovered in eval when mocked globalThis.fetch leaked between test files.
  • <constraints> -- no git commits, scoped changes, honest result reporting
  • <output_contract> -- structured result schema for classification
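
To make the template's shape concrete, here is a rough sketch of assembling XML-tagged sections into one prompt. The four tag names come from the list above; the `task` tag, the function name, and the section ordering are assumptions, and the real template lives in references/codex-delegation-workflow.md.

```typescript
interface PromptSections {
  task: string;           // hypothetical: the batch's unit descriptions
  testing: string;        // Test Scenario Completeness guidance
  verify: string;         // single-command test run requirement
  constraints: string;    // no commits, scoped changes, honest reporting
  outputContract: string; // structured result schema
}

// Hypothetical helper: wrap each section in its tag and join with blank lines.
function buildDelegationPrompt(s: PromptSections): string {
  const tagged: Array<[string, string]> = [
    ["task", s.task],
    ["testing", s.testing],
    ["verify", s.verify],
    ["constraints", s.constraints],
    ["output_contract", s.outputContract],
  ];
  return tagged
    .map(([tag, body]) => `<${tag}>\n${body.trim()}\n</${tag}>`)
    .join("\n\n");
}
```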

User Settings

# .claude/compound-engineering.local.md
---
work_delegate: codex              # codex | false (default: false)
work_delegate_consent: true       # true | false (default: false)
work_delegate_sandbox: yolo       # yolo | full-auto (default: yolo)
work_delegate_decision: auto      # auto | ask (default: auto)
---
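
One way to read these settings is as a normalization step over already-parsed frontmatter. The sketch below is hypothetical (the skill describes this in prose, and `normalizeSettings` is an invented name). It assumes the YAML has been parsed into a plain object, and it applies the documented defaults to any unrecognized value, matching the fall-through behavior described in a later commit on this PR.

```typescript
interface DelegationSettings {
  work_delegate: "codex" | false;
  work_delegate_consent: boolean;
  work_delegate_sandbox: "yolo" | "full-auto";
  work_delegate_decision: "auto" | "ask";
}

// Defaults as documented in the settings block above.
const DEFAULT_SETTINGS: DelegationSettings = {
  work_delegate: false,
  work_delegate_consent: false,
  work_delegate_sandbox: "yolo",
  work_delegate_decision: "auto",
};

// Hypothetical helper: unrecognized or missing values fall through to defaults.
function normalizeSettings(raw: Record<string, unknown>): DelegationSettings {
  return {
    work_delegate:
      raw["work_delegate"] === "codex" ? "codex" : DEFAULT_SETTINGS.work_delegate,
    work_delegate_consent:
      raw["work_delegate_consent"] === true ? true : DEFAULT_SETTINGS.work_delegate_consent,
    work_delegate_sandbox:
      raw["work_delegate_sandbox"] === "full-auto" ? "full-auto" : DEFAULT_SETTINGS.work_delegate_sandbox,
    work_delegate_decision:
      raw["work_delegate_decision"] === "ask" ? "ask" : DEFAULT_SETTINGS.work_delegate_decision,
  };
}
```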

Promotion Follow-Up

When ready to promote beta -> stable:

  • Copy validated implementation from ce:work-beta into ce:work
  • Remove beta-only manual-invocation caveats
  • Update planner and workflow handoffs atomically
  • Move contract tests from beta surface to stable surface
  • Retire or redirect ce:work-beta

See docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.md for detailed steps.

Test plan

  • bun test tests/pipeline-review-contract.test.ts -- 24 pass, delegation contract verified
  • bun test -- 547 pass (1 pre-existing path sanitization failure)
  • 6 iterations of eval: routing correctness, code correctness, token measurement, circuit breaker fallback
  • Codex delegation end-to-end: prompt generation -> codex exec -> result classification -> commit (verified with real codex CLI)

Post-Deploy Monitoring & Validation

No additional operational monitoring required. Changes are limited to skill contracts, reference files, planning docs, and test coverage in the plugin repo. The beta skill requires manual invocation (ce:work-beta) and does not affect the stable ce:work path.


Compound Engineering v2.60.0
🤖 Generated with Claude Opus 4.6 (1M context, extended thinking) via Claude Code

tmchow and others added 2 commits March 31, 2026 19:57
Adds optional `delegate:codex` mode to ce:work that delegates code
implementation to the Codex CLI (`codex exec`) using concrete bash
templates. Replaces ce-work-beta's prose-based delegation which caused
non-deterministic CLI invocations.

Key additions:
- Argument parsing with `delegate:codex`/`delegate:local` tokens and
  resolution chain (argument > local.md > default off)
- Pre-delegation gates: environment guard, availability check, one-time
  consent flow with sandbox mode selection (yolo/full-auto)
- XML-tagged prompt template following gpt-5-4-prompting best practices
- Multi-signal result classification (CLI fail/task fail/partial/verify
  fail/success) with rollback-to-HEAD safety
- Circuit breaker: 3 consecutive failures -> standard mode fallback
- Serial execution enforced, swarm mode mutual exclusion
- Frontend Design Guidance ported from ce-work-beta
- ce-work-beta delegation section marked superseded
- `Execution target: external-delegate` removed from ce:plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
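
The resolution chain from the commit message above (argument > local.md > default off) might look like this in code. It is a sketch only: `resolveDelegate` is a hypothetical name, and letting the last token win when both tokens appear is an assumption.

```typescript
type DelegateTarget = "codex" | "off";

// Hypothetical helper: an explicit argument token beats the local.md setting,
// which beats the default (off).
function resolveDelegate(args: string[], localSetting: unknown): DelegateTarget {
  let fromArgs: DelegateTarget | undefined;
  for (const arg of args) {
    if (arg === "delegate:codex") {
      fromArgs = "codex";
    } else if (arg === "delegate:local") {
      fromArgs = "off";
    }
  }
  if (fromArgs !== undefined) {
    return fromArgs; // 1. argument
  }
  if (localSetting === "codex") {
    return "codex"; // 2. local.md
  }
  return "off"; // 3. default
}
```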
1. Remove stale "external delegation" per-unit posture from ce:plan
   Execution note examples — ce:work reads delegation from the global
   resolution chain, not unit metadata.

2. Fix delegation fallback to re-enter standard strategy selection.
   Pre-delegation checks now run inside the routing gate before strategy
   choice, so disabling delegation falls through to the normal
   inline/serial/parallel table instead of silently defaulting to inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4304970c34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tmchow marked this pull request as draft April 1, 2026 03:24
Restore stable ce:work as the non-delegating execution path during the
Codex beta rollout and move the active delegation contract back to
ce:work-beta.

Also add a promotion checklist doc covering the workflow and contract
changes required when ce:work-beta is later promoted to stable.
tmchow changed the title from "feat(ce-work): add Codex delegation mode" to "feat(ce-work-beta): add beta Codex delegation mode" Apr 1, 2026
tmchow (Collaborator, Author) commented Apr 1, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3f7069e77


tmchow and others added 2 commits March 31, 2026 23:21
…acked files

Preflight now uses `git diff --quiet HEAD` instead of `git status --short`
so untracked workspace dirs and .context/ scratch don't falsely block
delegation. Rollback uses path-scoped `git clean -fd -- <unit files>`
instead of bare `git clean -fd` which would nuke all untracked files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
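
The two git invocations named in this commit can be expressed as argv builders. This is a sketch under assumptions: the function names are hypothetical, and the empty-list guard is an added safety measure, not something the PR describes. It is there because `git clean -fd --` with no paths would clean every untracked file, which is exactly what the change avoids.

```typescript
// Preflight: `git diff --quiet HEAD` exits non-zero only when tracked files
// differ from HEAD, so untracked directories do not block delegation.
function preflightCommand(): string[] {
  return ["git", "diff", "--quiet", "HEAD"];
}

// Rollback cleanup scoped to the files a unit was allowed to touch.
function scopedCleanCommand(unitFiles: string[]): string[] {
  if (unitFiles.length === 0) {
    // Assumed guard: never emit an unscoped clean.
    throw new Error("refusing to build an unscoped git clean");
  }
  return ["git", "clean", "-fd", "--", ...unitFiles];
}
```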
…antics

The old `test -n || test -n` returned exit code 1 on the happy path
(neither var set), which a literal agent could misread as a failed
pre-check and disable delegation in eligible environments.

Rewrote as an explicit if/else so pass/fail lives in the variable
value, not the exit code. Also refined the AGENTS.md shell-chaining
rule to distinguish action chaining (bad) from boolean conditions
in if/while guards (fine).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tmchow (Collaborator, Author) commented Apr 1, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3f7069e77


tmchow and others added 15 commits April 1, 2026 12:42
Moves ~270 lines of delegation workflow (pre-checks, prompt template,
execution loop, result classification) to references/codex-delegation-workflow.md
and ~25 lines of swarm mode to references/swarm-mode.md. SKILL.md body
drops from ~776 to ~514 lines — a 34% reduction in per-tool-call context
cost for non-delegation runs.

New in the delegation reference:
- Batched execution model (all units in one batch, split at ~5)
- Codex owns VERIFY (test-fix loop inside delegation)
- Platform gate (Claude Code only)
- Run-ID namespaced scratch files for concurrent safety
- work_delegation_decision setting (auto/ask) with user-facing prompts
- Between-batch checkpoints (flow through by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provides a discoverable template with all available settings
(work_delegate, work_codex_consent, work_codex_sandbox,
work_delegation_decision) so users can copy to
.claude/compound-engineering.local.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… prompt

Adds a <testing> section to the delegation prompt template that carries
the Test Scenario Completeness guidance to Codex (cover happy path, edge
cases, error paths, integration). Closes the test quality gap observed in
evals (Codex produced 57-85% as many tests without this guidance).

Also updates <verify> to require running ALL test files in a single
command rather than per-file — catches cross-file contamination like
mocked globals leaking between test files in the same bun process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compounded learning from 6 iterations of delegation evals covering
token crossover points, batching strategy, prompt engineering, skill
body size as multiplicative cost driver, and user choice considerations.

Key findings: delegation breaks even at ~5-7 units and becomes cheaper
at 10+. Skill body size dominates cost (multiplicative across all tool
calls). Extract conditional content >50 lines to reference files.

Also fixes verify section line break in contract test assertion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents how plan structure directly enables delegation decisions —
file lists enable batching rules, test scenarios feed Codex prompts,
verification commands enable Codex's self-check loop. Delegation works
with unstructured plans but makes conservative choices without signals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… net

Expands the "delegate verify to Codex" pattern with the reasoning:
trust the delegate's self-report, protect against systematic failure
with the circuit breaker (3 consecutive failures -> standard mode),
and verify the whole at Phase 3 before shipping. Three layered catches
replace the redundant per-batch orchestrator verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The doc covers economics, architecture, prompt engineering, plan quality,
safety model, and user choice — not just economics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds actual token counts from all iterations (not just percentages),
wall clock time comparison, test coverage cost, and the iteration
evolution table showing the body-size regression and recovery. The
economics section now tells the complete story with raw numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames for consistency across all delegation settings:
- work_codex_consent  -> work_delegate_consent
- work_codex_sandbox  -> work_delegate_sandbox
- work_delegation_decision -> work_delegate_decision

All four settings now share the work_delegate_* prefix:
  work_delegate, work_delegate_consent,
  work_delegate_sandbox, work_delegate_decision

Updated across: SKILL.md, delegation workflow reference, example
local.md, best practices doc, requirements, plan, and contract tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unrecognized setting values (e.g., work_delegate: gemini) now fall
through to hard defaults instead of producing undefined behavior.
Each setting documents its recognized values inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shorter, more natural. Users think in terms of "codex mode" not
"delegate to codex." The mode: prefix is generic enough for future
delegates. Deactivation becomes mode:local.

Fuzzy activation phrases unchanged (use codex, codex mode, etc.).

Updated across: SKILL.md, contract tests, requirements, and plan docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds work_delegate_model (default: gpt-5.4) and work_delegate_effort
(default: high) to local.md settings. Both are passed explicitly to
codex exec via -m and -c 'model_reasoning_effort="..."' flags.

Model is passthrough (any valid codex model name). Effort is validated
against 5 values: minimal, low, medium, high, xhigh. Invalid values
fall through to defaults.

Also fixes --yolo to --dangerously-bypass-approvals-and-sandbox (the
documented flag name) and adds quoting guidance for the -c flag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
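
A rough sketch of how the model and effort settings could map onto codex exec arguments. The `-m`, `-c model_reasoning_effort="..."`, and `--dangerously-bypass-approvals-and-sandbox` flags and the defaults come from the commit message above. Mapping the full-auto sandbox setting to a `--full-auto` flag and passing the prompt as a positional argument are assumptions, and the function names are hypothetical.

```typescript
const EFFORTS = ["minimal", "low", "medium", "high", "xhigh"] as const;
type Effort = (typeof EFFORTS)[number];

// Invalid effort values fall through to the default ("high").
function resolveEffort(value: unknown): Effort {
  return EFFORTS.find((e) => e === value) ?? "high";
}

interface CodexExecOptions {
  model?: string;                 // passthrough: any model name codex accepts
  effort?: unknown;               // validated against EFFORTS
  sandbox?: "yolo" | "full-auto";
  prompt: string;
}

// Hypothetical helper: returns the arguments that follow the `codex` binary.
function buildCodexExecArgs(opts: CodexExecOptions): string[] {
  const model = opts.model && opts.model.trim() !== "" ? opts.model : "gpt-5.4";
  const effort = resolveEffort(opts.effort);
  const sandboxFlag =
    opts.sandbox === "full-auto"
      ? "--full-auto" // assumption: flag name for the full-auto setting
      : "--dangerously-bypass-approvals-and-sandbox";
  return ["exec", "-m", model, "-c", `model_reasoning_effort="${effort}"`, sandboxFlag, opts.prompt];
}
```

Building an argv array rather than a shell string sidesteps the nested-quote problem for the `-c` value; in a bash template the same value needs the single-quoted form shown in the commit message.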
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex exec

Launches codex exec with run_in_background (no timeout ceiling) then
polls every 10 seconds in a foreground bash loop to keep the agent's
turn active. User sees "Waiting for Codex..." during execution and
cannot interfere with the working tree.

Fixes the 10-minute Bash timeout ceiling that would kill long-running
batches where Codex is iterating on test fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
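
The launch-then-poll pattern in this commit is implemented as a bash loop inside the skill. The TypeScript below is only an analogue of the idea: check for completion, wait a fixed interval, repeat, with no overall timeout. `waitForCompletion` and its injected `isDone` and `sleep` callbacks are hypothetical.

```typescript
// Poll until isDone() reports true, sleeping intervalMs between checks.
// There is deliberately no timeout ceiling. Returns the number of sleeps taken.
async function waitForCompletion(
  isDone: () => boolean | Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
  intervalMs: number = 10000,
): Promise<number> {
  let polls = 0;
  while (!(await isDone())) {
    polls += 1;
    await sleep(intervalMs);
  }
  return polls;
}
```

In real use, `sleep` would wrap a timer and `isDone` would check for a result file or a finished process; both are injected here so the loop can be exercised without waiting.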