feat(ce-work-beta): add beta Codex delegation mode#476
Conversation
Adds optional `delegate:codex` mode to ce:work that delegates code implementation to the Codex CLI (`codex exec`) using concrete bash templates. Replaces ce-work-beta's prose-based delegation which caused non-deterministic CLI invocations. Key additions: - Argument parsing with `delegate:codex`/`delegate:local` tokens and resolution chain (argument > local.md > default off) - Pre-delegation gates: environment guard, availability check, one-time consent flow with sandbox mode selection (yolo/full-auto) - XML-tagged prompt template following gpt-5-4-prompting best practices - Multi-signal result classification (CLI fail/task fail/partial/verify fail/success) with rollback-to-HEAD safety - Circuit breaker: 3 consecutive failures -> standard mode fallback - Serial execution enforced, swarm mode mutual exclusion - Frontend Design Guidance ported from ce-work-beta - ce-work-beta delegation section marked superseded - `Execution target: external-delegate` removed from ce:plan Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Remove stale "external delegation" per-unit posture from ce:plan Execution note examples — ce:work reads delegation from the global resolution chain, not unit metadata. 2. Fix delegation fallback to re-enter standard strategy selection. Pre-delegation checks now run inside the routing gate before strategy choice, so disabling delegation falls through to the normal inline/serial/parallel table instead of silently defaulting to inline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4304970c34
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Restore stable ce:work as the non-delegating execution path during the Codex beta rollout and move the active delegation contract back to ce:work-beta. Also add a promotion checklist doc covering the workflow and contract changes required when ce:work-beta is later promoted to stable.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3f7069e77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…acked files Preflight now uses `git diff --quiet HEAD` instead of `git status --short` so untracked workspace dirs and .context/ scratch don't falsely block delegation. Rollback uses path-scoped `git clean -fd -- <unit files>` instead of bare `git clean -fd` which would nuke all untracked files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…antics The old `test -n || test -n` returned exit code 1 on the happy path (neither var set), which a literal agent could misread as a failed pre-check and disable delegation in eligible environments. Rewrote as an explicit if/else so pass/fail lives in the variable value, not the exit code. Also refined the AGENTS.md shell-chaining rule to distinguish action chaining (bad) from boolean conditions in if/while guards (fine). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3f7069e77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Moves ~270 lines of delegation workflow (pre-checks, prompt template, execution loop, result classification) to references/codex-delegation-workflow.md and ~25 lines of swarm mode to references/swarm-mode.md. SKILL.md body drops from ~776 to ~514 lines — a 34% reduction in per-tool-call context cost for non-delegation runs. New in the delegation reference: - Batched execution model (all units in one batch, split at ~5) - Codex owns VERIFY (test-fix loop inside delegation) - Platform gate (Claude Code only) - Run-ID namespaced scratch files for concurrent safety - work_delegation_decision setting (auto/ask) with user-facing prompts - Between-batch checkpoints (flow through by default) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provides a discoverable template with all available settings (work_delegate, work_codex_consent, work_codex_sandbox, work_delegation_decision) so users can copy to .claude/compound-engineering.local.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… prompt Adds a <testing> section to the delegation prompt template that carries the Test Scenario Completeness guidance to Codex (cover happy path, edge cases, error paths, integration). Closes the test quality gap observed in evals (Codex produced 57-85% as many tests without this guidance). Also updates <verify> to require running ALL test files in a single command rather than per-file — catches cross-file contamination like mocked globals leaking between test files in the same bun process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compounded learning from 6 iterations of delegation evals covering token crossover points, batching strategy, prompt engineering, skill body size as multiplicative cost driver, and user choice considerations. Key findings: delegation breaks even at ~5-7 units and becomes cheaper at 10+. Skill body size dominates cost (multiplicative across all tool calls). Extract conditional content >50 lines to reference files. Also fixes verify section line break in contract test assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents how plan structure directly enables delegation decisions — file lists enable batching rules, test scenarios feed Codex prompts, verification commands enable Codex's self-check loop. Delegation works with unstructured plans but makes conservative choices without signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… net Expands the "delegate verify to Codex" pattern with the reasoning: trust the delegate's self-report, protect against systematic failure with the circuit breaker (3 consecutive failures -> standard mode), and verify the whole at Phase 3 before shipping. Three layered catches replace the redundant per-batch orchestrator verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The doc covers economics, architecture, prompt engineering, plan quality, safety model, and user choice — not just economics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds actual token counts from all iterations (not just percentages), wall clock time comparison, test coverage cost, and the iteration evolution table showing the body-size regression and recovery. The economics section now tells the complete story with raw numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames for consistency across all delegation settings: - work_codex_consent -> work_delegate_consent - work_codex_sandbox -> work_delegate_sandbox - work_delegation_decision -> work_delegate_decision All four settings now share the work_delegate_* prefix: work_delegate, work_delegate_consent, work_delegate_sandbox, work_delegate_decision Updated across: SKILL.md, delegation workflow reference, example local.md, best practices doc, requirements, plan, and contract tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unrecognized setting values (e.g., work_delegate: gemini) now fall through to hard defaults instead of producing undefined behavior. Each setting documents its recognized values inline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shorter, more natural. Users think in terms of "codex mode" not "delegate to codex." The mode: prefix is generic enough for future delegates. Deactivation becomes mode:local. Fuzzy activation phrases unchanged (use codex, codex mode, etc.). Updated across: SKILL.md, contract tests, requirements, and plan docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit 4c5da21.
Adds work_delegate_model (default: gpt-5.4) and work_delegate_effort (default: high) to local.md settings. Both are passed explicitly to codex exec via -m and -c 'model_reasoning_effort="..."' flags. Model is passthrough (any valid codex model name). Effort is validated against 5 values: minimal, low, medium, high, xhigh. Invalid values fall through to defaults. Also fixes --yolo to --dangerously-bypass-approvals-and-sandbox (the documented flag name) and adds quoting guidance for the -c flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex exec Launches codex exec with run_in_background (no timeout ceiling) then polls every 10 seconds in a foreground bash loop to keep the agent's turn active. User sees "Waiting for Codex..." during execution and cannot interfere with the working tree. Fixes the 10-minute Bash timeout ceiling that would kill long-running batches where Codex is iterating on test fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds Codex delegation as a beta-only execution path in
ce:work-beta, validated through 6 iterations of evaluation comparing delegation vs standard mode across small, medium, large, and extra-large plans. The final architecture uses batched delegation, reference file extraction for token efficiency, and a<testing>prompt section that closes the test quality gap.Credit to @mvanhorn for the original push on delegation and @huntharo for permissioning thought partnership. This branch supersedes the earlier direction from #364 and #365, now implemented as a beta rollout.
How to Test (Pre-Merge)
This is a beta skill — it requires manual invocation and does not affect the stable
ce:workpath.Option A: Point at a local checkout (simplest)
git clone https://github.com/EveryInc/compound-engineering-plugin.git cd compound-engineering-plugin git checkout feat/codex-delegation-workThen from any project directory:
Option B: Use the plugin-path CLI (keeps your checkout clean)
cd compound-engineering-plugin bun install bun run src/index.ts plugin-path compound-engineering --branch feat/codex-delegation-workThis outputs a
--plugin-dirpath. On re-run it pulls the latest from the remote branch.Try it out:
On first delegation, the skill prompts for one-time consent and sandbox mode selection.
Optional:
--dangerously-skip-permissionsreduces permission prompts during delegation (codex exec, git operations). Not required.Prerequisites for delegation: Codex CLI installed and on PATH (
npm install -g @openai/codex).Key Findings from Evaluation
6 iterations of eval with real code implementation in isolated worktrees:
Crossover point: ~5-7 units. Above that, delegation saves Claude tokens. Users may still choose delegation below the crossover for cost arbitrage (Codex tokens cheaper) or coding preference.
Architecture
Reference file extraction (the biggest optimization)
Delegation content (~250 lines) extracted from SKILL.md body to
references/codex-delegation-workflow.md, loaded on demand only when delegation is active. SKILL.md shrank from 776 to 514 lines -- a 34% body reduction saving ~15k Claude tokens per non-delegation run.Batched execution model
Replaced per-unit delegation (N codex exec calls) with batched delegation (ceil(N/5) calls). All units in one batch for plans <=5 units; split at roughly 5 for larger plans. Reduced orchestration from O(N) to O(batches).
Codex owns verify
Codex runs tests and fixes failures within the delegation -- the orchestrator does not re-verify independently. Safety net: circuit breaker (3 consecutive failures -> standard mode fallback), plus Phase 3 full test suite before shipping.
flowchart TD A[ce:work-beta invoked with plan] --> B{delegation_active?} B -- no --> C[Standard execution] B -- yes --> D[Read delegation workflow reference] D --> E[Pre-delegation checks: platform, env, CLI, consent] E -- any fail --> C E -- all pass --> F{work_delegate_decision} F -- auto --> G[State plan and proceed] F -- ask --> H[Present recommendation, wait for choice] H -- Claude Code --> C H -- Codex --> G G --> I{All units trivial?} I -- yes --> C I -- no --> J[Batch units] J --> K[Write prompt with testing + verify sections] K --> L[codex exec per batch] L --> M{Result classification} M -- success --> N[Commit batch] M -- partial --> O[Finish locally, commit] M -- failure --> P[Rollback, increment circuit breaker] P --> Q{3 failures?} Q -- yes --> C Q -- no --> K N --> R{More batches?} R -- yes --> S[Report progress, continue] S --> K R -- no --> T[Cleanup scratch, Phase 3]What Changed
Prompt Engineering for Delegation
The Codex prompt template includes:
<testing>section -- Test Scenario Completeness guidance (happy path, edge cases, error paths, integration). Improved Codex test output by ~35% on large plans.<verify>section -- requires running ALL tests in a single command (not per-file) to catch cross-file contamination. Discovered in eval when mockedglobalThis.fetchleaked between test files.<constraints>-- no git commits, scoped changes, honest result reporting<output_contract>-- structured result schema for classificationUser Settings
Promotion Follow-Up
When ready to promote beta -> stable:
ce:work-betaintoce:workce:work-betaSee
docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.mdfor detailed steps.Test plan
bun test tests/pipeline-review-contract.test.ts-- 24 pass, delegation contract verifiedbun test-- 547 pass (1 pre-existing path sanitization failure)Post-Deploy Monitoring & Validation
No additional operational monitoring required. Changes are limited to skill contracts, reference files, planning docs, and test coverage in the plugin repo. The beta skill requires manual invocation (
ce:work-beta) and does not affect the stablece:workpath.🤖 Generated with Claude Opus 4.6 (1M context, extended thinking) via Claude Code