v3.2: Hardened evals — brownfield init, coverage gate, state-machine enforcement#3
…chine enforcement

Round-2 evals hardening. Additive; no breaking changes.

## evals/evals.json (4 → 8 cases)

New test cases:

- brownfield-init-01: reading existing src/user_service.py and generating a base SPEC.yaml from function signatures. Asserts no existing code is modified and requirements reflect all three functions (create_user, get_user, delete_user).
- coverage-gate-01: .tdd-state.json shows coverage 65% (below 80% threshold); skill must refuse to mark cycle complete. Asserts does_not_contain "marking phase as complete" and contains_any "below threshold" / "add more tests".
- state-machine-skip-red-01: request to skip RED phase and go straight to implementation. Skill must refuse and enforce RED-first rule. Asserts the response does NOT write src/ files without a red_evidence record.
- green-without-red-evidence-01: broken .tdd-state.json with red_evidence:null but claiming phase is green. Skill must refuse to advance to REFACTOR and demand actual failing test output first.

## Assertion mix

Shifted toward determinism. Introduced yaml_path_equals / yaml_path_exists for SPEC.yaml field verification. ~75% deterministic (file_exists, json_path_*, yaml_path_*, contains_*) vs 25% llm_judge (was ~50/50 in v3.1).

## README.md

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 section, Working Examples callouts for the existing pdf-ocr-skill (GOLD) and tdd-demo (minimal), expanded changelog.

## Verification

All 8 eval cases parse cleanly. Existing pdf-ocr-skill example (unchanged) remains the canonical reference for GOLD brownfield flow.
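The brownfield-init case can be pictured with a minimal sketch that reads function signatures without touching the source file. This is an illustration only, not the skill's actual implementation: the `SOURCE` string is a hypothetical stand-in for `src/user_service.py` (its real contents and argument names are not shown in this PR), and the `REQ-NNN` requirement shape is an assumed SPEC layout.

```python
import ast

# Hypothetical stand-in for src/user_service.py; only the three function
# names come from the PR description, the arguments are invented.
SOURCE = '''
def create_user(name, email):
    pass

def get_user(user_id):
    pass

def delete_user(user_id):
    pass
'''


def extract_signatures(source: str) -> list:
    """Return the name and positional argument names of each top-level function."""
    tree = ast.parse(source)  # parsing only; the source is never rewritten
    return [
        {"name": node.name, "args": [a.arg for a in node.args.args]}
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]


def draft_requirements(source: str) -> list:
    """Build one placeholder requirement per function (assumed SPEC shape)."""
    return [
        {"id": f"REQ-{i:03d}", "function": sig["name"], "args": sig["args"]}
        for i, sig in enumerate(extract_signatures(source), start=1)
    ]


for requirement in draft_requirements(SOURCE):
    print(requirement)
```

Because the sketch only parses the text, it matches the case's "no existing code is modified" assertion by construction; serializing the result to SPEC.yaml is left out to keep the example dependency-free.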
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efae90d1c8
    {
      "text": "No test files created for the rename",
      "type": "does_not_contain",
      "values": ["RED phase", "writing tests for counter", "test_counter"]
Remove `RED phase` from forbidden strings in negative-01
This does_not_contain check will fail valid answers that explicitly refuse TDD for a trivial rename using wording like “no RED phase is needed.” Because the assertion bans the exact phrase RED phase, compliant responses are scored as failures, which corrupts this eval’s signal for the intended behavior.
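The false negative described here is easy to reproduce. The sketch below assumes `does_not_contain` is plain case-sensitive substring matching; the function body is a guess at that behavior, not the harness's actual code, and the refusal sentence is an invented example.

```python
def does_not_contain(response: str, values: list) -> bool:
    """Pass only if none of the forbidden values occurs as a substring.

    Assumed semantics: case-sensitive, no tokenization, no negation handling.
    """
    return not any(value in response for value in values)


forbidden = ["RED phase", "writing tests for counter", "test_counter"]

# A compliant refusal that happens to mention the phase inside a negation.
refusal = "This is a trivial rename, so no RED phase is needed."

print(does_not_contain(refusal, forbidden))      # False: the refusal is scored as a failure
print(does_not_contain(refusal, forbidden[1:]))  # True once "RED phase" is dropped
```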
    {
      "text": "Does not transition to complete",
      "type": "does_not_contain",
      "values": ["marking phase as complete", "cycle complete", "phase complete"]
Allow negated `cycle complete` phrasing in coverage-gate-01
The forbidden phrase list includes cycle complete, but a correct refusal commonly says “you cannot mark this cycle complete.” With substring matching, that compliant refusal fails this assertion even though it enforces the 80% gate, so the test can systematically produce false negatives.
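For context, the gate behavior this eval exercises might look like the sketch below. The 80% threshold and the 65% reading come from the PR description; the `coverage` field name inside `.tdd-state.json` and the wording of the messages are assumptions made for illustration.

```python
import json

COVERAGE_THRESHOLD = 80  # percent, per the PR description


def can_complete_cycle(state_json: str, threshold: int = COVERAGE_THRESHOLD):
    """Return (allowed, message) for a request to mark the cycle complete."""
    state = json.loads(state_json)
    coverage = state["coverage"]  # hypothetical field name
    if coverage < threshold:
        return False, (
            f"Coverage is {coverage}%, below threshold ({threshold}%). "
            "You cannot mark this cycle complete; add more tests."
        )
    return True, "Coverage gate passed."


allowed, message = can_complete_cycle('{"phase": "green", "coverage": 65}')
print(allowed)
print(message)
```

Note that the refusal message itself contains the substring "cycle complete", which is exactly the wording the original forbidden list would have penalized.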
    {
      "text": "Refuses to advance to REFACTOR",
      "type": "does_not_contain",
      "values": ["advancing to REFACTOR", "transitioning phase to refactor", "starting refactor phase"]
Permit negated `advance to REFACTOR` wording in refusal check
This refusal assertion bans phrases like advancing to REFACTOR, which are also likely to appear in correct negative responses (for example, “not advancing to REFACTOR”). As written, the check can mark valid guardrail behavior as failing, weakening the reliability of the new state-machine eval.
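The guardrail under test can be sketched as a small transition check. This is a hedged illustration, not the skill's state machine: the phase order, the `phase` and `red_evidence` keys, and the refusal strings are assumptions based on the eval descriptions in this PR.

```python
from typing import Optional

PHASES = ("red", "green", "refactor")  # assumed phase order


def check_transition(state: dict, target: str) -> Optional[str]:
    """Return a refusal message, or None if the transition is allowed."""
    if target not in PHASES:
        return f"Unknown phase: {target}"
    # RED-first rule: nothing past RED is reachable without recorded failing output.
    if target != "red" and not state.get("red_evidence"):
        return "Not advancing: red_evidence is missing; show failing test output first."
    current = state.get("phase")
    if current in PHASES and PHASES.index(target) - PHASES.index(current) > 1:
        return f"Not advancing: cannot jump from {current} to {target}."
    return None


# The broken state from green-without-red-evidence-01.
broken = {"phase": "green", "red_evidence": None}
print(check_transition(broken, "refactor"))
```

A refusal phrased this way contains neither "transitioning phase to refactor" nor "starting refactor phase", so it would pass a forbidden list limited to those two action phrases.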
…on negated refusals

Addresses Codex review on PR #3. Three eval assertions used substring matching against phrases that compliant refusals legitimately contain inside a negation, turning correct behavior into failing tests.

## P1: negative-01 (evals.json:125)

Removed "RED phase" from forbidden strings. A valid refusal to apply TDD on a trivial rename may say "no RED phase is needed", which was previously flagged as a failure. Kept the more specific forbiddens ("writing tests for counter", "test_counter"), which still reject actual TDD-style output.

## P1: coverage-gate-01 (evals.json:175)

Removed "cycle complete" and "phase complete" (both fragments that appear in correct refusals like "you cannot mark this cycle complete" or "refuse to mark phase complete"). Replaced "phase complete" with the more intention-specific "transitioning to complete", which only appears when the agent is actually trying to advance state, not when refusing to.

## P1: green-without-red-evidence-01 (evals.json:214)

Removed "advancing to REFACTOR", which appears in valid negations like "not advancing to REFACTOR" or "refusing to advance to REFACTOR". Kept "transitioning phase to refactor" and "starting refactor phase", which are unambiguous agent-action phrasings unlikely to appear in refusals.

## Why not use regex negation or more sophisticated logic?

Per SKILL.md's assertion_types_supported, does_not_contain is a simple substring-matching primitive intended to catch egregious output. Compound negation logic belongs in llm_judge. These evals already have a parallel contains_any / llm_judge assertion that captures the positive refusal signal; the does_not_contain is the belt-and-braces complement, and narrowing its scope to unambiguous action phrases is the right fix.
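The pairing described above, a narrowed `does_not_contain` alongside a positive `contains_any`, can be sketched as follows for coverage-gate-01. The phrase lists are taken from this PR; the two helper functions assume plain case-sensitive substring semantics, and both sample responses are invented for illustration.

```python
def does_not_contain(response: str, values: list) -> bool:
    """Pass if no forbidden value occurs as a substring (assumed semantics)."""
    return not any(value in response for value in values)


def contains_any(response: str, values: list) -> bool:
    """Pass if at least one expected value occurs as a substring (assumed semantics)."""
    return any(value in response for value in values)


# coverage-gate-01 after the fix: only unambiguous action phrases are forbidden.
FORBIDDEN = ["marking phase as complete", "transitioning to complete"]
EXPECTED = ["below threshold", "add more tests"]


def passes(response: str) -> bool:
    return does_not_contain(response, FORBIDDEN) and contains_any(response, EXPECTED)


refusal = "Coverage is 65%, below threshold. You cannot mark this cycle complete."
violation = "Coverage looks good enough, transitioning to complete."

print(passes(refusal))    # True: the negated "cycle complete" no longer trips the check
print(passes(violation))  # False: the action phrase is still caught
```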
Summary
Round-2 evals hardening. Additive; no breaking changes. The existing `examples/pdf-ocr-skill/` (GOLD) and `examples/tdd-demo/` remain canonical — no new examples added because the existing ones already cover the spectrum.
Changes
`evals/evals.json` (4 → 8 cases)
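Four new test cases target distinct failure modes: brownfield-init-01, coverage-gate-01, state-machine-skip-red-01, and green-without-red-evidence-01.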
Assertion mix
~75% of assertions are now deterministic (was ~50% in v3.1). Introduced `yaml_path_equals` / `yaml_path_exists` for SPEC.yaml field verification. The shift reduces regressions caused by a lenient judge.
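A deterministic path assertion of this kind might work roughly as sketched below. Everything here is an assumption for illustration: the dotted-path syntax, the helper names, and the sample SPEC contents are hypothetical, and the sketch operates on an already-parsed mapping so that it needs no YAML library.

```python
from typing import Any

_MISSING = object()  # sentinel so that a stored None still counts as "exists"


def resolve_path(doc: Any, path: str) -> Any:
    """Walk a dotted path such as 'requirements.0.function' through dicts and lists."""
    current = doc
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return _MISSING
    return current


def yaml_path_exists(doc: Any, path: str) -> bool:
    return resolve_path(doc, path) is not _MISSING


def yaml_path_equals(doc: Any, path: str, expected: Any) -> bool:
    value = resolve_path(doc, path)
    return value is not _MISSING and value == expected


# Hypothetical parsed SPEC.yaml content.
spec = {"requirements": [{"id": "REQ-001", "function": "create_user"}]}

print(yaml_path_exists(spec, "requirements.0.id"))
print(yaml_path_equals(spec, "requirements.0.function", "create_user"))
```

Checks like these either match or do not, which is what makes them less sensitive to judge behavior than an `llm_judge` assertion.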
`README.md`
Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 table, Working Examples callouts distinguishing pdf-ocr-skill (GOLD) and tdd-demo (minimal), expanded changelog.
Why no new examples?
The existing `examples/pdf-ocr-skill/` is already GOLD (brownfield, 10+ files, full SPEC + implementation + tests). Adding redundant examples would be padding. Brainstorm verdict was "don't add examples without documented drift" and no drift was observed.
Test plan
Part of Round-2 wave
Companion PRs:
🤖 Generated with Claude Code