v3.2: Hardened evals — brownfield init, coverage gate, state-machine enforcement by Charpup · Pull Request #3 · Charpup/openclaw-tdd-sdd-skill

Charpup · 2026-04-18T07:34:27Z

Summary

Round-2 evals hardening. Additive; no breaking changes. The existing `examples/pdf-ocr-skill/` (GOLD) and `examples/tdd-demo/` remain canonical — no new examples added because the existing ones already cover the spectrum.

Changes

`evals/evals.json` (4 → 8 cases)

Four new test cases target distinct failure modes:

| ID | What it catches |
| --- | --- |
| `brownfield-init-01` | Generates base SPEC.yaml from existing `src/user_service.py` signatures without modifying the existing code |
| `coverage-gate-01` | Refuses to mark cycle complete when `.tdd-state.json` coverage is 65% (< 80% threshold). Asserts the skill does NOT transition phase to "complete" |
| `state-machine-skip-red-01` | Refuses to skip the RED phase; enforces "write tests first, see them fail, record evidence". Asserts no src/ implementation written before tests |
| `green-without-red-evidence-01` | Broken `.tdd-state.json` with `red_evidence: null` but phase="green". Skill must refuse to advance to REFACTOR |
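
The two state-file cases above can be read as transition rules. The following is a minimal sketch of such rules, not the skill's actual implementation: the function name `transition_allowed` and the `coverage` key are assumptions for illustration, while `phase`, `red_evidence`, and the 80% threshold come from the table.

```python
import json

COVERAGE_THRESHOLD = 80.0  # percent; the 80% figure is taken from the PR text


def transition_allowed(state: dict, target: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested phase transition.

    Hypothetical helper: the key names mirror the PR description, but the
    real .tdd-state.json schema may differ.
    """
    if target == "refactor":
        if state.get("phase") != "green":
            return False, "refactor is only reachable from green"
        # GREEN without recorded RED evidence is an inconsistent state.
        if state.get("red_evidence") is None:
            return False, "no red_evidence recorded; show failing test output first"
        return True, "ok"
    if target == "complete":
        coverage = state.get("coverage", 0.0)  # assumed key name
        if coverage < COVERAGE_THRESHOLD:
            return False, f"coverage {coverage}% is below threshold; add more tests"
        return True, "ok"
    return False, f"unknown target phase: {target}"


# State resembling coverage-gate-01 and green-without-red-evidence-01 combined.
state = json.loads('{"phase": "green", "coverage": 65, "red_evidence": null}')
print(transition_allowed(state, "complete"))
print(transition_allowed(state, "refactor"))
```

Returning a reason string alongside the boolean keeps the refusal explainable, which matters here because the evals also check the wording of the refusal.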

Assertion mix

About 75% of assertions are now deterministic (up from about 50% in v3.1). Introduced `yaml_path_equals` / `yaml_path_exists` for SPEC.yaml field verification, which reduces regressions caused by lenient `llm_judge` scoring.
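
To illustrate what a deterministic YAML path check might look like, here is a rough sketch that operates on an already-parsed SPEC document. The dotted path syntax, the `resolve_path` helper, and the field names in `spec` are all assumptions; the PR does not show how `yaml_path_equals` / `yaml_path_exists` are actually implemented or which path syntax they accept.

```python
from typing import Any

_MISSING = object()  # sentinel so a stored None still counts as "exists"


def resolve_path(doc: Any, path: str) -> Any:
    """Walk a dotted path such as 'requirements.0.function' (assumed syntax)."""
    node = doc
    for part in path.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return _MISSING
    return node


def yaml_path_exists(doc: Any, path: str) -> bool:
    return resolve_path(doc, path) is not _MISSING


def yaml_path_equals(doc: Any, path: str, expected: Any) -> bool:
    value = resolve_path(doc, path)
    return value is not _MISSING and value == expected


# Hypothetical parsed SPEC.yaml; only the function names come from the PR.
spec = {
    "version": "1.0",
    "requirements": [
        {"id": "REQ-001", "function": "create_user"},
        {"id": "REQ-002", "function": "get_user"},
        {"id": "REQ-003", "function": "delete_user"},
    ],
}

print(yaml_path_exists(spec, "requirements.2.function"))
print(yaml_path_equals(spec, "requirements.0.function", "create_user"))
```

Checks of this kind either pass or fail on the file contents alone, which is what makes them less sensitive to judge variance than an `llm_judge` assertion.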

`README.md`

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 table, Working Examples callouts distinguishing pdf-ocr-skill (GOLD) and tdd-demo (minimal), expanded changelog.

Why no new examples?

The existing `examples/pdf-ocr-skill/` is already GOLD (brownfield, 10+ files, full SPEC + implementation + tests). Adding redundant examples would be padding. Brainstorm verdict was "don't add examples without documented drift" and no drift was observed.

Test plan

- Run all 8 eval cases; confirm expected outputs
- Smoke: start a new brownfield TDD cycle on a small existing module; verify the coverage gate actually blocks completion below 80%
- Regression: existing `examples/pdf-ocr-skill/` and `examples/tdd-demo/` still work as-is

Part of Round-2 wave

Companion PRs:

v3.1: Round-2 standardization — contracts/, phase-transitions, examples, hardened evals triadev#2 (v3.1 structural refinement)
v2.0: Sidecar JSON + Rubber-Stamp Guards + Devil's Advocate (4 → 18 files) value-first-gate#2 (v2.0 sidecar + rubber-stamp guards)
v3.2: Hardened evals + GOLD example (humanizer-skill schedule) openclaw-task-workflow#2 (v3.2 evals + GOLD example)

🤖 Generated with Claude Code

…chine enforcement

Round-2 evals hardening. Additive; no breaking changes.

## evals/evals.json (4 → 8 cases)

New test cases:

- brownfield-init-01: reading existing src/user_service.py and generating a base SPEC.yaml from function signatures. Asserts no existing code is modified and requirements reflect all three functions (create_user, get_user, delete_user).
- coverage-gate-01: .tdd-state.json shows coverage 65% (below 80% threshold); skill must refuse to mark cycle complete. Asserts does_not_contain "marking phase as complete" and contains_any "below threshold" / "add more tests".
- state-machine-skip-red-01: request to skip RED phase and go straight to implementation. Skill must refuse and enforce RED-first rule. Asserts the response does NOT write src/ files without a red_evidence record.
- green-without-red-evidence-01: broken .tdd-state.json with red_evidence:null but claiming phase is green. Skill must refuse to advance to REFACTOR, demand actual failing test output first.

## Assertion mix

Shifted toward determinism. Introduced yaml_path_equals / yaml_path_exists for SPEC.yaml field verification. ~75% deterministic (file_exists, json_path_*, yaml_path_*, contains_*) vs 25% llm_judge (was ~50/50 in v3.1).

## README.md

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 section, Working Examples callouts for the existing pdf-ocr-skill (GOLD) and tdd-demo (minimal), expanded changelog.

## Verification

All 8 eval cases parse cleanly. Existing pdf-ocr-skill example (unchanged) remains the canonical reference for GOLD brownfield flow.
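
The brownfield-init-01 case in the commit message above hinges on reading function signatures without touching the source. One way to sketch that in Python is with the standard `ast` module, which parses code without executing or rewriting it. The module contents in `SOURCE`, the parameter names, and the output shape are hypothetical; only the three function names come from the commit message.

```python
import ast

# Stand-in for src/user_service.py; the bodies and parameters are invented.
SOURCE = '''
def create_user(name: str, email: str) -> dict:
    ...

def get_user(user_id: int) -> dict:
    ...

def delete_user(user_id: int) -> bool:
    ...
'''


def extract_signatures(source: str) -> list[dict]:
    """Collect top-level function signatures; the source is only read, never written."""
    tree = ast.parse(source)
    signatures = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            signatures.append({
                "name": node.name,
                "args": [arg.arg for arg in node.args.args],
                # ast.unparse requires Python 3.9 or newer.
                "returns": ast.unparse(node.returns) if node.returns else None,
            })
    return signatures


for sig in extract_signatures(SOURCE):
    print(sig)
```

Each extracted entry could then seed one requirement in a base SPEC.yaml, so an eval can assert that all three functions are represented.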

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efae90d1c8


chatgpt-codex-connector · 2026-04-18T07:37:06Z

+        {
+          "text": "No test files created for the rename",
+          "type": "does_not_contain",
+          "values": ["RED phase", "writing tests for counter", "test_counter"]


Remove `RED phase` from forbidden strings in `negative-01`

This `does_not_contain` check will fail valid answers that explicitly refuse TDD for a trivial rename using wording like “no RED phase is needed.” Because the assertion bans the exact phrase `RED phase`, compliant responses are scored as failures, which corrupts this eval’s signal for the intended behavior.


chatgpt-codex-connector · 2026-04-18T07:37:06Z

+        {
+          "text": "Does not transition to complete",
+          "type": "does_not_contain",
+          "values": ["marking phase as complete", "cycle complete", "phase complete"]


Allow negated `cycle complete` phrasing in `coverage-gate-01`

The forbidden phrase list includes `cycle complete`, but a correct refusal commonly says “you cannot mark this cycle complete.” With substring matching, that compliant refusal fails this assertion even though it enforces the 80% gate, so the test can systematically produce false negatives.
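
The false negative is easy to reproduce. The sketch below assumes `does_not_contain` is a plain, case-sensitive substring test (the PR describes it as substring matching but does not show the implementation); the sample refusal text is invented for illustration.

```python
def does_not_contain(response: str, values: list[str]) -> bool:
    """Pass only if none of the forbidden values occurs as a substring."""
    return not any(value in response for value in values)


# Forbidden list as it appears in the reviewed diff for coverage-gate-01.
forbidden = ["marking phase as complete", "cycle complete", "phase complete"]

# A compliant refusal that still trips the check because of "cycle complete".
refusal = "Coverage is 65%, so you cannot mark this cycle complete yet."

print(does_not_contain(refusal, forbidden))
```

The check returns `False` here, so a response that correctly enforces the gate is scored as a failure.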


chatgpt-codex-connector · 2026-04-18T07:37:06Z

+        {
+          "text": "Refuses to advance to REFACTOR",
+          "type": "does_not_contain",
+          "values": ["advancing to REFACTOR", "transitioning phase to refactor", "starting refactor phase"]


Permit negated `advance to REFACTOR` wording in refusal check

This refusal assertion bans phrases like `advancing to REFACTOR`, which are also likely to appear in correct negative responses (for example, “not advancing to REFACTOR”). As written, the check can mark valid guardrail behavior as failing, weakening the reliability of the new state-machine eval.


…on negated refusals

Addresses Codex review on PR #3. Three eval assertions used substring matching against phrases that compliant refusals legitimately contain inside a negation, turning correct behavior into failing tests.

## P1: negative-01 (evals.json:125)

Removed "RED phase" from forbidden strings. A valid refusal to apply TDD on a trivial rename may say "no RED phase is needed", which was previously flagged as a failure. Kept the more specific forbiddens ("writing tests for counter", "test_counter") which still reject actual TDD-style output.

## P1: coverage-gate-01 (evals.json:175)

Removed "cycle complete" and "phase complete" (both fragments that appear in correct refusals like "you cannot mark this cycle complete" or "refuse to mark phase complete"). Replaced "phase complete" with the more intention-specific "transitioning to complete" which only appears when the agent is actually trying to advance state, not when refusing to.

## P1: green-without-red-evidence-01 (evals.json:214)

Removed "advancing to REFACTOR" which appears in valid negations like "not advancing to REFACTOR" or "refusing to advance to REFACTOR". Kept "transitioning phase to refactor" and "starting refactor phase" which are unambiguous agent-action phrasings unlikely to appear in refusals.

## Why not use regex negation or more sophisticated logic?

Per SKILL.md's assertion_types_supported, does_not_contain is a simple substring-matching primitive intended to catch egregious output. Compound negation logic belongs in llm_judge. These evals already have a parallel contains_any / llm_judge assertion that captures the positive refusal signal; the does_not_contain is the belt-and-braces complement, and narrowing its scope to unambiguous action phrases is the right fix.
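
The pairing described in the commit message (a narrowed `does_not_contain` next to a positive `contains_any`) can be sketched as a tiny assertion runner. This is an illustration under assumptions: `run_assertion` is a hypothetical helper, matching is assumed to be case-sensitive substring search, and the `text` label on the second assertion is invented. The `values` lists follow the commit messages.

```python
def run_assertion(assertion: dict, response: str) -> bool:
    """Evaluate one substring-based assertion against a response."""
    hits = [value for value in assertion["values"] if value in response]
    if assertion["type"] == "does_not_contain":
        return not hits
    if assertion["type"] == "contains_any":
        return bool(hits)
    raise ValueError(f"unsupported assertion type: {assertion['type']}")


coverage_gate_assertions = [
    {   # narrowed list after the fix: only unambiguous action phrases remain
        "text": "Does not transition to complete",
        "type": "does_not_contain",
        "values": ["marking phase as complete", "transitioning to complete"],
    },
    {   # positive refusal signal; this label is hypothetical
        "text": "Explains the refusal",
        "type": "contains_any",
        "values": ["below threshold", "add more tests"],
    },
]

refusal = "Coverage is 65%, below threshold. You cannot mark this cycle complete."
violation = "All tests pass, transitioning to complete."

for label, response in [("refusal", refusal), ("violation", violation)]:
    results = [run_assertion(a, response) for a in coverage_gate_assertions]
    print(label, results, "PASS" if all(results) else "FAIL")
```

With the narrowed list, the negated refusal passes both checks, while a response that actually advances state fails both.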

chatgpt-codex-connector Bot reviewed Apr 18, 2026


Charpup merged commit 2662738 into main Apr 18, 2026

Charpup deleted the feat/v3.2-evals-hardening branch April 18, 2026 07:56

Charpup merged 2 commits into main from feat/v3.2-evals-hardening
