v3.2: Hardened evals — brownfield init, coverage gate, state-machine enforcement#3

Merged: Charpup merged 2 commits into `main` from `feat/v3.2-evals-hardening` on Apr 18, 2026

@Charpup Charpup commented Apr 18, 2026

Summary

Round-2 evals hardening. Additive; no breaking changes. The existing `examples/pdf-ocr-skill/` (GOLD) and `examples/tdd-demo/` remain canonical — no new examples added because the existing ones already cover the spectrum.

Changes

`evals/evals.json` (4 → 8 cases)

Four new test cases target distinct failure modes:

| ID | What it catches |
| --- | --- |
| `brownfield-init-01` | Generates base SPEC.yaml from existing `src/user_service.py` signatures without modifying the existing code |
| `coverage-gate-01` | Refuses to mark cycle complete when `.tdd-state.json` coverage is 65% (< 80% threshold). Asserts the skill does NOT transition phase to "complete" |
| `state-machine-skip-red-01` | Refuses to skip the RED phase; enforces "write tests first, see them fail, record evidence". Asserts no `src/` implementation is written before tests |
| `green-without-red-evidence-01` | Broken `.tdd-state.json` with `red_evidence: null` but phase="green". Skill must refuse to advance to REFACTOR |
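To make the `brownfield-init-01` case concrete, here is a minimal sketch of reading function signatures without touching the source file. It is an illustration, not the skill's implementation: the parameter lists and return types in `SOURCE`, the helper name `extract_signatures`, and the output keys are all assumptions; only the three function names come from this PR.

```python
import ast

# Hypothetical stand-in for src/user_service.py; only the function names
# (create_user, get_user, delete_user) are taken from the PR.
SOURCE = '''
def create_user(name: str, email: str) -> dict: ...
def get_user(user_id: int) -> dict: ...
def delete_user(user_id: int) -> bool: ...
'''


def extract_signatures(source: str) -> list[dict]:
    """Parse source text read-only and list top-level function signatures."""
    tree = ast.parse(source)
    signatures = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            signatures.append({
                "function": node.name,
                "params": [arg.arg for arg in node.args.args],
                # ast.unparse requires Python 3.9+
                "returns": ast.unparse(node.returns) if node.returns else None,
            })
    return signatures


for sig in extract_signatures(SOURCE):
    print(sig["function"], sig["params"], sig["returns"])
```

Because the sketch only parses text, the existing code is never written to, which is the property the eval case checks for.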

Assertion mix

~75% deterministic now (was ~50% in v3.1). Introduced `yaml_path_equals` / `yaml_path_exists` for SPEC.yaml field verification. Reduces judge-leniency regressions.
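As a rough illustration of what `yaml_path_exists` and `yaml_path_equals` might check, the sketch below walks a dotted path through an already-parsed document. The dotted-path syntax, the `resolve_path` helper, and the sample SPEC fields are assumptions made for this example; the real path syntax is defined by the skill, not here. The YAML is represented as a plain dict so the snippet needs no third-party parser.

```python
from typing import Any

_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def resolve_path(doc: Any, path: str) -> Any:
    """Follow a dotted path such as 'requirements.0.function' (assumed syntax)."""
    current = doc
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return _MISSING
    return current


def yaml_path_exists(doc: Any, path: str) -> bool:
    return resolve_path(doc, path) is not _MISSING


def yaml_path_equals(doc: Any, path: str, expected: Any) -> bool:
    return resolve_path(doc, path) == expected


# Hypothetical parsed SPEC.yaml; field names are illustrative only.
spec = {
    "requirements": [
        {"function": "create_user"},
        {"function": "get_user"},
        {"function": "delete_user"},
    ]
}

print(yaml_path_exists(spec, "requirements.2.function"))
print(yaml_path_equals(spec, "requirements.0.function", "create_user"))
```

Checks of this kind return the same answer on every run, which is why they are less exposed to judge leniency than an `llm_judge` assertion.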

`README.md`

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 table, Working Examples callouts distinguishing pdf-ocr-skill (GOLD) and tdd-demo (minimal), expanded changelog.

Why no new examples?

The existing `examples/pdf-ocr-skill/` is already GOLD (brownfield, 10+ files, full SPEC + implementation + tests). Adding redundant examples would be padding. Brainstorm verdict was "don't add examples without documented drift" and no drift was observed.

Test plan

  • Run all 8 eval cases; confirm expected outputs
  • Smoke: start a new brownfield TDD cycle on a small existing module; verify the coverage gate actually blocks completion below 80%
  • Regression: existing `examples/pdf-ocr-skill/` and `examples/tdd-demo/` still work as-is

Part of Round-2 wave

Companion PRs:

🤖 Generated with Claude Code

…chine enforcement

Round-2 evals hardening. Additive; no breaking changes.

## evals/evals.json (4 → 8 cases)

New test cases:

- brownfield-init-01: reading existing src/user_service.py and generating a
  base SPEC.yaml from function signatures. Asserts no existing code is
  modified and requirements reflect all three functions (create_user,
  get_user, delete_user).
- coverage-gate-01: .tdd-state.json shows coverage 65% (below 80% threshold);
  skill must refuse to mark cycle complete. Asserts does_not_contain "marking
  phase as complete" and contains_any "below threshold" / "add more tests".
- state-machine-skip-red-01: request to skip RED phase and go straight to
  implementation. Skill must refuse and enforce RED-first rule. Asserts the
  response does NOT write src/ files without a red_evidence record.
- green-without-red-evidence-01: broken .tdd-state.json with red_evidence:null
  but claiming phase is green. Skill must refuse to advance to REFACTOR,
  demand actual failing test output first.
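The three gate cases above share one idea: a phase transition is refused unless its precondition is recorded in `.tdd-state.json`. A minimal sketch of that guard follows, assuming the state file has `red_evidence` and `coverage` keys and that coverage is stored as a percentage. The function name `blocked_reason` and the `coverage` key are hypothetical; the skill itself is prompt-driven and may enforce this differently.

```python
from typing import Optional

COVERAGE_THRESHOLD = 80  # percent, the threshold named in coverage-gate-01
PHASES_AFTER_RED = ("green", "refactor", "complete")


def blocked_reason(state: dict, target_phase: str) -> Optional[str]:
    """Return why the transition must be refused, or None if it is allowed."""
    # RED-first rule: nothing past RED without recorded failing-test evidence.
    if target_phase in PHASES_AFTER_RED and not state.get("red_evidence"):
        return "no red_evidence recorded; show failing test output first"
    # Coverage gate: the cycle cannot complete below the threshold.
    coverage = state.get("coverage", 0)
    if target_phase == "complete" and coverage < COVERAGE_THRESHOLD:
        return f"coverage {coverage}% is below threshold ({COVERAGE_THRESHOLD}%)"
    return None


# Mirrors green-without-red-evidence-01: phase claims green, evidence is null.
broken_state = {"phase": "green", "red_evidence": None, "coverage": 90}
print(blocked_reason(broken_state, "refactor"))
```

Returning a reason string rather than a bare boolean keeps the refusal text available for the `contains_any` checks described in the coverage-gate case.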

## Assertion mix

Shifted toward determinism. Introduced yaml_path_equals / yaml_path_exists
for SPEC.yaml field verification. ~75% deterministic (file_exists,
json_path_*, yaml_path_*, contains_*) vs 25% llm_judge (was ~50/50 in v3.1).

## README.md

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2
section, Working Examples callouts for the existing pdf-ocr-skill (GOLD)
and tdd-demo (minimal), expanded changelog.

## Verification

All 8 eval cases parse cleanly. Existing pdf-ocr-skill example
(unchanged) remains the canonical reference for GOLD brownfield flow.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efae90d1c8


Comment thread evals/evals.json Outdated
{
"text": "No test files created for the rename",
"type": "does_not_contain",
"values": ["RED phase", "writing tests for counter", "test_counter"]

P1: Remove `RED phase` from forbidden strings in `negative-01`

This does_not_contain check will fail valid answers that explicitly refuse TDD for a trivial rename using wording like “no RED phase is needed.” Because the assertion bans the exact phrase RED phase, compliant responses are scored as failures, which corrupts this eval’s signal for the intended behavior.


Comment thread evals/evals.json Outdated
{
"text": "Does not transition to complete",
"type": "does_not_contain",
"values": ["marking phase as complete", "cycle complete", "phase complete"]

P1: Allow negated `cycle complete` phrasing in `coverage-gate-01`

The forbidden phrase list includes cycle complete, but a correct refusal commonly says “you cannot mark this cycle complete.” With substring matching, that compliant refusal fails this assertion even though it enforces the 80% gate, so the test can systematically produce false negatives.


Comment thread evals/evals.json Outdated
{
"text": "Refuses to advance to REFACTOR",
"type": "does_not_contain",
"values": ["advancing to REFACTOR", "transitioning phase to refactor", "starting refactor phase"]

P1: Permit negated `advance to REFACTOR` wording in refusal check

This refusal assertion bans phrases like advancing to REFACTOR, which are also likely to appear in correct negative responses (for example, “not advancing to REFACTOR”). As written, the check can mark valid guardrail behavior as failing, weakening the reliability of the new state-machine eval.


…on negated refusals

Addresses Codex review on PR #3. Three eval assertions used substring
matching against phrases that compliant refusals legitimately contain
inside a negation, turning correct behavior into failing tests.

## P1 — negative-01 (evals.json:125)

Removed "RED phase" from forbidden strings. A valid refusal to apply TDD
on a trivial rename may say "no RED phase is needed" — previously flagged
as failure. Kept the more specific forbiddens ("writing tests for counter",
"test_counter") which still reject actual TDD-style output.

## P1 — coverage-gate-01 (evals.json:175)

Removed "cycle complete" and "phase complete"; both fragments appear
in correct refusals like "you cannot mark this cycle complete" or
"refuse to mark phase complete". In place of "phase complete", added the
more intention-specific "transitioning to complete", which only appears
when the agent is actually trying to advance state, not when refusing to.
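The effect of this change can be reproduced with a small sketch of the substring check. This assumes `does_not_contain` is a case-insensitive substring match (case handling is an assumption, not confirmed by the PR) and that the narrowed list keeps "marking phase as complete"; the two sample responses are invented for illustration.

```python
def does_not_contain(response: str, values: list[str]) -> bool:
    """Pass only if none of the forbidden phrases occurs as a substring."""
    text = response.lower()
    return all(value.lower() not in text for value in values)


OLD_FORBIDDEN = ["marking phase as complete", "cycle complete", "phase complete"]
NEW_FORBIDDEN = ["marking phase as complete", "transitioning to complete"]

# A compliant refusal and a non-compliant advance (both hypothetical).
refusal = ("Coverage is 65%, below threshold. You cannot mark this "
           "cycle complete yet; add more tests.")
advance = "Tests pass, so I am transitioning to complete."

print(does_not_contain(refusal, OLD_FORBIDDEN))  # False: refusal wrongly fails
print(does_not_contain(refusal, NEW_FORBIDDEN))  # True: refusal now passes
print(does_not_contain(advance, NEW_FORBIDDEN))  # False: real advance still caught
```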

## P1 — green-without-red-evidence-01 (evals.json:214)

Removed "advancing to REFACTOR" which appears in valid negations like
"not advancing to REFACTOR" or "refusing to advance to REFACTOR". Kept
"transitioning phase to refactor" and "starting refactor phase" which are
unambiguous agent-action phrasings unlikely to appear in refusals.

## Why not use regex negation or more sophisticated logic?

Per SKILL.md's assertion_types_supported, does_not_contain is a simple
substring-matching primitive intended to catch egregious output. Compound
negation logic belongs in llm_judge. These evals already have a parallel
contains_any / llm_judge assertion that captures the positive refusal
signal; the does_not_contain is the belt-and-braces complement, and
narrowing its scope to unambiguous action phrases is the right fix.
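The pairing described above can be sketched as a tiny evaluator in which a case passes only when every assertion holds. The `type` and `values` keys match the assertion objects in `evals/evals.json`; the `run_assertion` and `case_passes` helpers, the case-insensitive matching, and the exact narrowed phrase list are assumptions for illustration, not the harness's actual code.

```python
def run_assertion(assertion: dict, response: str) -> bool:
    """Evaluate one substring-based assertion against a response."""
    text = response.lower()
    hits = [value.lower() in text for value in assertion["values"]]
    if assertion["type"] == "does_not_contain":
        return not any(hits)
    if assertion["type"] == "contains_any":
        return any(hits)
    raise ValueError(f"unsupported assertion type: {assertion['type']}")


# Assumed shape of the coverage-gate-01 assertions after the fix.
COVERAGE_GATE_ASSERTIONS = [
    {"type": "does_not_contain",
     "values": ["marking phase as complete", "transitioning to complete"]},
    {"type": "contains_any",
     "values": ["below threshold", "add more tests"]},
]


def case_passes(response: str) -> bool:
    return all(run_assertion(a, response) for a in COVERAGE_GATE_ASSERTIONS)


print(case_passes("Coverage is 65%, which is below threshold, "
                  "so I cannot mark this cycle complete."))
```

With this split, the `contains_any` check carries the positive refusal signal, so the forbidden list can stay narrow without letting a silent non-refusal pass.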
@Charpup Charpup merged commit 2662738 into main Apr 18, 2026
@Charpup Charpup deleted the feat/v3.2-evals-hardening branch April 18, 2026 07:56