
v3.2: Hardened evals + GOLD example (humanizer-skill schedule) #2

Merged: Charpup merged 2 commits into main from feat/v3.2-evals-and-examples on Apr 18, 2026
@Charpup Charpup commented Apr 18, 2026

Summary

Round-2 standardization pass. Additive; no breaking changes.

Changes

New: `examples/humanizer-skill-schedule/` (GOLD)

Real completed run from the humanizer-skill build (merged). 21 tasks across 9 author-defined batches, which the scheduler reshapes into 12 dependency-aware batches. 4 files:

  • `README.md` — explains GOLD labeling and characteristic lessons
  • `input-task_plan.md` — verbatim snapshot of the real plan
  • `output-schedule.json` — full DAG output (tasks_extracted + batches + per-batch metadata + summary)
  • `output-handoff-snippet.json` — how the same schedule lives inside triadev-handoff.json

Demonstrates:

  • Fan-out: T7 (merge 32 patterns, complexity 7) unlocks 5 downstream references
  • Fan-in: T13 (SKILL.md draft) requires all 5 reference files
  • Within-batch complexity ordering: T1(1), T2(1), T16(1), T6(4) in batch 1
  • Critical path length 12, critical-path complexity sum 35 (corrected to 33 after review), max parallelism 5
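The fan-out, fan-in, and within-batch ordering described above can be illustrated with a layered topological sort. This is a minimal sketch, not the scheduler's actual implementation: the function name `schedule_batches` and the toy task list are hypothetical, and only the field names `id`, `complexity`, and `dependencies` are taken from the example files.

```python
def schedule_batches(tasks):
    """Group tasks into dependency-aware batches (layered topological sort).

    Assumption: a task becomes ready once all of its dependencies sit in an
    earlier batch, and ready tasks are sorted ascending by complexity.
    """
    complexity = {t["id"]: t["complexity"] for t in tasks}
    remaining = {t["id"]: set(t["dependencies"]) for t in tasks}
    done = set()
    batches = []
    while remaining:
        ready = [tid for tid, deps in remaining.items() if deps <= done]
        if not ready:
            raise ValueError("cycle or missing dependency among: "
                             + ", ".join(sorted(remaining)))
        # list.sort is stable, so equal-complexity tasks keep plan order.
        ready.sort(key=lambda tid: complexity[tid])
        batches.append(ready)
        done.update(ready)
        for tid in ready:
            del remaining[tid]
    return batches


# Hypothetical toy plan (not the humanizer-skill plan):
# D fans in from A, B, C and fans out to E and F.
toy_tasks = [
    {"id": "A", "complexity": 2, "dependencies": []},
    {"id": "B", "complexity": 1, "dependencies": []},
    {"id": "C", "complexity": 4, "dependencies": []},
    {"id": "D", "complexity": 3, "dependencies": ["A", "B", "C"]},
    {"id": "E", "complexity": 3, "dependencies": ["D"]},
    {"id": "F", "complexity": 1, "dependencies": ["D"]},
]
toy_batches = schedule_batches(toy_tasks)
print(toy_batches)
```

In this toy plan the number of batches equals the length of the longest dependency chain, which is why a 9-batch author plan can legitimately be reshaped into more batches once real dependencies are taken into account.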

Hardened: `evals/evals.json` (4 → 8 cases)

New test cases:

  • batch-ordering-complexity-01: verifies within-batch tasks ordered ascending by complexity
  • cross-session-persistence-01: CST 00:00 migration behavior — completed stays, running+pending migrate
  • standalone-mode-01: works without triadev-handoff.json; reads task_plan.md directly
  • dynamic-insertion-01: mid-execution insertion re-schedules only non-completed tasks
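The rule exercised by cross-session-persistence-01 can be pictured as a simple partition by status. The sketch below is an assumption about the shape of the data, not the skill's real code: `migrate_at_rollover` and the `status` field layout are hypothetical, and the `"migrated"` status value is taken from the commit message for this PR.

```python
def migrate_at_rollover(tasks):
    """Split tasks into (stay, carried_over) at the session boundary.

    Assumption: each task is a dict with a "status" key. Completed tasks stay
    in the old session; running and pending tasks are copied with status
    "migrated". Any other status is left in place untouched.
    """
    stay, carried_over = [], []
    for task in tasks:
        if task["status"] in ("running", "pending"):
            carried_over.append({**task, "status": "migrated"})
        else:
            stay.append(task)
    return stay, carried_over


# Hypothetical session contents.
session = [
    {"id": "T1", "status": "completed"},
    {"id": "T2", "status": "running"},
    {"id": "T3", "status": "pending"},
]
stay, carried_over = migrate_at_rollover(session)
print([t["id"] for t in stay], [t["id"] for t in carried_over])
```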

Assertion types shifted toward determinism. Introduced `sequence_order` (for batch ordering) and `json_path_length`. Now ~80% deterministic vs ~50% in v3.1.
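To make the two new assertion types concrete, here is one way such checks could be evaluated. The PR does not spell out their semantics, so everything here is an assumption: the function names, the dotted-path syntax, and the reading of `sequence_order` as "these items appear in this relative order" are all hypothetical.

```python
def check_sequence_order(actual, expected_order):
    """Pass if every item of expected_order occurs in actual, in that relative order."""
    if any(item not in actual for item in expected_order):
        return False
    positions = [actual.index(item) for item in expected_order]
    return positions == sorted(positions)


def check_json_path_length(document, path, expected_length):
    """Pass if the list or dict found at a dotted key path has the expected length.

    Assumption: the path uses plain dotted keys only (no array indices or wildcards).
    """
    node = document
    for key in path.split("."):
        node = node[key]
    return len(node) == expected_length


# Trimmed to batch 1 of the GOLD example: T1(1), T2(1), T16(1), T6(4).
trimmed = {"batches": [["T1", "T2", "T16", "T6"]]}
order_ok = check_sequence_order(trimmed["batches"][0], ["T1", "T16", "T6"])
length_ok = check_json_path_length(trimmed, "batches", 1)
print(order_ok, length_ok)
```

Both checks return a plain boolean with no model in the loop, which is the property that makes them deterministic in contrast to an `llm_judge` assertion.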

Rewritten

  • `README.md` aligned with sister-skill style. What's New in v3.2 table, Working Example callout, changelog entry.

Test plan

  • Run all 8 eval cases; confirm expected outputs
  • Smoke test on the humanizer-skill input task_plan.md; verify output matches `examples/humanizer-skill-schedule/output-schedule.json`
  • Verify batch ordering: tasks with identical dependencies sort ascending by complexity

Part of Round-2 wave

Companion PRs:

🤖 Generated with Claude Code

Round-2 standardization pass. All changes are additive; no breaking changes.

## Changes

### examples/humanizer-skill-schedule/ (new)
Real completed run from the humanizer-skill build (PR blader/humanizer#94 merged).
21 tasks across 9 author-defined batches, which the scheduler reshapes into 12
dependency-aware batches. 4 files:

- README.md: explains GOLD labeling and characteristic lessons
- input-task_plan.md: verbatim snapshot of the real plan
- output-schedule.json: full DAG output with tasks_extracted, batches,
  batch_metadata (parallelism + max complexity per batch + notes), summary
  (critical path, complexity sum, max parallelism)
- output-handoff-snippet.json: how the same schedule lives inside
  triadev-handoff.json (excerpt, planning + scheduling sections only)

Demonstrates fan-out (T7 → 5 downstream refs), fan-in (T13 ← 5 refs),
within-batch complexity ordering (T1=1, T2=1, T16=1, T6=4 in batch 1).

### evals/evals.json (4 → 8 cases)
New coverage:
- batch-ordering-complexity-01: verifies within-batch tasks are ordered
  ascending by complexity (no-dep tasks only)
- cross-session-persistence-01: CST 00:00 migration — completed tasks stay,
  running + pending get "migrated" status and carried over
- standalone-mode-01: task-workflow works without triadev-handoff.json,
  reading task_plan.md directly; must NOT create triadev-handoff.json
- dynamic-insertion-01: mid-execution insertion of a new task; only
  remaining (non-completed) tasks get re-scheduled
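The behaviour targeted by dynamic-insertion-01 can be sketched as re-running the batching step over non-completed tasks only, treating completed tasks as already-satisfied dependencies. This is an illustrative sketch under those assumptions; `reschedule_remaining` and the toy tasks are hypothetical and not taken from the skill's source.

```python
def reschedule_remaining(tasks, completed_ids):
    """Re-batch only non-completed tasks; completed ones count as satisfied deps."""
    done = set(completed_ids)
    complexity = {t["id"]: t["complexity"] for t in tasks}
    # Drop completed tasks and strip them from everyone else's dependency set.
    remaining = {
        t["id"]: set(t["dependencies"]) - done
        for t in tasks
        if t["id"] not in done
    }
    batches = []
    while remaining:
        ready = [tid for tid, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("cycle or missing dependency in remaining tasks")
        ready.sort(key=lambda tid: complexity[tid])  # ascending complexity
        batches.append(ready)
        for tid in ready:
            del remaining[tid]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Hypothetical plan: A is already completed, X is inserted mid-execution.
plan = [
    {"id": "A", "complexity": 1, "dependencies": []},
    {"id": "B", "complexity": 2, "dependencies": ["A"]},
    {"id": "C", "complexity": 1, "dependencies": ["B"]},
    {"id": "X", "complexity": 1, "dependencies": ["A"]},  # newly inserted
]
new_batches = reschedule_remaining(plan, completed_ids={"A"})
print(new_batches)
```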

Assertion types shifted:
- Added sequence_order type for batch ordering verification
- Added json_path_length for assertion on batches array size
- ~80% of assertions now deterministic (file_exists, json_path_*, sequence_order,
  contains_all/any, does_not_contain) vs ~20% llm_judge (was 50/50 in v3.1)

### README.md
Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 table,
Working Example callout, changelog v3.2 entry.

## Verification

- 3 new JSON files parse cleanly
- Example schedule consistent with SKILL.md's batch-ordering rules

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c6b0235e1


Comment on lines +14 to +15
{"id": "T7", "name": "构建 32 模式合并分类体系", "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
{"id": "T13", "name": "SKILL.md 主体", "complexity": 5, "dependencies": ["T8", "T9", "T10", "T11", "T12"]}

P2: Keep handoff snippet DAG self-consistent

This tasks_extracted sample is internally invalid: T7 and T13 depend on IDs (T4/T5/T6, T8–T12) that are not present in the same array, while scheduling.batches also references many omitted IDs. Because the scheduler requires every dependency ID to exist, this GOLD example cannot be replayed as a valid handoff fixture and can fail validation for users who copy it.
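The self-consistency requirement in this comment amounts to a set-membership check over dependency IDs. The sketch below is a hypothetical helper, not the scheduler's validator; it uses only the two task entries quoted in the review (names omitted) to show what such a check would report.

```python
def find_missing_dependencies(tasks):
    """Map each task ID to the dependency IDs that are absent from the task list."""
    known = {t["id"] for t in tasks}
    missing = {}
    for task in tasks:
        absent = [dep for dep in task["dependencies"] if dep not in known]
        if absent:
            missing[task["id"]] = absent
    return missing


# The two entries quoted in the review comment, without their neighbours.
quoted_tasks = [
    {"id": "T7", "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
    {"id": "T13", "complexity": 5,
     "dependencies": ["T8", "T9", "T10", "T11", "T12"]},
]
missing = find_missing_dependencies(quoted_tasks)
print(missing)
```

An empty result is what a self-consistent fixture should produce; any non-empty mapping names exactly which IDs a consumer would fail to resolve.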


"total_tasks": 21,
"total_batches": 12,
"critical_path": ["T1", "T3", "T4", "T7", "T8", "T13", "T14", "T17", "T18", "T19", "T20", "T21"],
"critical_path_complexity_sum": 35,

P2: Recompute reported critical-path complexity sum

The critical_path_complexity_sum is inconsistent with the provided data: summing complexities for the listed critical_path (T1,T3,T4,T7,T8,T13,T14,T17,T18,T19,T20,T21) yields 33, not 35. This makes the GOLD reference metrics inaccurate and can mislead downstream validation or documentation that trusts this summary field.


…lexity sum (33, not 35)

Addresses Codex review on PR #2.

## P2 — examples/humanizer-skill-schedule/output-handoff-snippet.json

Previously the snippet's planning.tasks_extracted had only 5 tasks (T1, T2,
T3, T7, T13) while scheduling.batches and task.dependencies referenced IDs
T4, T5, T6, T8–T12, T14–T21. That made the excerpt internally invalid as a
handoff fixture — a scheduler consuming this snippet would fail on missing
dependency IDs.

Fix: inlined the full 21-task tasks_extracted array. The snippet is now
self-consistent: every ID in dependencies and batches exists in
tasks_extracted. Updated the _comment to note this.

## P2 — examples/humanizer-skill-schedule/output-schedule.json:59

critical_path_complexity_sum was 35 but recomputing against the listed
critical_path gives 33:

  T1(1) + T3(1) + T4(3) + T7(7) + T8(5) + T13(5) + T14(3) + T17(2)
     + T18(2) + T19(2) + T20(1) + T21(1) = 33

Fixed to 33. Also updated README.md which mentions the number twice
("complexity sum 35" → "complexity sum 33") so human readers and
machine readers agree.
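The recomputation above is easy to keep honest with a small check. This is a sketch rather than part of the repository: the helper name `path_complexity_sum` is hypothetical, while the critical path and per-task complexities are the ones listed in this commit message.

```python
# Complexities of the critical-path tasks, as listed in the commit message.
complexity = {
    "T1": 1, "T3": 1, "T4": 3, "T7": 7, "T8": 5, "T13": 5,
    "T14": 3, "T17": 2, "T18": 2, "T19": 2, "T20": 1, "T21": 1,
}
critical_path = ["T1", "T3", "T4", "T7", "T8", "T13",
                 "T14", "T17", "T18", "T19", "T20", "T21"]


def path_complexity_sum(path, complexity_by_id):
    """Sum the complexity of every task on the given path."""
    return sum(complexity_by_id[task_id] for task_id in path)


total = path_complexity_sum(critical_path, complexity)
print(len(critical_path), total)  # prints "12 33"
```

Deriving the summary field from the task list at generation time, instead of writing it by hand, would prevent this kind of drift between `critical_path` and `critical_path_complexity_sum`.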
@Charpup Charpup merged commit 56c8bec into main Apr 18, 2026
@Charpup Charpup deleted the feat/v3.2-evals-and-examples branch April 18, 2026 07:55