v3.2: Hardened evals + GOLD example (humanizer-skill schedule)#2
Round-2 standardization pass. All additions are additive; no breaking changes.

## Changes

### examples/humanizer-skill-schedule/ (new)

Real completed run from the humanizer-skill build (PR blader/humanizer#94, merged). 21 tasks across 9 author-defined batches, which the scheduler reshapes into 12 dependency-aware batches. 4 files:

- README.md: explains GOLD labeling and characteristic lessons
- input-task_plan.md: verbatim snapshot of the real plan
- output-schedule.json: full DAG output with tasks_extracted, batches, batch_metadata (parallelism, max complexity per batch, notes), and summary (critical path, complexity sum, max parallelism)
- output-handoff-snippet.json: how the same schedule lives inside triadev-handoff.json (excerpt; planning + scheduling sections only)

Demonstrates fan-out (T7 → 5 downstream refs), fan-in (T13 ← 5 refs), and within-batch complexity ordering (T1=1, T2=1, T16=1, T6=4 in batch 1).

### evals/evals.json (4 → 8 cases)

New coverage:

- batch-ordering-complexity-01: verifies that within-batch tasks are ordered ascending by complexity (no-dep tasks only)
- cross-session-persistence-01: CST 00:00 migration — completed tasks stay; running and pending tasks get "migrated" status and are carried over
- standalone-mode-01: task-workflow works without triadev-handoff.json, reading task_plan.md directly; must NOT create triadev-handoff.json
- dynamic-insertion-01: mid-execution insertion of a new task; only remaining (non-completed) tasks are re-scheduled

Assertion types shifted:

- Added a sequence_order type for batch-ordering verification
- Added json_path_length for asserting on the size of the batches array
- ~80% of assertions are now deterministic (file_exists, json_path_*, sequence_order, contains_all/any, does_not_contain) vs ~20% llm_judge (was 50/50 in v3.1)

### README.md

Full rewrite aligning with the sister-skill docs. Adds a "What's New in v3.2" table, a Working Example callout, and a v3.2 changelog entry.

## Verification

- 3 new JSON files parse cleanly
- Example schedule is consistent with SKILL.md's batch-ordering rules
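The within-batch ordering rule that batch-ordering-complexity-01 asserts can be sketched as a small check. The field names (`tasks_extracted`, `batches`, `complexity`, `dependencies`) follow the example files described above; the helper function and the inline schedule literal are illustrative, not part of the skill's actual eval harness:

```python
def batch_complexity_ordered(schedule: dict) -> bool:
    """Check that no-dependency tasks in each batch appear in ascending complexity."""
    by_id = {t["id"]: t for t in schedule["tasks_extracted"]}
    for batch in schedule["batches"]:
        # Only tasks without dependencies participate in the ordering rule.
        free = [by_id[tid]["complexity"] for tid in batch
                if not by_id[tid]["dependencies"]]
        if free != sorted(free):
            return False
    return True

# Batch 1 from the GOLD example: T1=1, T2=1, T16=1, T6=4 — ascending, so OK.
schedule = {
    "tasks_extracted": [
        {"id": "T1", "complexity": 1, "dependencies": []},
        {"id": "T2", "complexity": 1, "dependencies": []},
        {"id": "T16", "complexity": 1, "dependencies": []},
        {"id": "T6", "complexity": 4, "dependencies": []},
    ],
    "batches": [["T1", "T2", "T16", "T6"]],
}
print(batch_complexity_ordered(schedule))  # True
```

A deterministic check like this is what makes the shift away from llm_judge assertions possible: the pass/fail result depends only on the JSON output.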
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c6b0235e1
```json
{"id": "T7", "name": "Build the 32-pattern merge classification taxonomy", "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
{"id": "T13", "name": "SKILL.md main body", "complexity": 5, "dependencies": ["T8", "T9", "T10", "T11", "T12"]}
```
Keep handoff snippet DAG self-consistent
This tasks_extracted sample is internally invalid: T7 and T13 depend on IDs (T4/T5/T6, T8–T12) that are not present in the same array, while scheduling.batches also references many omitted IDs. Because the scheduler requires every dependency ID to exist, this GOLD example cannot be replayed as a valid handoff fixture and can fail validation for users who copy it.
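The missing-ID problem the reviewer describes is straightforward to detect. A minimal sketch, assuming the `tasks_extracted` structure shown in the diff above (`missing_dependency_ids` is a hypothetical helper, not part of the skill's code):

```python
def missing_dependency_ids(tasks: list[dict]) -> set[str]:
    """Return dependency IDs referenced by tasks but absent from the same array."""
    present = {t["id"] for t in tasks}
    referenced = {dep for t in tasks for dep in t.get("dependencies", [])}
    return referenced - present

# The pre-fix snippet: T7 and T13 reference IDs the excerpt omits.
snippet_tasks = [
    {"id": "T7", "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
    {"id": "T13", "complexity": 5, "dependencies": ["T8", "T9", "T10", "T11", "T12"]},
]
print(sorted(missing_dependency_ids(snippet_tasks)))
# ['T10', 'T11', 'T12', 'T4', 'T5', 'T6', 'T8', 'T9']
```

An empty result set is exactly the self-consistency property the fix below restores by inlining the full 21-task array.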
```json
"total_tasks": 21,
"total_batches": 12,
"critical_path": ["T1", "T3", "T4", "T7", "T8", "T13", "T14", "T17", "T18", "T19", "T20", "T21"],
"critical_path_complexity_sum": 35,
```
Recompute reported critical-path complexity sum
The critical_path_complexity_sum is inconsistent with the provided data: summing complexities for the listed critical_path (T1,T3,T4,T7,T8,T13,T14,T17,T18,T19,T20,T21) yields 33, not 35. This makes the GOLD reference metrics inaccurate and can mislead downstream validation or documentation that trusts this summary field.
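Using the per-task complexities given in the follow-up fix (T1=1, T3=1, T4=3, T7=7, T8=5, T13=5, T14=3, T17=2, T18=2, T19=2, T20=1, T21=1), the recomputation the reviewer asks for is a one-liner:

```python
# Per-task complexities for the nodes on the reported critical path.
complexity = {"T1": 1, "T3": 1, "T4": 3, "T7": 7, "T8": 5, "T13": 5,
              "T14": 3, "T17": 2, "T18": 2, "T19": 2, "T20": 1, "T21": 1}
critical_path = ["T1", "T3", "T4", "T7", "T8", "T13", "T14",
                 "T17", "T18", "T19", "T20", "T21"]
print(sum(complexity[t] for t in critical_path))  # 33, not the reported 35
```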
…lexity sum (33, not 35)

Addresses Codex review on PR #2.

## P2 — examples/humanizer-skill-schedule/output-handoff-snippet.json

Previously the snippet's planning.tasks_extracted had only 5 tasks (T1, T2, T3, T7, T13), while scheduling.batches and task.dependencies referenced IDs T4, T5, T6, T8–T12, and T14–T21. That made the excerpt internally invalid as a handoff fixture: a scheduler consuming this snippet would fail on missing dependency IDs.

Fix: inlined the full 21-task tasks_extracted array. The snippet is now self-consistent: every ID in dependencies and batches exists in tasks_extracted. Updated the _comment to note this.

## P2 — examples/humanizer-skill-schedule/output-schedule.json:59

critical_path_complexity_sum was 35, but recomputing against the listed critical_path gives 33:

T1(1) + T3(1) + T4(3) + T7(7) + T8(5) + T13(5) + T14(3) + T17(2) + T18(2) + T19(2) + T20(1) + T21(1) = 33

Fixed to 33. Also updated README.md, which mentions the number twice ("complexity sum 35" → "complexity sum 33"), so human readers and machine readers agree.