v3.2: Hardened evals + GOLD example (humanizer-skill schedule) by Charpup · Pull Request #2 · Charpup/openclaw-task-workflow

Charpup · 2026-04-18T07:32:23Z

Summary

Round-2 standardization pass. Additive; no breaking changes.

Changes

New: `examples/humanizer-skill-schedule/` (GOLD)

Real completed run from the humanizer-skill build (merged). 21 tasks across 9 author-defined batches, which the scheduler reshapes into 12 dependency-aware batches. 4 files:

`README.md` — explains GOLD labeling and characteristic lessons
`input-task_plan.md` — verbatim snapshot of the real plan
`output-schedule.json` — full DAG output (tasks_extracted + batches + per-batch metadata + summary)
`output-handoff-snippet.json` — how the same schedule lives inside triadev-handoff.json

Demonstrates:

Fan-out: T7 (merge 32 patterns, complexity 7) unlocks 5 downstream references
Fan-in: T13 (SKILL.md draft) requires all 5 reference files
Within-batch complexity ordering: T1(1), T2(1), T16(1), T6(4) in batch 1
Critical path length 12, total complexity sum 35, max parallelism 5
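
The batching behaviour listed above (dependency-aware batches, ascending complexity inside a batch, fan-out and fan-in) can be sketched as a layered topological sort. This is a minimal illustration, not the scheduler's actual implementation: the function name `schedule_batches`, the dict layout, and the toy tasks `A` to `F` are all assumptions made for the example. Only the `complexity` and `dependencies` field names come from the PR.

```python
from typing import Dict, List


def schedule_batches(tasks: Dict[str, dict]) -> List[List[str]]:
    """Group tasks into batches; a task joins the first batch after all its dependencies."""
    done: set = set()
    remaining = dict(tasks)
    batches: List[List[str]] = []
    while remaining:
        # A task is ready once every dependency has landed in an earlier batch.
        ready = [
            tid for tid, task in remaining.items()
            if all(dep in done for dep in task["dependencies"])
        ]
        if not ready:
            raise ValueError("cycle or missing dependency among: " + ", ".join(sorted(remaining)))
        # Ascending complexity; sort is stable, so ties keep input order.
        ready.sort(key=lambda tid: tasks[tid]["complexity"])
        batches.append(ready)
        done.update(ready)
        for tid in ready:
            del remaining[tid]
    return batches


# Hypothetical mini-DAG (not the humanizer-skill plan): C fans out to D and E, F fans in.
toy = {
    "A": {"complexity": 4, "dependencies": []},
    "B": {"complexity": 1, "dependencies": []},
    "C": {"complexity": 7, "dependencies": ["A", "B"]},
    "D": {"complexity": 2, "dependencies": ["C"]},
    "E": {"complexity": 1, "dependencies": ["C"]},
    "F": {"complexity": 5, "dependencies": ["D", "E"]},
}

batches = schedule_batches(toy)
max_parallelism = max(len(batch) for batch in batches)
print(batches)          # [['B', 'A'], ['C'], ['E', 'D'], ['F']]
print(max_parallelism)  # 2
```

In this toy case the number of batches equals the length of the longest dependency chain, which is consistent with the GOLD example reporting 12 batches and a critical path of 12 tasks.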

Hardened: `evals/evals.json` (4 → 8 cases)

New test cases:

batch-ordering-complexity-01: verifies within-batch tasks ordered ascending by complexity
cross-session-persistence-01: CST 00:00 migration behavior — completed stays, running+pending migrate
standalone-mode-01: works without triadev-handoff.json; reads task_plan.md directly
dynamic-insertion-01: mid-execution insertion re-schedules only non-completed tasks
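
As a rough illustration of the behaviour that cross-session-persistence-01 checks, the sketch below splits tasks at the rollover: completed tasks stay where they are, while running and pending tasks are carried over with a "migrated" status. The function name `migrate_at_rollover`, the `status` field, and the sample tasks are hypothetical; the real skill may store this state differently.

```python
from typing import Dict, List, Tuple

Task = Dict[str, str]


def migrate_at_rollover(tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
    """Return (tasks that stay in the old session, tasks carried into the new one)."""
    stay: List[Task] = []
    carried: List[Task] = []
    for task in tasks:
        if task["status"] in ("running", "pending"):
            # Unfinished work moves forward and is marked as migrated.
            carried.append({**task, "status": "migrated"})
        else:
            # Completed tasks (and, by assumption, any other status) are left untouched.
            stay.append(dict(task))
    return stay, carried


# Hypothetical session state, for illustration only.
session = [
    {"id": "A", "status": "completed"},
    {"id": "B", "status": "running"},
    {"id": "C", "status": "pending"},
]
stay, carried = migrate_at_rollover(session)
print([t["id"] for t in stay])     # ['A']
print([t["id"] for t in carried])  # ['B', 'C']
```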

Assertion types shifted toward determinism. This release introduces `sequence_order` (for batch ordering) and `json_path_length`. About 80% of assertions are now deterministic, versus about 50% in v3.1.
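
To make the two new assertion types concrete, here is one plausible way such checks could be evaluated. The PR does not show the eval schema, so the function signatures, the dotted-path syntax, and the shape of the `output` document are assumptions for illustration; only the assertion names and the batch-1 task order (T1, T2, T16, T6) come from the description above.

```python
from typing import Any, Sequence


def resolve_path(doc: Any, path: str) -> Any:
    """Follow a dotted path such as 'batches.0.tasks'; parts index lists when the node is a list."""
    node = doc
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def json_path_length(doc: Any, path: str, expected: int) -> bool:
    """Deterministic check: the value at `path` has exactly `expected` elements."""
    return len(resolve_path(doc, path)) == expected


def sequence_order(actual: Sequence[str], expected_order: Sequence[str]) -> bool:
    """Deterministic check: every expected item occurs in `actual`, in the given relative order."""
    if any(item not in actual for item in expected_order):
        return False
    positions = [actual.index(item) for item in expected_order]
    return positions == sorted(positions)


# Assumed output shape; the task order mirrors batch 1 of the GOLD example.
output = {"batches": [{"tasks": ["T1", "T2", "T16", "T6"]}]}
batch_one = resolve_path(output, "batches.0.tasks")

print(json_path_length(output, "batches", 1))  # True
print(sequence_order(batch_one, ["T1", "T6"]))  # True
print(sequence_order(batch_one, ["T6", "T1"]))  # False
```

Checks like these return the same result on every run, which is the property the PR contrasts with `llm_judge` assertions.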

Rewritten

`README.md` is rewritten to align with sister-skill style. It adds a What's New in v3.2 table, a Working Example callout, and a changelog entry.

Test plan

Run all 8 eval cases; confirm expected outputs
Smoke test on the humanizer-skill input task_plan.md; verify output matches `examples/humanizer-skill-schedule/output-schedule.json`
Verify batch ordering: tasks with identical dependencies sort ascending by complexity

Part of Round-2 wave

Companion PRs:

v3.1: Round-2 standardization — contracts/, phase-transitions, examples, hardened evals triadev#2 (v3.1 structural refinement)
v2.0: Sidecar JSON + Rubber-Stamp Guards + Devil's Advocate (4 → 18 files) value-first-gate#2 (v2.0 sidecar + rubber-stamp guards)
Charpup/openclaw-tdd-sdd-skill (v3.2 evals hardening — PR to follow)

🤖 Generated with Claude Code

Round-2 standardization pass. All additions additive; no breaking changes.

## Changes

### examples/humanizer-skill-schedule/ (new)

Real completed run from the humanizer-skill build (PR blader/humanizer#94 merged). 21 tasks across 9 author-defined batches, which the scheduler reshapes into 12 dependency-aware batches. 4 files:

- README.md: explains GOLD labeling and characteristic lessons
- input-task_plan.md: verbatim snapshot of the real plan
- output-schedule.json: full DAG output with tasks_extracted, batches, batch_metadata (parallelism + max complexity per batch + notes), summary (critical path, complexity sum, max parallelism)
- output-handoff-snippet.json: how the same schedule lives inside triadev-handoff.json (excerpt, planning + scheduling sections only)

Demonstrates fan-out (T7 → 5 downstream refs), fan-in (T13 ← 5 refs), within-batch complexity ordering (T1=1, T2=1, T16=1, T6=4 in batch 1).

### evals/evals.json (4 → 8 cases)

New coverage:

- batch-ordering-complexity-01: verifies within-batch tasks are ordered ascending by complexity (no-dep tasks only)
- cross-session-persistence-01: CST 00:00 migration; completed tasks stay, running + pending get "migrated" status and carried over
- standalone-mode-01: task-workflow works without triadev-handoff.json, reading task_plan.md directly; must NOT create triadev-handoff.json
- dynamic-insertion-01: mid-execution insertion of a new task; only remaining (non-completed) tasks get re-scheduled

Assertion types shifted:

- Added sequence_order type for batch ordering verification
- Added json_path_length for assertion on batches array size
- ~80% of assertions now deterministic (file_exists, json_path_*, sequence_order, contains_all/any, does_not_contain) vs ~20% llm_judge (was 50/50 in v3.1)

### README.md

Full rewrite aligning with sister-skill docs. Adds What's New in v3.2 table, Working Example callout, changelog v3.2 entry.

## Verification

- 3 new JSON files parse cleanly
- Example schedule consistent with SKILL.md's batch-ordering rules

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c6b0235e1


chatgpt-codex-connector · 2026-04-18T07:36:08Z

+      {"id": "T7",  "name": "Build the merged 32-pattern taxonomy",       "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
+      {"id": "T13", "name": "SKILL.md main body",                          "complexity": 5, "dependencies": ["T8", "T9", "T10", "T11", "T12"]}


Keep handoff snippet DAG self-consistent

This tasks_extracted sample is internally invalid: T7 and T13 depend on IDs (T4/T5/T6, T8–T12) that are not present in the same array, while scheduling.batches also references many omitted IDs. Because the scheduler requires every dependency ID to exist, this GOLD example cannot be replayed as a valid handoff fixture and can fail validation for users who copy it.
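
The self-consistency problem described here can be checked mechanically. The sketch below is a hypothetical validator, not the scheduler's own code: `find_missing_dependencies` is an illustrative name, and the input contains only the two tasks quoted in the diff above, with the `name` field omitted.

```python
from typing import Dict, List


def find_missing_dependencies(tasks: List[dict]) -> Dict[str, List[str]]:
    """Map each task ID to the dependency IDs that are absent from the same array."""
    known = {task["id"] for task in tasks}
    missing: Dict[str, List[str]] = {}
    for task in tasks:
        absent = [dep for dep in task["dependencies"] if dep not in known]
        if absent:
            missing[task["id"]] = absent
    return missing


# The two entries quoted in the review, taken on their own.
tasks_extracted = [
    {"id": "T7", "complexity": 7, "dependencies": ["T4", "T5", "T6"]},
    {"id": "T13", "complexity": 5, "dependencies": ["T8", "T9", "T10", "T11", "T12"]},
]

missing = find_missing_dependencies(tasks_extracted)
print(missing)
# {'T7': ['T4', 'T5', 'T6'], 'T13': ['T8', 'T9', 'T10', 'T11', 'T12']}
```

An empty result would indicate that every dependency ID resolves within the array, which is the condition the review asks the snippet to satisfy.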


chatgpt-codex-connector · 2026-04-18T07:36:08Z

+    "total_tasks": 21,
+    "total_batches": 12,
+    "critical_path": ["T1", "T3", "T4", "T7", "T8", "T13", "T14", "T17", "T18", "T19", "T20", "T21"],
+    "critical_path_complexity_sum": 35,


Recompute reported critical-path complexity sum

The critical_path_complexity_sum is inconsistent with the provided data: summing complexities for the listed critical_path (T1,T3,T4,T7,T8,T13,T14,T17,T18,T19,T20,T21) yields 33, not 35. This makes the GOLD reference metrics inaccurate and can mislead downstream validation or documentation that trusts this summary field.
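
The recomputation is simple enough to reproduce. The snippet below uses the critical path from the diff above and the per-task complexities as they are listed in the follow-up commit message in this thread; the variable names are illustrative.

```python
# Critical path as reported in output-schedule.json.
critical_path = ["T1", "T3", "T4", "T7", "T8", "T13", "T14", "T17", "T18", "T19", "T20", "T21"]

# Per-task complexities as listed in the follow-up commit message.
complexity = {
    "T1": 1, "T3": 1, "T4": 3, "T7": 7, "T8": 5, "T13": 5,
    "T14": 3, "T17": 2, "T18": 2, "T19": 2, "T20": 1, "T21": 1,
}

critical_path_complexity_sum = sum(complexity[task_id] for task_id in critical_path)
print(len(critical_path))            # 12
print(critical_path_complexity_sum)  # 33
```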


…lexity sum (33, not 35)

Addresses Codex review on PR #2.

## P2: examples/humanizer-skill-schedule/output-handoff-snippet.json

Previously the snippet's planning.tasks_extracted had only 5 tasks (T1, T2, T3, T7, T13) while scheduling.batches and task.dependencies referenced IDs T4, T5, T6, T8–T12, T14–T21. That made the excerpt internally invalid as a handoff fixture: a scheduler consuming this snippet would fail on missing dependency IDs.

Fix: inlined the full 21-task tasks_extracted array. The snippet is now self-consistent: every ID in dependencies and batches exists in tasks_extracted. Updated the _comment to note this.

## P2: examples/humanizer-skill-schedule/output-schedule.json:59

critical_path_complexity_sum was 35 but recomputing against the listed critical_path gives 33:

T1(1) + T3(1) + T4(3) + T7(7) + T8(5) + T13(5) + T14(3) + T17(2) + T18(2) + T19(2) + T20(1) + T21(1) = 33

Fixed to 33. Also updated README.md which mentions the number twice ("complexity sum 35" → "complexity sum 33") so human readers and machine readers agree.

Charpup mentioned this pull request Apr 18, 2026

v3.2: Hardened evals — brownfield init, coverage gate, state-machine enforcement Charpup/openclaw-tdd-sdd-skill#3

Merged

3 tasks

chatgpt-codex-connector bot reviewed Apr 18, 2026


Charpup merged commit 56c8bec into main Apr 18, 2026

Charpup deleted the feat/v3.2-evals-and-examples branch April 18, 2026 07:55

Charpup merged 2 commits into main from feat/v3.2-evals-and-examples
