Feature/runner backed benchmark eval by mohanagy · Pull Request #22 · mohanagy/graphify-ts

mohanagy · 2026-04-28T18:41:25Z

Summary

Testing

npm run test:run
npm run typecheck
npm run build
npm pack --dry-run (if packaging or install behavior changed)

Checklist

I updated docs for any user-visible change
I added or updated tests when behavior changed
I did not commit secrets, private corpora, or accidental generated artifacts
I kept this PR focused on a single change or tightly related set of changes

Related issues

Summary by CodeRabbit

Release Notes

New Features
- The benchmark and eval commands now support an --exec flag for runner-backed prompt execution
- Added a --yes flag for non-interactive execution of benchmark and eval operations
- Token usage reporting is now available for benchmark and eval results
Documentation
- Updated examples and guidance to reflect runner-backed execution via --exec and --yes flags
- CLI help text revised to document new command options and runner dependencies
Tests
- Expanded test coverage for runner-backed execution paths and token usage parsing
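
The `--yes` gating described in these notes can be pictured with a small sketch. This is not the PR's `confirmPaidCommand`; the names `decidePaidGate`, `assertPaidGate`, and the error message are hypothetical, and the interactive "prompt" branch is an assumption about how a TTY session might be handled. Only the rule that a non-interactive run without `--yes` fails with a usage error (exit code 2) comes from the review walkthrough.

```typescript
// Hypothetical sketch of paid-command gating; not the code merged in this PR.
type PaidGateDecision = "proceed" | "prompt" | "reject";

// Pure decision: --yes always proceeds; otherwise an interactive session
// may be asked to confirm (assumption), and a non-interactive one is rejected.
function decidePaidGate(yes: boolean, isInteractive: boolean): PaidGateDecision {
  if (yes) return "proceed";
  return isInteractive ? "prompt" : "reject";
}

// Usage errors carry exit code 2, matching the sequence diagram in the review.
class UsageError extends Error {
  readonly exitCode = 2;
}

function assertPaidGate(command: string, yes: boolean, isInteractive: boolean): PaidGateDecision {
  const decision = decidePaidGate(yes, isInteractive);
  if (decision === "reject") {
    // The wording of this message is illustrative only.
    throw new UsageError(`${command} invokes a paid runner; pass --yes to run non-interactively`);
  }
  return decision;
}
```

Keeping the decision in a pure function makes the CI case (no TTY, no `--yes`) easy to unit test without spawning a process.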

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-04-28T18:41:37Z

Walkthrough

This PR migrates benchmark and eval commands from offline-only graph measurements to runner-backed executions that invoke configured model runners via a new --exec flag. Documentation, CLI parsing, core benchmark infrastructure, and tests are updated to support this execution model with enriched token usage reporting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Examples: `CHANGELOG.md`, `README.md`, `docs/proof-workflows.md`, `examples/demo-repo/README.md`, `examples/quick-benchmark.sh`, `examples/why-graphify.md` | Updated to document runner-backed execution via `--exec` flag and `--yes` for non-interactive runs; revised descriptions to emphasize shared runner surface between `benchmark`, `eval`, and `compare`; updated example commands and CI guidance accordingly. |
| CLI & Parser: `src/cli/main.ts`, `src/cli/parser.ts` | Added `BenchmarkCommandContext` and `EvalCommandContext` types; refactored `runBenchmark` to accept context object; added `runEval` dependency with graph/question loading and quality evaluation; implemented `confirmPaidCommand` gating logic enforcing `--yes` in non-interactive mode; updated parsers to require and parse `--exec TEMPLATE` and `--yes` flags; revised help text to document paid runner behavior. |
| Benchmark Infrastructure: `src/infrastructure/benchmark.ts`, `src/infrastructure/benchmark/quality.ts`, `src/infrastructure/benchmark/questions.ts` | Extended `runBenchmark` with optional async runner-backed path returning `Promise<BenchmarkResult>` with execution metadata; added `BenchmarkRunOptions` and updated `BenchmarkSuccessResult` to include token fields; refactored `evaluateRetrievalQuality` with overloads for sync retrieval-only and async runner-backed modes; enriched question results with runtime metadata (tokens, usage, answer text, elapsed time, artifacts). |
| New Runner & Parsing Modules: `src/infrastructure/benchmark/runner.ts`, `src/infrastructure/benchmark/usage.ts`, `src/infrastructure/prompt-runner.ts` | Created new modules for end-to-end prompt execution, token accounting, and usage reporting; `runner.ts` orchestrates prompt execution with artifact capture and default subprocess spawning; `usage.ts` provides helpers for token metrics and provider-aware labeling; `prompt-runner.ts` defines unified stdout parsing supporting Claude/Gemini structured JSON formats with fallback to plain text. |
| Compare Integration: `src/infrastructure/compare.ts` | Unified token usage typing by replacing local `ComparePromptUsage` with `PromptRunnerUsage` alias; delegated stdout parsing to shared `parsePromptRunnerOutput` from `prompt-runner.ts` module. |
| Unit Tests: `tests/unit/benchmark-quality.test.ts`, `tests/unit/benchmark.test.ts`, `tests/unit/cli.test.ts`, `tests/unit/compare.test.ts` | Added async test support in `withTempDir` helper; added comprehensive runner-backed benchmark and quality tests validating prompt execution, artifact generation, and token aggregation; refactored CLI tests for context-based dependency injection; added direct coverage for structured output parsing across Claude, Gemini, and fallback formats. |
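
The "structured JSON with fallback to plain text" parsing mentioned for `prompt-runner.ts` might look roughly like the sketch below. It is not the PR's `parsePromptRunnerOutput`: the function name `parseRunnerStdoutSketch`, the result shape, and the JSON field names (`result`, `response`, `usage.input_tokens`, `usage.output_tokens`) are all assumptions chosen for illustration, and real runner output may differ.

```typescript
// Hypothetical sketch of structured-or-plain stdout parsing; field names are assumed.
interface UsageSketch {
  inputTokens?: number;
  outputTokens?: number;
}

interface ParsedStdoutSketch {
  answer: string;
  usage: UsageSketch | null;
  structured: boolean; // true only when a JSON envelope with an answer was found
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function readNumber(record: Record<string, unknown>, key: string): number | undefined {
  const value = record[key];
  return typeof value === "number" && isFinite(value) ? value : undefined;
}

function parseRunnerStdoutSketch(stdout: string): ParsedStdoutSketch {
  const text = stdout.trim();
  const plain: ParsedStdoutSketch = { answer: text, usage: null, structured: false };

  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return plain; // not JSON: treat the whole output as the answer
  }
  if (!isRecord(parsed)) return plain;

  // Assumed answer fields; fall back to plain text when neither is a string.
  const result = parsed["result"];
  const response = parsed["response"];
  const answer = typeof result === "string" ? result : typeof response === "string" ? response : null;
  if (answer === null) return plain;

  const usageRecord = parsed["usage"];
  const usage = isRecord(usageRecord)
    ? {
        inputTokens: readNumber(usageRecord, "input_tokens"),
        outputTokens: readNumber(usageRecord, "output_tokens"),
      }
    : null;
  return { answer, usage, structured: true };
}
```

Sharing one parser of this kind between `benchmark`, `eval`, and `compare` means a runner that emits no usage data still yields an answer, just without token counts.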

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI
    participant BenchmarkRunner as Benchmark/Quality
    participant ExecutionRunner as Execution Runner
    participant Model as Model API
    
    User->>CLI: benchmark --exec "..." --yes
    CLI->>BenchmarkRunner: confirmPaidCommand (--yes flag)
    alt --yes not provided
        BenchmarkRunner-->>CLI: throw UsageError
        CLI-->>User: exit code 2
    end
    
    BenchmarkRunner->>BenchmarkRunner: Load graph & questions
    BenchmarkRunner->>BenchmarkRunner: For each labeled question
    
    loop Per Question Execution
        BenchmarkRunner->>ExecutionRunner: runBenchmarkPrompt (execTemplate)
        ExecutionRunner->>ExecutionRunner: Validate template, build prompt
        ExecutionRunner->>ExecutionRunner: Write artifacts (prompt.txt)
        ExecutionRunner->>Model: Execute via configured runner
        Model-->>ExecutionRunner: stdout (structured or plain)
        ExecutionRunner->>ExecutionRunner: parsePromptRunnerOutput
        ExecutionRunner-->>BenchmarkRunner: BenchmarkPromptRun (answer, usage, tokens, timing)
    end
    
    BenchmarkRunner->>BenchmarkRunner: Aggregate results & compute averages
    BenchmarkRunner->>BenchmarkRunner: Build QualityReport with usage metadata
    BenchmarkRunner-->>CLI: QualityReport
    CLI-->>User: Formatted output with token summaries
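
The per-question loop and aggregation step in the diagram above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `PromptRunSketch`, `runQuestionsSketch`, and `summarizeRuns` are hypothetical names, the runner is injected as a plain async function, and averaging only over runs that reported usage is a design choice made here rather than one confirmed by the source.

```typescript
// Hypothetical sketch of "run each question, then aggregate"; all names are illustrative.
interface PromptRunSketch {
  answer: string;
  totalTokens: number | null; // null when the runner reported no usage
  elapsedMs: number;
}

type PromptRunnerFn = (question: string) => Promise<PromptRunSketch>;

interface RunSummarySketch {
  totalTokens: number;
  questionsWithUsage: number;
  averageTokens: number | null; // null when no run reported usage
  averageElapsedMs: number | null; // null when there were no runs
}

function summarizeRuns(runs: readonly PromptRunSketch[]): RunSummarySketch {
  let totalTokens = 0;
  let questionsWithUsage = 0;
  let totalElapsedMs = 0;
  for (const run of runs) {
    totalElapsedMs += run.elapsedMs;
    if (run.totalTokens !== null) {
      totalTokens += run.totalTokens;
      questionsWithUsage += 1;
    }
  }
  return {
    totalTokens,
    questionsWithUsage,
    // Average over runs with usage only, so missing usage does not drag the mean down.
    averageTokens: questionsWithUsage > 0 ? totalTokens / questionsWithUsage : null,
    averageElapsedMs: runs.length > 0 ? totalElapsedMs / runs.length : null,
  };
}

async function runQuestionsSketch(
  questions: readonly string[],
  runPrompt: PromptRunnerFn,
): Promise<{ runs: PromptRunSketch[]; summary: RunSummarySketch }> {
  const runs: PromptRunSketch[] = [];
  for (const question of questions) {
    // Sequential execution keeps ordering and per-question timing simple.
    runs.push(await runPrompt(question));
  }
  return { runs, summary: summarizeRuns(runs) };
}
```

Injecting the runner as a function mirrors the dependency-injection style the test changes describe, and lets tests substitute a stub for a paid model call.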

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

feat: add reproducible demo proof kit #10 — Overlaps in CLI parsing and benchmark/quality infrastructure modifications, sharing refactored benchmark command context and quality evaluation paths.
Feature/workspace parity low cohesion baseline #7 — Related through shared modifications to benchmark infrastructure (src/infrastructure/benchmark.* and question/result type shapes) and overlapping execution path logic.

Poem

🐰 Hop along, the runners come alive,
No more offline estimates to survive!
Claude and Gemini now join the test,
With tokens counted and usage blessed,
Graph quality proven with earned tokens to back,
Benchmarks now run on their measured track! 🏃‍♂️✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description is incomplete: the Summary section contains only a comment placeholder with no actual summary of changes and their rationale. | Add a substantive summary explaining what changed (runner-backed benchmark/eval execution), why the change was made, and the key behavioral impacts for users and developers. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Feature/runner backed benchmark eval' is directly related to the main changes, which add runner-backed execution to benchmark and eval commands. It clearly identifies the primary feature being added. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

mohanagy and others added 11 commits April 28, 2026 20:19

test: cover shared prompt runner parsing

1b29948

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

refactor: share prompt runner parsing

7cd283a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: codify benchmark eval cli contracts

387bf90

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: require exec for benchmark and eval

bb28082

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: thread benchmark eval cli contexts

430d168

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: thread benchmark eval cli contexts

fe83629

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: cover runner-backed benchmark execution

87724dd

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add runner-backed benchmark usage

4efc687

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: cover runner-backed eval usage

737ff8f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add runner-backed eval usage

c5b4b76

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

docs: clarify runner-backed benchmark eval

94c5be4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: align eval CI runner contract

903070f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mohanagy merged commit 57bf7f0 into main Apr 28, 2026
6 checks passed

mohanagy merged 12 commits into main from feature/runner-backed-benchmark-eval
