Feature/runner backed benchmark eval #22

Merged
mohanagy merged 12 commits into main from feature/runner-backed-benchmark-eval
Apr 28, 2026

@mohanagy (Owner) commented Apr 28, 2026

Summary

Testing

  • npm run test:run
  • npm run typecheck
  • npm run build
  • npm pack --dry-run (if packaging or install behavior changed)

Checklist

  • I updated docs for any user-visible change
  • I added or updated tests when behavior changed
  • I did not commit secrets, private corpora, or accidental generated artifacts
  • I kept this PR focused on a single change or tightly related set of changes

Related issues

Summary by CodeRabbit

Release Notes

  • New Features

    • benchmark and eval commands now support an --exec flag for runner-backed prompt execution
    • Added a --yes flag for non-interactive execution of benchmark and eval operations
    • Token usage reporting is now available for benchmark and eval results
  • Documentation

    • Updated examples and guidance to reflect runner-backed execution via --exec and --yes flags
    • CLI help text revised to document new command options and runner dependencies
  • Tests

    • Expanded test coverage for runner-backed execution paths and token usage parsing

mohanagy and others added 11 commits April 28, 2026 20:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

This PR migrates benchmark and eval commands from offline-only graph measurements to runner-backed executions that invoke configured model runners via a new --exec flag. Documentation, CLI parsing, core benchmark infrastructure, and tests are updated to support this execution model with enriched token usage reporting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Examples: CHANGELOG.md, README.md, docs/proof-workflows.md, examples/demo-repo/README.md, examples/quick-benchmark.sh, examples/why-graphify.md | Updated to document runner-backed execution via --exec flag and --yes for non-interactive runs; revised descriptions to emphasize shared runner surface between benchmark, eval, and compare; updated example commands and CI guidance accordingly. |
| CLI & Parser: src/cli/main.ts, src/cli/parser.ts | Added BenchmarkCommandContext and EvalCommandContext types; refactored runBenchmark to accept context object; added runEval dependency with graph/question loading and quality evaluation; implemented confirmPaidCommand gating logic enforcing --yes in non-interactive mode; updated parsers to require and parse --exec TEMPLATE and --yes flags; revised help text to document paid runner behavior. |
| Benchmark Infrastructure: src/infrastructure/benchmark.ts, src/infrastructure/benchmark/quality.ts, src/infrastructure/benchmark/questions.ts | Extended runBenchmark with optional async runner-backed path returning Promise<BenchmarkResult> with execution metadata; added BenchmarkRunOptions and updated BenchmarkSuccessResult to include token fields; refactored evaluateRetrievalQuality with overloads for sync retrieval-only and async runner-backed modes; enriched question results with runtime metadata (tokens, usage, answer text, elapsed time, artifacts). |
| New Runner & Parsing Modules: src/infrastructure/benchmark/runner.ts, src/infrastructure/benchmark/usage.ts, src/infrastructure/prompt-runner.ts | Created new modules for end-to-end prompt execution, token accounting, and usage reporting; runner.ts orchestrates prompt execution with artifact capture and default subprocess spawning; usage.ts provides helpers for token metrics and provider-aware labeling; prompt-runner.ts defines unified stdout parsing supporting Claude/Gemini structured JSON formats with fallback to plain text. |
| Compare Integration: src/infrastructure/compare.ts | Unified token usage typing by replacing local ComparePromptUsage with PromptRunnerUsage alias; delegated stdout parsing to shared parsePromptRunnerOutput from prompt-runner.ts module. |
| Unit Tests: tests/unit/benchmark-quality.test.ts, tests/unit/benchmark.test.ts, tests/unit/cli.test.ts, tests/unit/compare.test.ts | Added async test support in withTempDir helper; added comprehensive runner-backed benchmark and quality tests validating prompt execution, artifact generation, and token aggregation; refactored CLI tests for context-based dependency injection; added direct coverage for structured output parsing across Claude, Gemini, and fallback formats. |
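
The "structured JSON with fallback to plain text" parsing described for prompt-runner.ts can be pictured roughly as follows. This is only a sketch, not the code from this PR: the function name parseRunnerStdout, the RunnerUsage and ParsedRunnerOutput shapes, and the JSON field names (result, response, usage.input_tokens, usage.output_tokens) are all assumptions chosen for illustration, and the real parsePromptRunnerOutput may recognize different fields.

```typescript
// Hypothetical shapes; the PR's PromptRunnerUsage type may look different.
interface RunnerUsage {
  inputTokens: number | undefined;
  outputTokens: number | undefined;
}

interface ParsedRunnerOutput {
  answer: string;
  usage: RunnerUsage | undefined;
  structured: boolean; // true when stdout was recognized as structured JSON
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function numberOrUndefined(value: unknown): number | undefined {
  return typeof value === "number" && Number.isFinite(value) ? value : undefined;
}

// Plain-text fallback: the whole (trimmed) stdout is the answer, no usage data.
function plainText(text: string): ParsedRunnerOutput {
  return { answer: text, usage: undefined, structured: false };
}

// Try structured JSON first; anything unrecognized is treated as plain text.
function parseRunnerStdout(stdout: string): ParsedRunnerOutput {
  const trimmed = stdout.trim();
  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return plainText(trimmed);
  }
  if (!isRecord(parsed)) {
    return plainText(trimmed);
  }

  // Assumed answer fields: "result" or "response" (illustrative only).
  const rawResult = parsed["result"];
  const rawResponse = parsed["response"];
  const answer =
    typeof rawResult === "string" ? rawResult
    : typeof rawResponse === "string" ? rawResponse
    : undefined;
  if (answer === undefined) {
    return plainText(trimmed);
  }

  // Assumed usage fields: usage.input_tokens / usage.output_tokens.
  const rawUsage = parsed["usage"];
  const usage: RunnerUsage | undefined = isRecord(rawUsage)
    ? {
        inputTokens: numberOrUndefined(rawUsage["input_tokens"]),
        outputTokens: numberOrUndefined(rawUsage["output_tokens"]),
      }
    : undefined;

  return { answer, usage, structured: true };
}
```

Keeping the fallback permissive means a runner that prints only free text still yields an answer; it simply carries no token usage, which downstream reporting would then have to treat as unknown rather than zero.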

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI
    participant BenchmarkRunner as Benchmark/Quality
    participant ExecutionRunner as Execution Runner
    participant Model as Model API
    
    User->>CLI: benchmark --exec "..." --yes
    CLI->>BenchmarkRunner: confirmPaidCommand (--yes flag)
    alt --yes not provided
        BenchmarkRunner-->>CLI: throw UsageError
        CLI-->>User: exit code 2
    end
    
    BenchmarkRunner->>BenchmarkRunner: Load graph & questions
    BenchmarkRunner->>BenchmarkRunner: For each labeled question
    
    loop Per Question Execution
        BenchmarkRunner->>ExecutionRunner: runBenchmarkPrompt (execTemplate)
        ExecutionRunner->>ExecutionRunner: Validate template, build prompt
        ExecutionRunner->>ExecutionRunner: Write artifacts (prompt.txt)
        ExecutionRunner->>Model: Execute via configured runner
        Model-->>ExecutionRunner: stdout (structured or plain)
        ExecutionRunner->>ExecutionRunner: parsePromptRunnerOutput
        ExecutionRunner-->>BenchmarkRunner: BenchmarkPromptRun (answer, usage, tokens, timing)
    end
    
    BenchmarkRunner->>BenchmarkRunner: Aggregate results & compute averages
    BenchmarkRunner->>BenchmarkRunner: Build QualityReport with usage metadata
    BenchmarkRunner-->>CLI: QualityReport
    CLI-->>User: Formatted output with token summaries
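
The flow in the diagram above (a --yes gate, then a per-question loop, then aggregation) might be sketched as below. Every name here is a hypothetical stand-in: confirmPaidRun, PromptRun, PromptRunner, and runQuestions are not the identifiers used in the PR, the UsageError class is redefined locally, and the interactive confirmation prompt is omitted. The sketch also assumes questions run sequentially and that averages skip questions with no token data, neither of which the PR summary confirms.

```typescript
// Local stand-in for the CLI's UsageError; per the diagram, the CLI maps it to exit code 2.
class UsageError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UsageError";
  }
}

// Gate paid, runner-backed commands: in non-interactive mode, --yes is required.
// The caller passes interactivity in (for example, derived from a TTY check).
function confirmPaidRun(opts: { yes: boolean; interactive: boolean }): void {
  if (!opts.yes && !opts.interactive) {
    throw new UsageError("runner-backed commands require --yes in non-interactive mode");
  }
  // An interactive confirmation prompt would go here; omitted in this sketch.
}

// Hypothetical per-question result, loosely modelled on the diagram's BenchmarkPromptRun.
interface PromptRun {
  answer: string;
  totalTokens: number | undefined; // undefined when the runner reported no usage
  elapsedMs: number;
}

type PromptRunner = (question: string) => Promise<PromptRun>;

interface RunSummary {
  runs: PromptRun[];
  totalTokens: number;
  averageTokens: number | undefined;
}

// Execute each question in order, then aggregate token counts.
async function runQuestions(
  questions: readonly string[],
  run: PromptRunner,
): Promise<RunSummary> {
  const runs: PromptRun[] = [];
  for (const question of questions) {
    runs.push(await run(question));
  }
  // Only runs that reported token usage contribute to the total and the average.
  const counted = runs
    .map((r) => r.totalTokens)
    .filter((t): t is number => t !== undefined);
  const totalTokens = counted.reduce((sum, t) => sum + t, 0);
  const averageTokens = counted.length > 0 ? totalTokens / counted.length : undefined;
  return { runs, totalTokens, averageTokens };
}
```

Injecting the runner as a function keeps the loop testable with a fake that returns canned answers, which is in the same spirit as the context-based dependency injection the unit-test summary mentions.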

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 Hop along, the runners come alive,
No more offline estimates to survive!
Claude and Gemini now join the test,
With tokens counted and usage blessed,
Graph quality proven with earned tokens to back,
Benchmarks now run on their measured track! 🏃‍♂️✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description is incomplete: the Summary section contains only a comment placeholder with no actual summary of changes and their rationale. | Add a substantive summary explaining what changed (runner-backed benchmark/eval execution), why the change was made, and the key behavioral impacts for users and developers. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Feature/runner backed benchmark eval' is directly related to the main changes, which add runner-backed execution to benchmark and eval commands. It clearly identifies the primary feature being added. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mohanagy mohanagy merged commit 57bf7f0 into main Apr 28, 2026
6 checks passed