Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📝 Walkthrough
This PR migrates
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant CLI
participant BenchmarkRunner as Benchmark/Quality
participant ExecutionRunner as Execution Runner
participant Model as Model API
User->>CLI: benchmark --exec "..." --yes
CLI->>BenchmarkRunner: confirmPaidCommand (--yes flag)
alt --yes not provided
BenchmarkRunner-->>CLI: throw UsageError
CLI-->>User: exit code 2
end
BenchmarkRunner->>BenchmarkRunner: Load graph & questions
BenchmarkRunner->>BenchmarkRunner: For each labeled question
loop Per Question Execution
BenchmarkRunner->>ExecutionRunner: runBenchmarkPrompt (execTemplate)
ExecutionRunner->>ExecutionRunner: Validate template, build prompt
ExecutionRunner->>ExecutionRunner: Write artifacts (prompt.txt)
ExecutionRunner->>Model: Execute via configured runner
Model-->>ExecutionRunner: stdout (structured or plain)
ExecutionRunner->>ExecutionRunner: parsePromptRunnerOutput
ExecutionRunner-->>BenchmarkRunner: BenchmarkPromptRun (answer, usage, tokens, timing)
end
BenchmarkRunner->>BenchmarkRunner: Aggregate results & compute averages
BenchmarkRunner->>BenchmarkRunner: Build QualityReport with usage metadata
BenchmarkRunner-->>CLI: QualityReport
CLI-->>User: Formatted output with token summaries
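The two decision points in the diagram (the --yes gate and the "structured or plain" stdout handling) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names confirmPaidCommand, UsageError, and parsePromptRunnerOutput come from the diagram, but every signature, the JSON field names (answer, usage), the rule that the gate only applies when --exec is set, and the helper exitCodeFor are assumptions made for the example.

```typescript
// Hypothetical sketch: function names mirror the diagram, bodies are assumed.

class UsageError extends Error {
  // The diagram shows the CLI exiting with code 2 on a usage error.
  readonly exitCode = 2;
  constructor(message: string) {
    super(message);
    this.name = "UsageError";
    // Keeps `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed option shape; the real CLI options may differ.
interface PaidCommandOptions {
  exec?: string;
  yes?: boolean;
}

// Refuse to start a paid run unless the caller opted in with --yes.
// Assumption: the gate only applies when an --exec template is given.
function confirmPaidCommand(opts: PaidCommandOptions): void {
  if (opts.exec !== undefined && !opts.yes) {
    throw new UsageError("--exec triggers paid model calls; pass --yes to confirm");
  }
}

// Assumed result shape; field names are illustrative only.
interface ParsedRunnerOutput {
  answer: string;
  usage?: { inputTokens?: number; outputTokens?: number };
}

// Accept either structured (JSON with a string `answer`) or plain stdout.
function parsePromptRunnerOutput(stdout: string): ParsedRunnerOutput {
  const text = stdout.trim();
  try {
    const parsed: unknown = JSON.parse(text);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { answer?: unknown }).answer === "string"
    ) {
      const { answer, usage } = parsed as ParsedRunnerOutput;
      return { answer, usage };
    }
  } catch {
    // Not JSON: fall through and treat stdout as a plain-text answer.
  }
  return { answer: text };
}

// Hypothetical helper: map a thrown UsageError to a process exit code.
function exitCodeFor(run: () => void): number {
  try {
    run();
    return 0;
  } catch (err) {
    if (err instanceof UsageError) return err.exitCode;
    throw err;
  }
}
```

Under these assumptions, exitCodeFor(() => confirmPaidCommand({ exec: "runner {prompt}" })) yields the exit code from the alt branch of the diagram, while passing yes: true lets the run proceed to the per-question loop.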
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Summary
Testing
npm run test:run
npm run typecheck
npm run build
npm pack --dry-run (if packaging or install behavior changed)
Checklist
Related issues
Summary by CodeRabbit
Release Notes
New Features
benchmark and eval commands now support --exec flag for runner-backed prompt execution
--yes flag for non-interactive execution of benchmark and eval operations
Documentation
--exec and --yes flags
Tests