feat: support Gemini compare usage capture #19

Merged
mohanagy merged 18 commits into main from feature/gemini-compare-usage
Apr 27, 2026
@mohanagy mohanagy commented Apr 27, 2026

Summary

  • port the compare usage baseline needed for Gemini work and preserve safe answer-artifact fallback behavior
  • capture Gemini provider-reported usage from structured JSON, including multipart answer assembly and strict fallback behavior
  • document the correct Gemini compare invocation and clarify when compare reports real usage vs labeled estimates
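The Gemini capture described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it assumes the runner's stdout follows the Gemini `generateContent` response shape (`candidates[0].content.parts[].text` plus `usageMetadata.promptTokenCount` and `usageMetadata.totalTokenCount`), and `parseGeminiStdout`, `ParsedGeminiOutput`, and the helper guards are hypothetical names.

```typescript
// Hypothetical result shape; the PR's real interfaces may differ.
interface ParsedGeminiOutput {
  answer: string;
  usage: { inputTokens: number; totalTokens: number } | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isTokenCount(value: unknown): value is number {
  return typeof value === "number" && Number.isInteger(value) && value >= 0;
}

function parseGeminiStdout(stdout: string): ParsedGeminiOutput {
  // Strict fallback: anything that is not the expected JSON is plain answer text.
  const plain: ParsedGeminiOutput = { answer: stdout.trim(), usage: null };

  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    return plain;
  }
  if (!isRecord(parsed)) return plain;

  const candidates = parsed.candidates;
  const first: unknown = Array.isArray(candidates) ? candidates[0] : undefined;
  const content: unknown = isRecord(first) ? first.content : undefined;
  const parts: unknown = isRecord(content) ? content.parts : undefined;
  if (!Array.isArray(parts)) return plain;

  // Multipart assembly: concatenate every text part in order.
  const texts: string[] = [];
  for (const part of parts) {
    const text: unknown = isRecord(part) ? part.text : undefined;
    if (typeof text === "string") texts.push(text);
  }
  const answer = texts.join("");
  if (answer.trim() === "") return plain;

  // Usage is only reported when both counts are well-formed integers.
  const rawMeta = parsed.usageMetadata;
  const meta: Record<string, unknown> = isRecord(rawMeta) ? rawMeta : {};
  const input = meta.promptTokenCount;
  const total = meta.totalTokenCount;
  const usage =
    isTokenCount(input) && isTokenCount(total)
      ? { inputTokens: input, totalTokens: total }
      : null;

  return { answer, usage };
}
```

When `usage` comes back as `null`, a caller would be expected to fall back to a labeled local estimate rather than present a guessed number as provider-reported.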

Test Plan

  • npm run typecheck
  • npm run test:run
  • npm run build

Summary by CodeRabbit

  • New Features

    • Enhanced retrieval ranking with relation-aware expansion for improved context matching accuracy
    • Improved token usage tracking in compare mode with real provider-reported usage from Gemini and Claude, falling back to local estimates when unavailable
  • Documentation

    • Added Gemini compare command examples and clarified token reporting semantics
    • Updated guidance on how provider usage is captured and reported in results
  • Tests

    • Added comprehensive test coverage for token handling and retrieval ranking behavior

mohanagy and others added 18 commits April 27, 2026 13:13
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@coderabbitai
coderabbitai Bot commented Apr 27, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

The PR enhances the retrieval ranking system with relation-aware expansion and evidence-based scoring, implements structured parsing of provider-reported token usage (Gemini and Claude) in the compare workflow, and extends retrieval benchmarks with new gold-standard questions and stricter evaluation metrics. Documentation is updated to reflect these changes and clarify token reporting semantics.
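As a rough illustration of what relation-aware expansion can look like, the sketch below propagates seed scores across incident edges, scaling each hop by a per-relation weight. It is not taken from `src/runtime/retrieve.ts`: the `Edge` shape, the relation names, the weight values, and `expandScores` are all assumptions chosen for the example, and the evidence breakdown and relevance bands are left out.

```typescript
interface Edge {
  from: string;
  to: string;
  relation: string;
}

// Hypothetical weights: stronger relations pass on more of the seed score.
const RELATION_WEIGHTS: Record<string, number> = { calls: 0.8, imports: 0.5 };
const DEFAULT_WEIGHT = 0.25;

function expandScores(
  seeds: Map<string, number>,
  edges: Edge[],
  maxHops: number,
): Map<string, number> {
  const scores = new Map(seeds);
  let frontier = new Map(seeds);

  for (let hop = 0; hop < maxHops && frontier.size > 0; hop++) {
    const next = new Map<string, number>();
    for (const edge of edges) {
      const weight = RELATION_WEIGHTS[edge.relation] ?? DEFAULT_WEIGHT;
      // Incident neighbors: follow each edge in both directions.
      for (const [src, dst] of [[edge.from, edge.to], [edge.to, edge.from]]) {
        const base = frontier.get(src);
        if (base === undefined) continue;
        const candidate = base * weight;
        // Keep only the best score seen for a node; improved nodes
        // form the frontier for the next hop.
        if (candidate > (scores.get(dst) ?? 0)) {
          scores.set(dst, candidate);
          next.set(dst, candidate);
        }
      }
    }
    frontier = next;
  }
  return scores;
}
```

Because every weight here is below 1, scores decay with distance and a seed is never overtaken by a score that loops back to it.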

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Changelog: CHANGELOG.md, README.md, docs/proof-workflows.md, examples/why-graphify.md | Added changelog entries for retrieval ranking improvements and evaluation guardrails. Updated README and proof documentation to describe Gemini/Claude compare runner patterns, token usage capture behavior (provider-reported vs. local estimates), and report.json inclusion of usageMetadata or fallback cl100k_base estimates. |
| Benchmark Quality: src/infrastructure/benchmark/quality.ts, tests/unit/benchmark-quality.test.ts | Extended GOLD_QUESTIONS constant with 3 new retrieval benchmark entries targeting retrieveContext and scoreNode labels. Added unit tests validating MRR scoring and recall metrics with tight result limits. |
| Compare Execution & Parsing: src/infrastructure/compare.ts, tests/unit/compare.test.ts | Implemented structured stdout JSON parsing to extract answer text and provider-reported token usage (Claude usage and Gemini usageMetadata). Added ComparePromptTokenSource type and extended ComparePromptUsage/ComparePromptReport interfaces with usage metadata, token reduction ratios, and source labeling. Extensive unit tests validate Claude/Gemini parsing, fallback behavior, and summary reporting. |
| Retrieval Ranking Algorithm: src/runtime/retrieve.ts, tests/unit/retrieve.test.ts | Refactored seed scoring and expansion from boost model to explicit evidence breakdown with exact-label matching, TF-IDF token overlap, source-path similarity, and community-label similarity. Implemented relation-aware multi-hop expansion that propagates weighted scores across incident neighbors and upgrades relevance bands. Updated node selection comparators and added comprehensive graph-building and traversal tests. |
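
For readers unfamiliar with the benchmark metrics named in the Benchmark Quality row, here is a small sketch of mean reciprocal rank (MRR) and recall under a tight result limit. The function names and the `BenchmarkCase` shape are hypothetical and are not the API of `src/infrastructure/benchmark/quality.ts`; only the metric definitions themselves are standard.

```typescript
interface BenchmarkCase {
  ranked: string[];   // labels returned by retrieval, best first
  expected: string[]; // gold labels for the question
}

// Reciprocal rank of the first expected label within the top `limit` results.
function reciprocalRank(testCase: BenchmarkCase, limit: number): number {
  const top = testCase.ranked.slice(0, limit);
  const index = top.findIndex((label) => testCase.expected.indexOf(label) !== -1);
  return index === -1 ? 0 : 1 / (index + 1);
}

// Mean of the per-question reciprocal ranks; 0 for an empty benchmark.
function meanReciprocalRank(cases: BenchmarkCase[], limit: number): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => sum + reciprocalRank(c, limit), 0);
  return total / cases.length;
}

// Fraction of expected labels that appear in the top `k` results.
function recallAtK(testCase: BenchmarkCase, k: number): number {
  if (testCase.expected.length === 0) return 0;
  const top = testCase.ranked.slice(0, k);
  const hits = testCase.expected.filter((label) => top.indexOf(label) !== -1);
  return hits.length / testCase.expected.length;
}
```

A tight limit makes both metrics stricter: a gold label ranked second contributes 0.5 to MRR at a limit of 3 but nothing at a limit of 1.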

Sequence Diagram(s)

sequenceDiagram
    actor Runner as Compare Runner
    participant Parser as Compare Parser
    participant Report as Report Writer
    participant Summary as Summary Formatter

    Runner->>Parser: emit stdout (JSON or text)
    alt Structured JSON (Claude/Gemini)
        Parser->>Parser: extract answer text & usage metadata
        Note over Parser: Claude: usage field<br/>Gemini: usageMetadata
        Parser->>Report: write answer text to artifact
        Parser->>Report: record usage (input/total tokens,<br/>source label)
    else Plain text or malformed JSON
        Parser->>Parser: treat as plain answer text
        Parser->>Report: write text to artifact
        Parser->>Report: mark usage as null<br/>(fallback to cl100k_base)
    end
    Report->>Summary: sync prompt tokens & reduction ratios<br/>from captured usage
    Summary->>Summary: format output with token deltas<br/>and source labels
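
The last two steps of the diagram (syncing usage into the report and labeling its source) might look something like the sketch below. It assumes a Claude-style payload of the form `{ "result": "...", "usage": { "input_tokens": n } }`; the `TokenSource` labels, `captureUsage`, `reductionRatio`, and `formatSummary` are hypothetical names, and the ratio is defined here as baseline tokens divided by graph-assisted tokens, which is an assumption rather than the PR's stated formula.

```typescript
// Hypothetical source labels; the PR's ComparePromptTokenSource values may differ.
type TokenSource = "provider" | "estimate";

interface CapturedUsage {
  inputTokens: number;
  source: TokenSource;
}

// Returns the provider-reported input token count, or null if stdout is
// plain text, malformed JSON, or lacks a numeric usage.input_tokens field.
function readClaudeInputTokens(stdout: string): number | null {
  try {
    const parsed: unknown = JSON.parse(stdout);
    if (typeof parsed !== "object" || parsed === null) return null;
    const usage: unknown = (parsed as Record<string, unknown>).usage;
    if (typeof usage !== "object" || usage === null) return null;
    const input: unknown = (usage as Record<string, unknown>).input_tokens;
    return typeof input === "number" && input >= 0 ? input : null;
  } catch {
    return null;
  }
}

// Prefer real usage; otherwise keep the local estimate and label it as such.
function captureUsage(stdout: string, localEstimate: number): CapturedUsage {
  const reported = readClaudeInputTokens(stdout);
  return reported === null
    ? { inputTokens: localEstimate, source: "estimate" }
    : { inputTokens: reported, source: "provider" };
}

function reductionRatio(baseline: CapturedUsage, reduced: CapturedUsage): number | null {
  if (reduced.inputTokens === 0) return null;
  return baseline.inputTokens / reduced.inputTokens;
}

function formatSummary(baseline: CapturedUsage, reduced: CapturedUsage): string {
  const ratio = reductionRatio(baseline, reduced);
  const ratioText = ratio === null ? "n/a" : `${ratio.toFixed(2)}x`;
  return (
    `baseline ${baseline.inputTokens} (${baseline.source}) -> ` +
    `graph ${reduced.inputTokens} (${reduced.source}), reduction ${ratioText}`
  );
}
```

Carrying the source label alongside every count keeps a summary from silently mixing measured and estimated numbers.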

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 Hops of joy through ranking's light,
Relations guide each node just right,
Tokens parsed from Gemini's stream,
Evidence tiers fulfill the dream,
Benchmarks tightened, proofs align—
The proof of work, now crystal-line! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: support Gemini compare usage capture' directly and specifically summarizes the main objective of the pull request, which is adding Gemini support to the compare runtime's usage capture functionality. |
| Description check | ✅ Passed | The pull request description covers the key changes (usage capture, Gemini support, documentation) and includes a test plan with all three required checks (typecheck, test:run, build) marked as completed, though it does not fully follow the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@mohanagy mohanagy merged commit aaea0ee into main Apr 27, 2026
11 of 12 checks passed
@mohanagy mohanagy deleted the feature/gemini-compare-usage branch April 27, 2026 20:07