5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,11 @@ All notable changes to the TypeScript package will be documented in this file.

## [Unreleased]

### Improved

- **Retrieval quality**: improved retrieval ranking with relation-aware expansion so connected evidence surfaces more reliably; strengthened recall/MRR eval guardrails to prevent misleading benchmark results
- **Gemini compare docs**: documented the stdin-safe Gemini JSON runner (`cat {prompt_file} | gemini -p "" --output-format json`); clarified that `compare` uses reported Gemini/Claude usage when structured JSON includes it and falls back to labeled local estimates otherwise, while `benchmark`/`eval` remain offline estimate surfaces

## [0.8.7] - 2026-04-27

### Changed
13 changes: 11 additions & 2 deletions README.md
@@ -79,16 +79,25 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What `compare` does:

- Prints a warning before execution because it may consume paid model tokens. Use `--yes` for non-interactive runs and CI.
- Expands runner placeholders: `{prompt_file}`, `{question}`, `{mode}`, and `{output_file}`.
- For large prompts, pass `{prompt_file}` through stdin or file redirection; see the sketch after this list. Avoid shell command substitution around `{prompt_file}` (for example `$(cat {prompt_file})`), which can hit OS argument-length limits.
- Writes a proof bundle under `graphify-out/compare/<timestamp>/` with `baseline-prompt.txt`, `graphify-prompt.txt`, `baseline-answer.txt`, `graphify-answer.txt`, and `report.json`.
- Reports prompt-token counts as local `cl100k_base` estimates, not provider billing tokens.
- Promotes provider-reported usage into `report.json` and the terminal summary when the runner emits structured JSON with usage (for Gemini, `usageMetadata` from `--output-format json`; for Claude, structured JSON with `usage`).
- Falls back to labeled local `cl100k_base` prompt estimates when the runner only returns answer text or malformed JSON, so the token source stays explicit.
- Preserves partial artifacts when one side fails, and classifies prompt-size failures such as `Prompt is too long` as `context_overflow` evidence in `report.json`.
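
A minimal sketch of the safe and unsafe runner shapes, using the `claude -p` runner from this project's examples (the unsafe variant is left commented out):

```bash
# Safe: the prompt travels over stdin, so OS argument-length limits never apply
graphify-ts compare "How does login create a session?" \
  --exec 'cat {prompt_file} | claude -p' \
  --yes

# Unsafe: $(cat {prompt_file}) expands the entire prompt into argv and can
# fail with "argument list too long" on full-repo baselines
# graphify-ts compare "How does login create a session?" \
#   --exec 'claude -p "$(cat {prompt_file})"' --yes
```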

Use `compare` when you want a showcase or a customer-proof run. Use `benchmark` and `eval` when you want repeatable local measurements without calling a model; they stay offline and report local estimates, never provider-reported usage.
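
For the offline side no runner is involved at all; a minimal sketch using the eval invocation from this project's examples:

```bash
# Offline and repeatable: no model call, and token counts are local
# cl100k_base estimates rather than provider billing tokens
graphify-ts eval graphify-out/graph.json --questions benchmark-questions.json
```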

## Graph time travel (ref-to-ref graph compare)

12 changes: 10 additions & 2 deletions docs/proof-workflows.md
@@ -32,6 +32,14 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What gets saved under `graphify-out/compare/<timestamp>/`:

- `baseline-prompt.txt`
@@ -40,7 +48,7 @@ What gets saved under `graphify-out/compare/<timestamp>/`:
- `graphify-answer.txt`
- `report.json`

When Gemini emits structured JSON with `usageMetadata`, `compare` captures the reported input and total tokens in `report.json` and the terminal summary. If the runner returns only answer text or malformed JSON, `compare` falls back to labeled local `cl100k_base` prompt estimates, so the token source stays explicit.

Use this when you need customer-proof or your own apples-to-apples answer comparison. It can spend paid model tokens, so it is intentionally separate from the local benchmark/eval path; `benchmark` and `eval` remain offline estimate surfaces.
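
To sanity-check the runner output before trusting reported counts, you can look for `usageMetadata` directly (a sketch; it assumes `jq` is installed and makes no assumption about where Gemini nests the object):

```bash
# Run the same runner compare will use and search the JSON recursively
# for usageMetadata; if this prints null, compare will fall back to
# labeled local cl100k_base estimates.
# Note: this invokes Gemini, so it can spend paid tokens.
cat graphify-out/compare/<timestamp>/graphify-prompt.txt \
  | gemini -p "" --output-format json \
  | jq '[.. | .usageMetadata? // empty] | first'
```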

## 3. Production and multi-repo proof

@@ -78,7 +86,7 @@ What this proves that a single-repo demo cannot:
|---|---|
| "Does the graph improve retrieval quality on a labeled set?" | `eval` |
| "Does the graph reduce prompt size while keeping expected evidence?" | `benchmark` |
| "Will my actual model answer better with graphify than with a naive baseline?" | `compare` |
| "Will my actual model answer better with graphify than with a naive baseline, and optionally capture provider-reported usage?" | `compare` |
| "Can this work across frontend/backend/shared repos?" | `federate` + `serve --stdio` |

For the narrative production benchmark and the GoValidate numbers, see [`examples/why-graphify.md`](../examples/why-graphify.md). For exact support coverage by language and file type, see [`language-capability-matrix.md`](./language-capability-matrix.md).
19 changes: 16 additions & 3 deletions examples/why-graphify.md
@@ -141,14 +141,22 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What this gives you:

- one baseline prompt and one graphify prompt for the same question
- two real model answers from your own terminal runner
- a saved proof bundle in `graphify-out/compare/<timestamp>/`
- prompt-token counts, usage-source labels, and run statuses in `report.json`

Important: `compare` may spend paid model tokens. It prints a warning before execution and requires `--yes` in non-interactive runs. For large prompts, use stdin or file redirection with `{prompt_file}`; avoid shell command substitution around `{prompt_file}` (for example `$(cat {prompt_file})`) because shell argument expansion can fail on full-repo baselines. If Gemini emits structured JSON with `usageMetadata`, `compare` records the reported input and total tokens. If the runner returns only answer text or malformed JSON, `compare` falls back to labeled local `cl100k_base` prompt estimates instead. `benchmark` and `eval` stay offline estimate surfaces.
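
To confirm which token source each side actually used, search the saved report textually (a sketch; these docs do not pin down the exact JSON keys, so grep rather than assuming a schema):

```bash
# Scan the proof bundle's report for usage-source labels; exact key
# names aren't documented here, so search textually instead of by jq path
grep -i -E 'usage|cl100k' graphify-out/compare/<timestamp>/report.json
```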

## Run It on Your Own Codebase

@@ -168,6 +176,11 @@ graphify-ts eval graphify-out/graph.json --questions benchmark-questions.json
# If you want a real same-model A/B proof run
graphify-ts compare "How does auth work?" --exec 'cat {prompt_file} | claude -p' --yes

# Stdin-safe Gemini compare runner with structured usage capture
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes

# Set up your AI agent
graphify-ts claude install # writes .mcp.json with MCP server
graphify-ts cursor install # writes .cursor/mcp.json
@@ -187,7 +200,7 @@ For an internal team rollout, the most convincing sequence is usually:
That progression keeps the proof honest:

- `benchmark` and `eval` are local graph-quality measurements
- `compare` is the model-facing proof, with reported usage when the runner emits structured JSON and labeled estimates otherwise
- `federate` is the production architecture proof for frontend/backend/shared or microservice splits

## Capability Coverage Matters
12 changes: 12 additions & 0 deletions src/infrastructure/benchmark/quality.ts
@@ -55,6 +55,18 @@ export const GOLD_QUESTIONS: GoldQuestion[] = [
question: 'how does the retrieve MCP tool find relevant nodes',
expected_labels: ['retrievecontext', 'scorenode'],
},
{
// Recall guardrail: a bare-identifier query must retrieve its defining node
question: 'retrieveContext',
expected_labels: ['retrievecontext'],
},
{
// Relation-aware expansion guardrail: the connected helper must surface too
question: 'how does retrieveContext build community labels',
expected_labels: ['retrievecontext', 'buildcommunitylabels'],
},
{
// Recall guardrail: same bare-identifier check for the scoring entry point
question: 'scoreNode',
expected_labels: ['scorenode'],
},
{
question: 'how does javascript extraction work',
expected_labels: ['extractjs', 'extractionnode'],