5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,11 @@ All notable changes to the TypeScript package will be documented in this file.

## [Unreleased]

### Improved

- **Retrieval quality**: improved retrieval ranking with relation-aware expansion so connected evidence surfaces more reliably; strengthened recall/MRR eval guardrails to prevent misleading benchmark results
- **Gemini compare docs**: documented the stdin-safe Gemini JSON runner (`cat {prompt_file} | gemini -p "" --output-format json`); clarified that `compare` uses reported Gemini/Claude usage when structured JSON includes it and falls back to labeled local estimates otherwise, while `benchmark`/`eval` remain offline estimate surfaces

## [0.8.7] - 2026-04-27

### Changed
13 changes: 11 additions & 2 deletions README.md
@@ -79,16 +79,25 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What `compare` does:

- Prints a warning before execution because it may consume paid model tokens. Use `--yes` for non-interactive runs and CI.
- Expands runner placeholders: `{prompt_file}`, `{question}`, `{mode}`, and `{output_file}`.
- For large prompts, pass `{prompt_file}` through stdin or file redirection; see the sketch after this list. Avoid shell command substitution around `{prompt_file}` (for example `$(cat {prompt_file})`), which can hit OS argument-length limits.
- Writes a proof bundle under `graphify-out/compare/<timestamp>/` with `baseline-prompt.txt`, `graphify-prompt.txt`, `baseline-answer.txt`, `graphify-answer.txt`, and `report.json`.
- Reports prompt-token counts as local `cl100k_base` estimates, not provider billing tokens.
- Promotes provider-reported usage into `report.json` and the terminal summary when the runner emits structured JSON with usage (for Gemini, `usageMetadata` from `--output-format json`; for Claude, structured JSON with `usage`).
- Falls back to labeled local `cl100k_base` prompt estimates when the runner only returns answer text or malformed JSON, so the token source stays explicit.
- Preserves partial artifacts when one side fails, and classifies prompt-size failures such as `Prompt is too long` as `context_overflow` evidence in `report.json`.
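
A minimal sketch of the safe and unsafe runner shapes, using the `claude -p` runner from this project's examples (the unsafe variant is left commented out):

```bash
# Safe: the prompt travels over stdin, so OS argument-length limits never apply
graphify-ts compare "How does login create a session?" \
  --exec 'cat {prompt_file} | claude -p' \
  --yes

# Unsafe: $(cat {prompt_file}) expands the entire prompt into argv and can
# fail with "argument list too long" on full-repo baselines
# graphify-ts compare "How does login create a session?" \
#   --exec 'claude -p "$(cat {prompt_file})"' --yes
```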

Use `compare` when you want a showcase or a customer-proof run. Use `benchmark` and `eval` when you want repeatable local measurements without calling a model; they stay offline and report local estimates, never provider-reported usage.
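
For the offline side no runner is involved at all; a minimal sketch using the eval invocation from this project's examples:

```bash
# Offline and repeatable: no model call, and token counts are local
# cl100k_base estimates rather than provider billing tokens
graphify-ts eval graphify-out/graph.json --questions benchmark-questions.json
```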

## Graph time travel (ref-to-ref graph compare)

12 changes: 10 additions & 2 deletions docs/proof-workflows.md
@@ -32,6 +32,14 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What gets saved under `graphify-out/compare/<timestamp>/`:

- `baseline-prompt.txt`
@@ -40,7 +48,7 @@ What gets saved under `graphify-out/compare/<timestamp>/`:
- `graphify-answer.txt`
- `report.json`

When Gemini emits structured JSON with `usageMetadata`, `compare` captures the reported input and total tokens in `report.json` and the terminal summary. If the runner returns only answer text or malformed JSON, `compare` falls back to labeled local `cl100k_base` prompt estimates, so the token source stays explicit.

Use this when you need customer-proof or your own apples-to-apples answer comparison. It can spend paid model tokens, so it is intentionally separate from the local benchmark/eval path; `benchmark` and `eval` remain offline estimate surfaces.
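
To sanity-check the runner output before trusting reported counts, you can look for `usageMetadata` directly (a sketch; it assumes `jq` is installed and makes no assumption about where Gemini nests the object):

```bash
# Run the same runner compare will use and search the JSON recursively
# for usageMetadata; if this prints null, compare will fall back to
# labeled local cl100k_base estimates.
# Note: this invokes Gemini, so it can spend paid tokens.
cat graphify-out/compare/<timestamp>/graphify-prompt.txt \
  | gemini -p "" --output-format json \
  | jq '[.. | .usageMetadata? // empty] | first'
```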

## 3. Production and multi-repo proof

@@ -78,7 +86,7 @@ What this proves that a single-repo demo cannot:
|---|---|
| "Does the graph improve retrieval quality on a labeled set?" | `eval` |
| "Does the graph reduce prompt size while keeping expected evidence?" | `benchmark` |
| "Will my actual model answer better with graphify than with a naive baseline?" | `compare` |
| "Will my actual model answer better with graphify than with a naive baseline, and optionally capture provider-reported usage?" | `compare` |
| "Can this work across frontend/backend/shared repos?" | `federate` + `serve --stdio` |

For the narrative production benchmark and the GoValidate numbers, see [`examples/why-graphify.md`](../examples/why-graphify.md). For exact support coverage by language and file type, see [`language-capability-matrix.md`](./language-capability-matrix.md).
19 changes: 16 additions & 3 deletions examples/why-graphify.md
@@ -141,14 +141,22 @@ node dist/src/cli/bin.js compare "How does login create a session?" \
--yes
```

Stdin-safe invocation for the installed Gemini CLI:

```bash
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes
```

What this gives you:

- one baseline prompt and one graphify prompt for the same question
- two real model answers from your own terminal runner
- a saved proof bundle in `graphify-out/compare/<timestamp>/`
- prompt-token counts, usage-source labels, and run statuses in `report.json`

Important: `compare` may spend paid model tokens. It prints a warning before execution and requires `--yes` in non-interactive runs. For large prompts, use stdin or file redirection with `{prompt_file}`; avoid shell command substitution around `{prompt_file}` (for example `$(cat {prompt_file})`) because shell argument expansion can fail on full-repo baselines. If Gemini emits structured JSON with `usageMetadata`, `compare` records the reported input and total tokens. If the runner returns only answer text or malformed JSON, `compare` falls back to labeled local `cl100k_base` prompt estimates instead. `benchmark` and `eval` stay offline estimate surfaces.
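
To confirm which token source each side actually used, search the saved report textually (a sketch; these docs do not pin down the exact JSON keys, so grep rather than assuming a schema):

```bash
# Scan the proof bundle's report for usage-source labels; exact key
# names aren't documented here, so search textually instead of by jq path
grep -i -E 'usage|cl100k' graphify-out/compare/<timestamp>/report.json
```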

## Run It on Your Own Codebase

@@ -168,6 +176,11 @@ graphify-ts eval graphify-out/graph.json --questions benchmark-questions.json
# If you want a real same-model A/B proof run
graphify-ts compare "How does auth work?" --exec 'cat {prompt_file} | claude -p' --yes

# Stdin-safe Gemini compare runner with structured usage capture
graphify-ts compare "How does auth work?" \
--exec 'cat {prompt_file} | gemini -p "" --output-format json' \
--yes

# Set up your AI agent
graphify-ts claude install # writes .mcp.json with MCP server
graphify-ts cursor install # writes .cursor/mcp.json
@@ -187,7 +200,7 @@ For an internal team rollout, the most convincing sequence is usually:
That progression keeps the proof honest:

- `benchmark` and `eval` are local graph-quality measurements
- `compare` is the model-facing proof, with reported usage when the runner emits structured JSON and labeled estimates otherwise
- `federate` is the production architecture proof for frontend/backend/shared or microservice splits

## Capability Coverage Matters
12 changes: 12 additions & 0 deletions src/infrastructure/benchmark/quality.ts
@@ -55,6 +55,18 @@ export const GOLD_QUESTIONS: GoldQuestion[] = [
question: 'how does the retrieve MCP tool find relevant nodes',
expected_labels: ['retrievecontext', 'scorenode'],
},
{
// Recall guardrail: a bare-identifier query must retrieve its defining node
question: 'retrieveContext',
expected_labels: ['retrievecontext'],
},
{
// Relation-aware expansion guardrail: the connected helper must surface too
question: 'how does retrieveContext build community labels',
expected_labels: ['retrievecontext', 'buildcommunitylabels'],
},
{
// Recall guardrail: same bare-identifier check for the scoring entry point
question: 'scoreNode',
expected_labels: ['scorenode'],
},
{
question: 'how does javascript extraction work',
expected_labels: ['extractjs', 'extractionnode'],