Clean experiment artifacts for a narrow code-security distillation study.
This repo packages the subset of code, data, benchmark outputs, and notes from a larger research workspace, centered on the strongest result from the project: a narrow benchmark where a distilled 7B model slightly outperformed GPT-5.2.
The main result here is simple:
- a `7B` open model slightly beat `GPT-5.2` on a real-world security benchmark
- the win came from task narrowing, public matched data, and frontier-model distillation
- the best student was still small enough to be a practical specialist, not a general replacement for frontier models
The benchmark task was:
- C/C++ numeric vulnerability triage
- only `CWE-190` and `CWE-191`
- strict structured output: `vulnerable` / `subtype` / `location` / `reason`
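A minimal sketch of parsing and validating that structured output. The exact response format the repo expects may differ; this assumes one `field: value` pair per line, which is an illustrative assumption:

```python
REQUIRED_FIELDS = ("vulnerable", "subtype", "location", "reason")

def parse_triage_output(text):
    """Parse a model response into the four required triage fields.

    Returns a dict of field -> value, or None if any required
    field is missing (i.e. the output fails the strict format).
    """
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in REQUIRED_FIELDS:
            fields[key.strip().lower()] = value.strip()
    if any(f not in fields for f in REQUIRED_FIELDS):
        return None
    return fields
```

A strict parser like this is what makes the benchmark scoreable: malformed outputs can simply be counted as failures rather than fuzzily interpreted.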
This is not a claim about general vulnerability detection or patch generation. It is a claim about a specific, repeated workflow where specialization mattered.
All models below were evaluated on the same frozen 140-example PrimeVul test set:
- `20` true numeric vulnerabilities
- `120` negatives / distractors
Because the benchmark is negative-heavy, the most useful metric is:
balanced binary accuracy = (positive recall + negative accuracy) / 2
| Model | Balanced Binary Acc | Positive Recall | Negative Accuracy | Read |
|---|---|---|---|---|
| Qwen + Juliet -> PrimeVul distilled | 73.8% | 95.0% | 52.5% | Best recall, most aggressive |
| GPT-5.2 | 70.8% | 85.0% | 56.7% | Strong frontier baseline |
| Qwen + PrimeVul distilled | 70.0% | 85.0% | 55.0% | Roughly tied with GPT-5.2 |
| Qwen base | 63.8% | 30.0% | 97.5% | Very conservative |
| Qwen + Juliet stage 1 | 50.0% | 0.0% | 100.0% | Learned to always say NONE |
Short version:
- the best `7B` student slightly beat `GPT-5.2` on balanced accuracy
- `PrimeVul` + `GPT-5.2`-distilled targets created the real lift
- `Juliet` may provide a small warm-start benefit before the real-world distilled stage
This repo supports a result that is stronger than "small models can be decent" and narrower than "small models beat frontier models at security":
- a small open model can become competitive with, and slightly outperform, a frontier model on a narrow code-security workflow
- public data was enough to get there
- distillation plus task matching mattered more than raw dataset size
- the useful end state is a cheaper specialist, not a better general reasoner
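In outline, the distillation step that produced the structured targets looks something like the sketch below. The function and field names are illustrative assumptions, not the actual contents of `scripts/distill_numeric_triage.py`:

```python
import json

def build_distilled_dataset(examples, teacher, out_path):
    """Write a JSONL file of (prompt, teacher-target) training pairs.

    `examples` yields dicts with a "prompt" key; `teacher` is any
    callable mapping a prompt string to a structured triage answer.
    """
    with open(out_path, "w") as f:
        for ex in examples:
            target = teacher(ex["prompt"])  # frontier-model output
            row = {"prompt": ex["prompt"], "completion": target}
            f.write(json.dumps(row) + "\n")
```

The point of the sketch is the shape of the data: the student is fine-tuned on the teacher's structured answers over matched public prompts, so it inherits the teacher's behavior only on this task family.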
The important pattern is:
- broader task scopes are harder to move with small fine-tunes
- narrow real-world distillation worked
- specialization is the lever
- `Juliet` alone was not enough to produce the strongest result
- `PrimeVul` real-world examples plus `GPT-5.2`-distilled structured targets did
- `Juliet -> PrimeVul distilled` was the best run, but the improvement over `PrimeVul distilled` was small
- the best student won mostly by increasing positive recall, not by becoming uniformly more accurate
So the core lesson is not "synthetic security data is enough." The core lesson is that a small model can inherit useful frontier behavior when the task, labels, and evaluation are all tightly aligned.
These caveats matter, but they do not erase the result.
- This is a narrow task, not general vulnerability detection.
- This is distillation, not independent general reasoning.
  - the best student models were trained on `GPT-5.2`-generated targets from the same task family
- The benchmark has no exact train/test overlap:
  - task ID overlap: `0`
  - commit overlap: `0`
  - exact prompt overlap: `0`
  - exact code overlap: `0`
- There are still weaker overlap risks:
  - same public corpus family (`PrimeVul`)
  - same-project overlap
  - some shared CVEs between train and eval
- The best-performing student is more aggressive than the base model.
  - it catches more real positives
  - it also produces more false positives
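The exact-overlap checks in the caveats above amount to set intersections over normalized text. A hypothetical helper (the repo's actual contamination check may be implemented differently):

```python
import hashlib

def exact_overlap(train_items, eval_items):
    """Count eval items whose normalized text also appears in train."""
    def key(text):
        # Collapse whitespace so trivially reformatted code still matches.
        norm = " ".join(text.split())
        return hashlib.sha256(norm.encode()).hexdigest()

    train_keys = {key(t) for t in train_items}
    return sum(key(e) in train_keys for e in eval_items)
```

Note that a count of `0` here rules out only verbatim leakage; it cannot detect the weaker risks listed above, such as same-project code or shared CVEs expressed in different functions.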
The right interpretation is:
- this is real evidence that a small open model can win on a narrow, structured workflow
- it is not evidence that a small model has better general reasoning than a frontier model
I also explored broader vulnerability-detection setups, but this repo is intentionally centered on the narrower benchmark where the result was clearest and most interesting:
- a fixed real-world eval set
- a tightly scoped task
- public matched data
- distilled structured targets
That is the setup that produced the most believable frontier-vs-small-model comparison in this project.
- `EXPERIMENT_NOTES_2026-03-26.md`: Full writeup with methodology, result tables, contamination checks, and interpretation.
- `benchmarks/`: Saved benchmark outputs for the cleaned broad benchmark and the numeric-triage runs.
- `data/`: Frozen eval sets, manifests, and numeric training datasets.
- `scripts/`: Dataset builders, distillation scripts, Modal train/eval entrypoints, and the numeric rebalancer.
- `src/rl_secdef/`: The subset of package code needed to build and benchmark the cleaned experiments.
- `tests/`: Focused tests for the numeric-triage pipeline.
- `benchmarks/qwen7b_primevul_numeric_base_eval.json`
- `benchmarks/qwen7b_juliet_numeric_stage1_eval.json`
- `benchmarks/gpt52_primevul_numeric_eval.json`
- `benchmarks/qwen7b_primevul_numeric_distilled_eval.json`
- `benchmarks/qwen7b_juliet_primevul_numeric_distilled_eval.json`
- `data/primevul_numeric_triage_train.jsonl`
- `data/primevul_numeric_triage_train_distilled.jsonl`
- `data/primevul_numeric_triage_eval.jsonl`
- `data/primevul_numeric_triage_train.manifest.json`
- `data/juliet_numeric_triage.jsonl`
- `data/juliet_numeric_triage.manifest.json`
- `scripts/build_primevul_numeric_triage.py`
- `scripts/build_juliet_numeric_triage.py`
- `scripts/distill_numeric_triage.py`
- `scripts/modal_train_detect.py`
- `scripts/modal_eval_numeric.py`
- `scripts/rebalance_numeric_triage.py`
- `src/rl_secdef/data/primevul_numeric.py`
- `src/rl_secdef/runner/numeric_triage.py`
- `src/rl_secdef/benchmark_numeric.py`
Focused tests:

```shell
python3 -m pytest tests/test_numeric_triage.py tests/test_primevul_numeric.py tests/test_rebalance_numeric.py -q
```

The extracted numeric experiment bundle is intentionally small. This repo is not meant to be a full mirror of the original workspace.
Read: `EXPERIMENT_NOTES_2026-03-26.md`. That file contains the full methodology, result tables, contamination checks, and interpretation.