📄 Paper · 🌐 Project page · 🤗 Datasets · 🏆 Leaderboard
PDB is an automatic pipeline that turns any coding dataset into a debugging benchmark with fine-grained metrics. Beyond binary unit-test scores, PDB evaluates a debugger with edit-level precision (did the model touch only the lines it had to?) and bug-level recall (did it fix every fault?). This rewards targeted fixes and penalizes the regeneration behavior frontier LLMs often fall back on.
- Released datasets: PDB-Single · PDB-Single-Hard · PDB-Multi
TL;DR — Frontier models like GPT-5.1-Codex and DeepSeek-V3.2-Thinking top unit-test leaderboards (>76%) but score at or below 45% on precision: they pass tests by rewriting, not repairing. PDB makes that gap measurable.
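The precision/recall framing boils down to set overlap between the lines a patch touches and the lines the injected bug occupies. A minimal sketch (a hypothetical helper; the real scoring in src/evaluator.py operates on diffs and may differ in detail):

```python
def edit_metrics(edited_lines, bug_lines):
    """Hypothetical helper: edited_lines are line numbers the model's patch
    touched, bug_lines are line numbers the injected bug occupies."""
    edited, bugs = set(edited_lines), set(bug_lines)
    if not edited or not bugs:
        return 0.0, 0.0, 0.0
    precision = len(edited & bugs) / len(edited)  # touched only necessary lines?
    recall = len(edited & bugs) / len(bugs)       # covered every faulty line?
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# A full-file rewrite can pass unit tests yet score terribly on precision:
print(edit_metrics(range(1, 51), [17, 30]))  # precision 0.04, recall 1.0
print(edit_metrics([17, 30], [17, 30]))      # precision 1.0, recall 1.0
```

Under this view, a model that regenerates the whole file keeps its unit score while precision collapses toward (buggy lines) / (file length).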
We use uv for reproducible environments.
```bash
git clone https://github.com/Bill1235813/PDB
cd PDB
uv sync                    # creates .venv, installs locked deps
source .venv/bin/activate  # optional; scripts already point at .venv/bin/python
```

The LiveCodeBench and BigCodeBench sandboxes live in separate uv envs:

```bash
cd dataset/bigcodebench/install && uv sync --extra eval && cd -
cd dataset/livecodebench/install && uv sync && cd -
```

Drop one key file per provider into `keys/` (each file is a single line with the raw key). Mapping, local-model setup, and `--model_api_file` override instructions are in `keys/README.md`.
Bug-correct + score one model across both BigCodeBench and LiveCodeBench:
```bash
bash scripts/simple_debug_eval.sh <subset> <model>
```

`<subset>` ∈ {`single`, `single-hard`, `multi`} (points at `<bench>_pdb_<subset>.json`). `<model>` is any dspy model string: `openai/gpt-5.1-codex`, `anthropic/claude-sonnet-4-5-20250929`, `deepseek/deepseek-chat`, etc. Local / self-hosted endpoints are supported too — see `scripts/README.md`.
Example output (Evaluator per-dataset lines + driver union):
```
[summary] gpt-5.1-codex on bigcodebench_pdb_single_hard round 1: unit=0.733 prec=0.548 rec=0.777 f1=0.602 (n=2510)
[summary] gpt-5.1-codex on livecodebench_pdb_single_hard round 1: unit=0.914 prec=0.465 rec=0.789 f1=0.540 (n=3224)
union unit=0.828 prec=0.500 rec=0.783 f1=0.566 (n=5734)
```
To loop a fixed list of reference models instead of one, run scripts/run_debug_eval.sh with the same subset arg. Model list, token budgets, and run-wide knobs (debug mode, rounds, temperature) are configurable at the top of each driver — see scripts/README.md for details.
If you already have patches saved (downloaded from Hugging Face, produced by an external agent, etc.), score them without re-running the model:
```bash
python src/evaluator.py \
  --dataset_name bigcodebench \
  --eval_model_name my-model \
  --input_file my-model_on_bigcodebench_pdb_single_round_1.json \
  --eval_set_name bigcodebench_pdb_single \
  --max_iter 1
```

The input format matches what bug_correct.py writes (a list of entries with task_id, buggy_code, gt_solution, debug_results.solution, gt_diff, …). At the end, the same `[summary]` line as above is printed. Output-file paths and schema are documented in scripts/README.md.
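For orientation, a minimal entry in that format might look like the following (values are illustrative; the authoritative schema is documented in scripts/README.md):

```python
import json

# Illustrative single-entry input file for src/evaluator.py; field values
# are made up, only the field names follow the description above.
entry = {
    "task_id": "BigCodeBench/0",
    "buggy_code": "def add(a, b):\n    return a - b\n",
    "gt_solution": "def add(a, b):\n    return a + b\n",
    "gt_diff": "-    return a - b\n+    return a + b\n",
    "debug_results": {"solution": "def add(a, b):\n    return a + b\n"},
}

with open("my-model_on_bigcodebench_pdb_single_round_1.json", "w") as f:
    json.dump([entry], f, indent=2)  # evaluator expects a list of entries
```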
Every file under dataset/<bench>/data/full_data.json goes through the same pipeline:
```bash
python src/bug_generation.py \
  --dataset_name bigcodebench \
  --model_name openai/gpt-5.1-codex \
  --input_file full_data.json \
  --output_prefix oai_buggy_code \
  --mode single \
  --stride 2 \
  --max_lines_per_block 1 \
  --max_bugs 4 \
  --bug_per_time 20 \
  --max_gen_per_bin 5 \
  --temperature 1.0 --max_tokens 32000
```

For `--mode multi`, pass `--stride 4` and `--max_lines_per_block 2` (up to 4); all flags are tabulated below. The generator produces `oai_buggy_code_<timestamp>.json` under `results/<bench>/bug_data/`. Three-model fan-out drivers:
- scripts/run_bug_gen_single.sh — single-line (3 models × 2 datasets, then merge into `<bench>_pdb_single.json`)
- scripts/run_bug_gen_multi.sh — multi-line (reads per-model `long_<name>.json` splits, merges into `<bench>_pdb_multi.json`)
Both scripts preflight API keys against a cheap probe before spending credits, and they run all 6 (model × dataset) jobs concurrently.
The default pool is GPT-5.1-Codex + Claude-4.5-Sonnet + Gemini-2.5-Pro. Swap the MODELS array in run_bug_gen_*.sh to taste — anything supported by LiteLLM works.
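The preflight step amounts to one cheap probe per model before launching the expensive fan-out. A provider-agnostic sketch, where `probe` stands in for whatever inexpensive API call the scripts actually make:

```python
def preflight(models, probe):
    """Return only the models whose probe call succeeds; `probe` should make
    one cheap request (e.g. a 1-token completion) and raise on auth failure."""
    ok = []
    for model in models:
        try:
            probe(model)
            ok.append(model)
        except Exception as exc:
            print(f"[preflight] skipping {model}: {exc}")
    return ok

def fake_probe(model):  # stand-in for a real API call
    if "bad" in model:
        raise RuntimeError("invalid API key")

print(preflight(["gpt-ok", "bad-model"], fake_probe))  # ['gpt-ok']
```

Failing fast here means a typo'd key costs one rejected request instead of a six-job generation run.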
Implement a DatasetHandler subclass under dataset/<your-dataset>/ and register it in dataset/__init__.py. See dataset/README.md for the full interface + vendored-sandbox layout. The rest of the pipeline is dataset-agnostic once the handler exists.
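As a rough sketch of what such a handler involves (method names here are hypothetical; the real interface is specified in dataset/README.md):

```python
from abc import ABC, abstractmethod

class DatasetHandler(ABC):
    """Illustrative interface only; see dataset/README.md for the real one."""

    @abstractmethod
    def load_tasks(self, path):
        """Return a list of task dicts (task_id, canonical solution, tests)."""

    @abstractmethod
    def run_tests(self, task, candidate_code):
        """Run the task's unit tests against candidate_code; True iff all pass."""

class MyDatasetHandler(DatasetHandler):
    def load_tasks(self, path):
        # A real handler would parse dataset/<your-dataset>/data/ files here.
        return [{"task_id": "demo/0", "solution": "x = 1", "tests": "assert x == 1"}]

    def run_tests(self, task, candidate_code):
        env = {}
        try:
            # A real handler would execute inside the vendored sandbox instead.
            exec(candidate_code + "\n" + task["tests"], env)
            return True
        except AssertionError:
            return False
```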
| flag | default | meaning |
|---|---|---|
| `--mode` | `single` | `single` (1-line bugs) or `multi` (contiguous 2-4 line blocks) |
| `--stride` | 2 | minimum inter-block line gap during composition (s) |
| `--max_bugs` | 4 | k_max — max bugs composed into a single program |
| `--bug_per_time` | 20 | m_1 — LLM calls per (x, C_gt) pair for atomic-bug drafting |
| `--max_gen_per_bin` | 5 | m_3 — subsample cap per (task, bug_count) bin |
| `--max_lines_per_block` | 1 single / 4 multi | block size cap for diff validation |
| `--temperature` | 1.0 | sampling temperature |
| `--max_tokens` | 32000 | thinking-budget cap |
All three flavors below start from an already-scored round-1 single-pass run (produced by scripts/run_debug_eval.sh or simple_debug_eval.sh) and reload it with --reload_first_round, so only rounds 2+ consume fresh API credits.
The debugger sees its prior failed patches appended to failed_attempts, and the template auto-switches to *_with_feedback between rounds. No unit-test content or error traces are exposed.
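A sketch of that round-to-round bookkeeping, with field and template names taken from the description above (the actual logic lives in src/bug_correct.py and may differ):

```python
def next_prompt_inputs(entry, round_idx, base_template="minimal"):
    """After round 1, prior failed patches ride along in failed_attempts and
    the template switches to its *_with_feedback variant. Nothing else
    (no unit tests, no error traces) is added to the prompt."""
    template = base_template if round_idx == 1 else base_template + "_with_feedback"
    return {
        "template": template,
        "buggy_code": entry["buggy_code"],
        "failed_attempts": entry.get("failed_attempts", []),
    }

entry = {"buggy_code": "...", "failed_attempts": ["patch_v1"]}
print(next_prompt_inputs(entry, 2)["template"])  # minimal_with_feedback
```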
```bash
python src/bug_correct.py \
  --dataset_name bigcodebench \
  --input_file bigcodebench_pdb_single_hard.json \
  --eval_set_name bigcodebench_pdb_single_hard \
  --model_name openai/gpt-5.1-codex \
  --debug_mode minimal --max_rounds 3 \
  --reload_first_round \
  --reload_result_file results/bigcodebench/debug_results/gpt-5.1-codex_on_bigcodebench_pdb_single_hard_round_1.json \
  --reload_score_file results/bigcodebench/eval_results/gpt-5.1-codex_on_bigcodebench_pdb_single_hard_round_1_scores.json \
  --temperature 1.0 --max_tokens 32000
```

The agentic flavor is the same as the iterative one, except that `--use_tests` puts the hidden unit tests into the prompt and `--error_msg` injects the sandbox's stdout/stderr for every failing attempt:
```bash
python src/bug_correct.py \
  --dataset_name bigcodebench \
  --input_file bigcodebench_pdb_single_hard.json \
  --eval_set_name bigcodebench_pdb_single_hard_agentic \
  --model_name openai/gpt-5.1-codex \
  --debug_mode minimal --max_rounds 3 \
  --use_tests --error_msg \
  --reload_first_round \
  --reload_result_file results/bigcodebench/debug_results/gpt-5.1-codex_on_bigcodebench_pdb_single_hard_round_1.json \
  --reload_score_file results/bigcodebench/eval_results/gpt-5.1-codex_on_bigcodebench_pdb_single_hard_round_1_scores.json \
  --temperature 1.0 --max_tokens 32000
```

The Claude Code flavor swaps the single-pass dspy LM for an autonomous Claude Code subagent that can read the buggy code, execute tests, and iteratively patch; it is routed through src/claude_code_wrapper.py.
```bash
python src/bug_correct.py \
  --dataset_name bigcodebench \
  --input_file bigcodebench_pdb_single_hard.json \
  --eval_set_name bigcodebench_pdb_single_hard_claudecode \
  --model_name claude-code-agent \
  --use_claude_code \
  --debug_mode minimal --max_rounds 1 \
  --timeout 300
```

`--use_claude_code` implies `--use_tests` internally (the agent is expected to run them), so unit tests are always available. `--max_rounds 1` is typical here because the agent already iterates within its own loop.
Each round of every flavor writes <model>_on_<eval_set>_round_<k>.json + its scores file. Prompt templates (minimal vs free, *_with_feedback, *_unit) are documented in scripts/README.md.
```bash
bash scripts/run_bug_gen_single.sh  # 3 generators × 2 datasets + merge
```

After scoring the 9 reference models on `_pdb_single`, filter to tasks solved perfectly by < 7 of 9:
```bash
bash scripts/run_debug_eval.sh single  # populates eval_results/
# then run the hard-filter cell in visualize/visualize.ipynb, or use the
# self-contained routine at the bottom of scripts/push_to_hf.py, which follows
# the same logic.
```

This is exactly the procedure that produces the PDB-Single-Hard release set (5,734 examples).
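The filter itself is a one-liner over per-task results; a sketch with a hypothetical results layout:

```python
def hard_filter(results, threshold=7):
    """results: {task_id: {model_name: solved_perfectly (bool)}}.
    Keep tasks solved perfectly by fewer than `threshold` reference models."""
    return [task for task, per_model in results.items()
            if sum(per_model.values()) < threshold]

results = {
    "t1": {f"m{i}": True for i in range(9)},   # 9/9 solved -> dropped
    "t2": {f"m{i}": i < 5 for i in range(9)},  # 5/9 solved -> kept
}
print(hard_filter(results))  # ['t2']
```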
Requires the `dataset/<bench>/data/long_<name>.json` per-generator splits of tasks with ≥ 35-line canonical solutions.

```bash
bash scripts/run_bug_gen_multi.sh
bash scripts/run_debug_eval.sh single-hard
bash scripts/run_debug_eval.sh multi
```

Final reproduction targets (union over BCB + LCB):
| subset | n | models evaluated | top precision model | top unit-score model |
|---|---|---|---|---|
| PDB-Single | 7,591 | 9 | Claude-Sonnet-4.5 | DeepSeek-V3.2-Thinking |
| PDB-Single-Hard | 5,751 | 9 | Claude-Sonnet-4.5 | DeepSeek-V3.2-Thinking |
| PDB-Multi | 256 | 9 | Claude-Sonnet-4.5 | DeepSeek-V3.2-Thinking |
```bibtex
@inproceedings{zhu2026pdb,
  title     = {Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?},
  author    = {Zhu, Wang Bill and Chai, Miaosen and Wang, Shangshang and Liu, Yejia and
               Bian, Song and Dong, Honghua and Neiswanger, Willie and Jia, Robin},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}
```