This repository contains scripts to evaluate memorization and performance of Pythia-12B models across different quantization modes (FP16 baseline, Int8, NF4, FP4). The main goals are to measure how strongly models memorize training content (e.g., produce exact token sequences) and to quantify how quantization affects accuracy, runtime, and memory usage.
- `download_scripts/` — Helper scripts to download models, datasets, and to install the evaluation framework.
- `quantization_scripts/` — Script to create quantized model variants (`nf4bit`, `fp4bit`, `8bit`).
- `test_memorization/` — Core scripts for memorization experiments (e.g., `k32_memorization_eval.py`, `multi_k_memorization_eval.py`).
- `test_performance/` — Wrapper to run tasks with `lm-evaluation-harness` (`run_eval.py`).
- `plots/plotting_scripts/` — Analysis and visualization scripts (e.g., `visualize_memorization_experiment.py`).
- `data/` — Result data (split into `mem_eval_results/` and `perf_eval_results/`, each with `deduped/` and `duped/`).
- `models/` — Downloaded models and their quantized versions.
- `main_scripts/` — Convenience scripts for full workflows (`run_all_evals.sh`, `run_all_mem_evals.sh`, `run_all_perf_evals.sh`).
- Measure to what extent large language models memorize training data and how memorization changes with different context lengths `k`.
- Compare quantization modes (Int8, NF4, FP4) to assess their effect on memorization, runtime, and VRAM — important for resource-constrained deployments.
- Produce reproducible results (JSONL) and figures that make interpretation straightforward.
- `download_scripts/download_pythia12b.py` — Download a specific revision of Pythia-12B (e.g., `step143000`) into `models/`.
- `download_scripts/download_pythia-memorized-evals.py` — Download the `pythia-memorized-evals` dataset (cached under `test_memorization/`).
- `download_scripts/download_and_install_lm-evaluation-harness.py` — Clone and install `lm-evaluation-harness` and the necessary backends (HF, vLLM, bitsandbytes).
Why? These scripts automate setup (models, datasets, evaluation tooling) so experiments are repeatable.
- `quantization_scripts/quantize_pythia12b.py` — Create quantized model variants (`nf4bit`, `fp4bit`, `8bit`) and save them as new directories in `models/`.
Why? To observe how 4-bit and 8-bit quantization affects memorization, runtime, and resource consumption.
Note: Quantization is a required step for this project — the evaluation workflows expect quantized variants to exist and will compare them to the FP16 baseline.
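As a sketch of how the three variants map onto the standard `transformers.BitsAndBytesConfig` parameters (the exact settings used by `quantize_pythia12b.py` may differ, e.g. in compute dtype or double quantization):

```python
# Hedged sketch: kwargs for transformers.BitsAndBytesConfig, one per variant.
# The variant names mirror the repository's directory suffixes.
QUANT_CONFIGS = {
    "8bit":   {"load_in_8bit": True},
    "nf4bit": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    "fp4bit": {"load_in_4bit": True, "bnb_4bit_quant_type": "fp4"},
}

def build_config_kwargs(variant: str) -> dict:
    """Return the BitsAndBytesConfig kwargs for a named quantization variant."""
    if variant not in QUANT_CONFIGS:
        raise ValueError(f"unknown variant: {variant}")
    return dict(QUANT_CONFIGS[variant])

# Usage (requires transformers, bitsandbytes, and a CUDA GPU):
#   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#   cfg = BitsAndBytesConfig(**build_config_kwargs("nf4bit"))
#   model = AutoModelForCausalLM.from_pretrained(
#       "models/pythia-12b", quantization_config=cfg)
```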
- `test_memorization/k32_memorization_eval.py` — Evaluate a fixed context length `k` (default: 32); measures token accuracy, exact matches, successive correct tokens, runtime, and VRAM. Results are saved as JSONL under `data/mem_eval_results/{deduped|duped}/`.
- `test_memorization/multi_k_memorization_eval.py` — Run evaluations across multiple `k` values (e.g., 4..48) and support both `start_of_sequence` and `end_of_sequence` contexts.
Important arguments (examples):
- `--model_list <model1> <model2> ...`
- `--k` / `--start_k` / `--end_k`
- `--device cuda:0`
- `--number_of_tests N`
- `--eval_token_count M`
- `--save_results`
Why? To analyze how the number of context tokens (k) influences the probability that a model reproduces target tokens exactly.
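The per-example metrics listed above can be sketched as follows: compare the model's greedy continuation against the ground-truth continuation token by token. This is an illustrative re-implementation, not the repository's code; the function and field names are invented here.

```python
def memorization_metrics(generated: list[int], target: list[int]) -> dict:
    """Compare a greedily generated continuation against the ground truth.

    Returns per-example token accuracy, whether the continuation is an
    exact match, and how many leading tokens were reproduced in a row.
    """
    assert len(generated) == len(target), "continuations must be same length"
    matches = [g == t for g, t in zip(generated, target)]
    successive = 0
    for m in matches:  # count correct tokens before the first mistake
        if not m:
            break
        successive += 1
    return {
        "token_accuracy": sum(matches) / len(matches),
        "exact_match": all(matches),
        "successive_correct": successive,
    }

# Example: third generated token is wrong.
memorization_metrics([5, 7, 9, 2], [5, 7, 1, 2])
# → {'token_accuracy': 0.75, 'exact_match': False, 'successive_correct': 2}
```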
- `test_performance/run_eval.py` — Wrapper around `lm_eval` (`lm-evaluation-harness`). Runs benchmarks across tasks (ARC, MMLU, GSM8K, etc.) and stores outputs in `data/perf_eval_results/`.
Note: Batch size is auto-selected, but for 8-bit models it is conservatively set to 7 to avoid OOM.
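The batch-size rule in the note amounts to something like the following (a sketch of the design choice; the wrapper's actual logic may differ):

```python
def select_batch_size(quant_mode: str):
    """Pick a batch size for lm-evaluation-harness runs.

    'auto' lets the harness probe for the largest batch that fits in VRAM;
    8-bit models are pinned to a conservative fixed size to avoid OOM.
    """
    return 7 if quant_mode == "8bit" else "auto"
```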
- `plots/plotting_scripts/visualize_memorization_experiment.py` — Generates performance, efficiency, variability, and relative-difference plots from JSONL results.
- `plots/plotting_scripts/k32_6000_mem_eval.py` — Focused visualization for k=32 with 6000 tests (paper-style plots).
- `plots/plotting_scripts/merged_visualize_memorization_experiment.py` — Compare start- vs. end-of-sequence results with overlay plots.
Why? Visualizations make it easy to identify trends across k, quantization modes, and resource metrics.
This project requires Python 3.10. To set up your environment with conda:
```
conda create -n bbq python=3.10
conda activate bbq
pip install -r requirements.txt
```

- Check system:

  ```
  python3 test_scripts/test_versions.py
  ```

- Install dependencies and set up tools (downloads the dataset and installs `lm-evaluation-harness`):

  ```
  bash main_scripts/run_all_installs.sh
  ```

- Download models, quantize (required), and run evaluations:
  - Full end-to-end run (downloads models, creates quantized variants, runs memorization and performance evaluations for both "duped" and "deduped"):

    ```
    bash main_scripts/run_all_evals.sh
    ```

  - Run memorization evaluations only:

    ```
    bash main_scripts/run_all_mem_evals.sh
    ```

  - Run performance evaluations only:

    ```
    bash main_scripts/run_all_perf_evals.sh
    ```

  - Generate all standard plots:

    ```
    bash main_scripts/run_all_plotting.sh
    ```

Notes:
- Quantization is required by the experimental protocol; the main scripts perform quantization as part of their flow.
- Use the individual Python scripts (in `test_memorization/`, `test_performance/`, or `plots/plotting_scripts/`) directly only if you need to customize arguments. Use `--help` for details.
- Memorization results are saved as JSONL in `data/mem_eval_results/{deduped|duped}/`.
- Performance results are saved in `data/perf_eval_results/{deduped|duped}/` (one timestamped folder per model).
- Plots are written to `plots/mem_eval/{deduped|duped}/` (created automatically).
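The JSONL result files can be loaded for custom analysis with only the standard library. The paths and glob pattern below are illustrative; inspect an actual result file for the real record schema.

```python
import json
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON record per line from a JSONL results file."""
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Example: iterate over all memorization results for the deduped models.
# for f in Path("data/mem_eval_results/deduped").glob("*.jsonl"):
#     for record in load_jsonl(f):
#         ...
```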
- Use a GPU (`cuda:0`) for realistic runtimes; scripts default to `cuda:0`.
- Watch VRAM when loading FP16 models; reduce batch sizes or use quantized variants if you encounter OOM.
- Scripts use greedy generation (`do_sample=False`) for determinism where possible, but results may still depend on environment and hardware.
- For reproducibility, use the `--random_seed` options in the memorization scripts.
- See `requirements.txt` for required Python packages.
- Recommended: Python 3.10+, PyTorch with CUDA support, `transformers`, `datasets`, `bitsandbytes` (for quantization), and `lm-evaluation-harness`.
If you have questions about using or extending the experiments, adding new models, or testing other quantization methods, please open an issue or submit a pull request.
Good luck reproducing and analyzing the experiments.