This repository contains scripts to evaluate memorization and performance of Pythia-12B models across different quantization modes (FP16 baseline, Int8, NF4, FP4). The main goals are to measure how strongly models memorize training content (e.g., produce exact token sequences) and to quantify how quantization affects accuracy, runtime, and memory usage.
- `download_scripts/` — Helper scripts to download models, datasets, and to install the evaluation framework.
- `quantization_scripts/` — Script to create quantized model variants (`nf4bit`, `fp4bit`, `8bit`).
- `test_memorization/` — Core scripts for memorization experiments (e.g., `k32_memorization_eval.py`, `multi_k_memorization_eval.py`).
- `test_performance/` — Wrapper to run tasks with `lm-evaluation-harness` (`run_eval.py`).
- `plots/plotting_scripts/` — Analysis and visualization scripts (e.g., `visualize_memorization_experiment.py`).
- `data/` — Result data (split into `mem_eval_results/` and `perf_eval_results/`, each with `deduped/` and `duped/`).
- `models/` — Downloaded models and their quantized versions.
- `main_scripts/` — Convenience scripts for full workflows (`run_all_evals.sh`, `run_all_mem_evals.sh`, `run_all_perf_evals.sh`).
- Measure to what extent large language models memorize training data and how memorization changes with different context lengths `k`.
- Compare quantization modes (Int8, NF4, FP4) to assess their effect on memorization, runtime, and VRAM — important for resource-constrained deployments.
- Produce reproducible results (JSONL) and figures that make interpretation straightforward.
- `download_scripts/download_pythia12b.py` — Download a specific revision of Pythia-12B (e.g., `step143000`) into `models/`.
- `download_scripts/download_pythia-memorized-evals.py` — Download the `pythia-memorized-evals` dataset (cached under `test_memorization/`).
- `download_scripts/download_and_install_lm-evaluation-harness.py` — Clone and install `lm-evaluation-harness` and the necessary backends (HF, vLLM, bitsandbytes).
Why? These scripts automate setup (models, datasets, evaluation tooling) so experiments are repeatable.
- `quantization_scripts/quantize_pythia12b.py` — Create quantized model variants (`nf4bit`, `fp4bit`, `8bit`) and save them as new directories in `models/`.
Why? To observe how 4-bit and 8-bit quantization affects memorization, runtime, and resource consumption.
Note: Quantization is a required step for this project — the evaluation workflows expect quantized variants to exist and will compare them to the FP16 baseline.
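As a sketch of how the three variants map onto the standard `transformers.BitsAndBytesConfig` parameters (the exact settings used by `quantize_pythia12b.py` may differ, e.g. in compute dtype or double quantization):

```python
# Hedged sketch: kwargs for transformers.BitsAndBytesConfig, one per variant.
# The variant names mirror the repository's directory suffixes.
QUANT_CONFIGS = {
    "8bit":   {"load_in_8bit": True},
    "nf4bit": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    "fp4bit": {"load_in_4bit": True, "bnb_4bit_quant_type": "fp4"},
}

def build_config_kwargs(variant: str) -> dict:
    """Return the BitsAndBytesConfig kwargs for a named quantization variant."""
    if variant not in QUANT_CONFIGS:
        raise ValueError(f"unknown variant: {variant}")
    return dict(QUANT_CONFIGS[variant])

# Usage (requires transformers, bitsandbytes, and a CUDA GPU):
#   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#   cfg = BitsAndBytesConfig(**build_config_kwargs("nf4bit"))
#   model = AutoModelForCausalLM.from_pretrained(
#       "models/pythia-12b", quantization_config=cfg)
```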
- `test_memorization/k32_memorization_eval.py` — Evaluate a fixed context length `k` (default: 32); measures token accuracy, exact matches, successive correct tokens, runtime, and VRAM. Results are saved as JSONL under `data/mem_eval_results/{deduped|duped}/`.
- `test_memorization/multi_k_memorization_eval.py` — Run evaluations across multiple `k` values (e.g., 4..48) and support both `start_of_sequence` and `end_of_sequence` contexts.
Important arguments (examples):
- `--model_list <model1> <model2> ...`
- `--k` / `--start_k` / `--end_k`
- `--device cuda:0`
- `--number_of_tests N`
- `--eval_token_count M`
- `--save_results`
Why? To analyze how the number of context tokens (k) influences the probability that a model reproduces target tokens exactly.
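The per-example metrics listed above can be sketched as follows: compare the model's greedy continuation against the ground-truth continuation token by token. This is an illustrative re-implementation, not the repository's code; the function and field names are invented here.

```python
def memorization_metrics(generated: list[int], target: list[int]) -> dict:
    """Compare a greedily generated continuation against the ground truth.

    Returns per-example token accuracy, whether the continuation is an
    exact match, and how many leading tokens were reproduced in a row.
    """
    assert len(generated) == len(target), "continuations must be same length"
    matches = [g == t for g, t in zip(generated, target)]
    successive = 0
    for m in matches:  # count correct tokens before the first mistake
        if not m:
            break
        successive += 1
    return {
        "token_accuracy": sum(matches) / len(matches),
        "exact_match": all(matches),
        "successive_correct": successive,
    }

# Example: third generated token is wrong.
memorization_metrics([5, 7, 9, 2], [5, 7, 1, 2])
# → {'token_accuracy': 0.75, 'exact_match': False, 'successive_correct': 2}
```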
- `test_performance/run_eval.py` — Wrapper around `lm_eval` (`lm-evaluation-harness`). Runs benchmarks across tasks (ARC, MMLU, GSM8K, etc.) and stores outputs in `data/perf_eval_results/`.
Note: Batch size is auto-selected, but for 8-bit models it is conservatively set to 7 to avoid OOM.
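The batch-size rule in the note amounts to something like the following (a sketch of the design choice; the wrapper's actual logic may differ):

```python
def select_batch_size(quant_mode: str):
    """Pick a batch size for lm-evaluation-harness runs.

    'auto' lets the harness probe for the largest batch that fits in VRAM;
    8-bit models are pinned to a conservative fixed size to avoid OOM.
    """
    return 7 if quant_mode == "8bit" else "auto"
```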
- `plots/plotting_scripts/visualize_memorization_experiment.py` — Generates performance, efficiency, variability, and relative-difference plots from JSONL results.
- `plots/plotting_scripts/k32_6000_mem_eval.py` — Focused visualization for k=32 with 6000 tests (paper-style plots).
- `plots/plotting_scripts/merged_visualize_memorization_experiment.py` — Compare start- vs. end-of-sequence results with overlay plots.
Why? Visualizations make it easy to identify trends across k, quantization modes, and resource metrics.
This project requires Python 3.10. To set up your environment with conda:
```
conda create -n bbq python=3.10
conda activate bbq
pip install -r requirements.txt
```

- Check system:

  ```
  python3 test_scripts/test_versions.py
  ```

- Install dependencies and set up tools (downloads the dataset and installs `lm-evaluation-harness`):

  ```
  bash main_scripts/run_all_installs.sh
  ```

- Download models, quantize (required), and run evaluations:
  - Full end-to-end run (downloads models, creates quantized variants, runs memorization and performance evaluations for both "duped" and "deduped"):

    ```
    bash main_scripts/run_all_evals.sh
    ```

  - Run memorization evaluations only:

    ```
    bash main_scripts/run_all_mem_evals.sh
    ```

  - Run performance evaluations only:

    ```
    bash main_scripts/run_all_perf_evals.sh
    ```

  - Generate all standard plots:

    ```
    bash main_scripts/run_all_plotting.sh
    ```

Notes:
- Quantization is required by the experimental protocol; the main scripts perform quantization as part of their flow.
- Use the individual Python scripts (in `test_memorization/`, `test_performance/`, or `plots/plotting_scripts/`) directly only if you need to customize arguments. Use `--help` for details.
- Memorization results are saved as JSONL in `data/mem_eval_results/{deduped|duped}/`.
- Performance results are saved in `data/perf_eval_results/{deduped|duped}/` (one timestamped folder per model).
- Plots are written to `plots/mem_eval/{deduped|duped}/` (created automatically).
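The JSONL result files can be loaded for custom analysis with only the standard library. The paths and glob pattern below are illustrative; inspect an actual result file for the real record schema.

```python
import json
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON record per line from a JSONL results file."""
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Example: iterate over all memorization results for the deduped models.
# for f in Path("data/mem_eval_results/deduped").glob("*.jsonl"):
#     for record in load_jsonl(f):
#         ...
```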
- Use a GPU (`cuda:0`) for realistic runtimes; scripts default to `cuda:0`.
- Watch VRAM when loading FP16 models; reduce batch sizes or use quantized variants if you encounter OOM.
- Scripts use greedy generation (`do_sample=False`) for determinism where possible, but results may still depend on environment and hardware.
- For reproducibility, use the `--random_seed` options in the memorization scripts.
- See `requirements.txt` for required Python packages.
- Recommended: Python 3.10+, PyTorch with CUDA support, `transformers`, `datasets`, `bitsandbytes` (for quantization), and `lm-evaluation-harness`.
If you have questions about using or extending the experiments, adding new models, or testing other quantization methods, please open an issue or submit a pull request.
Good luck reproducing and analyzing the experiments.