
BBQ — Evaluating Memorization & Performance of Pythia-12B

This repository contains scripts to evaluate memorization and performance of Pythia-12B models across different quantization modes (FP16 baseline, Int8, NF4, FP4). The main goals are to measure how strongly models memorize training content (e.g., produce exact token sequences) and to quantify how quantization affects accuracy, runtime, and memory usage.


Project structure

  • download_scripts/ — Helper scripts to download models, datasets, and to install the evaluation framework.
  • quantization_scripts/ — Script to create quantized model variants (nf4bit, fp4bit, 8bit).
  • test_memorization/ — Core scripts for memorization experiments (e.g., k32_memorization_eval.py, multi_k_memorization_eval.py).
  • test_performance/ — Wrapper to run tasks with lm-evaluation-harness (run_eval.py).
  • plots/plotting_scripts/ — Analysis and visualization scripts (e.g., visualize_memorization_experiment.py).
  • data/ — Result data (split into mem_eval_results/ and perf_eval_results/, each with deduped/ and duped/).
  • models/ — Downloaded models and their quantized versions.
  • main_scripts/ — Convenience scripts for full workflows (run_all_evals.sh, run_all_mem_evals.sh, run_all_perf_evals.sh).

What & Why (Motivation)

  • Measure to what extent large language models memorize training data and how performance changes with different context lengths k.
  • Compare quantization modes (Int8, NF4, FP4) to assess their effect on memorization, runtime, and VRAM — important for resource-constrained deployments.
  • Produce reproducible results (JSONL) and figures that make interpretation straightforward.

Key scripts & what they do

1) Downloading & Setup

  • download_scripts/download_pythia12b.py — Download a specific revision of Pythia-12B (e.g., step143000) into models/.
  • download_scripts/download_pythia-memorized-evals.py — Download the pythia-memorized-evals dataset (cached under test_memorization/).
  • download_scripts/download_and_install_lm-evaluation-harness.py — Clone and install lm-evaluation-harness and necessary backends (HF, vLLM, bitsandbytes).

Why? These scripts automate setup (models, datasets, evaluation tooling) so experiments are repeatable.


2) Quantization

  • quantization_scripts/quantize_pythia12b.py — Create quantized model variants (nf4bit, fp4bit, 8bit) and save them as new directories in models/.

Why? To observe how 4-bit and 8-bit quantization affects memorization, runtime, and resource consumption.

Note: Quantization is a required step for this project — the evaluation workflows expect quantized variants to exist and will compare them to the FP16 baseline.
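To build intuition for what quantization does to weights, here is a minimal pure-Python sketch of absmax int8 quantization. This is only a conceptual illustration: the actual quantize_pythia12b.py produces its Int8/NF4/FP4 variants via bitsandbytes, and the function names below (quantize_absmax_int8, dequantize) are hypothetical.

```python
# Toy absmax int8 quantization: scale weights into [-127, 127] and back.
# Illustrative only -- the repository uses bitsandbytes for real models.

def quantize_absmax_int8(weights):
    """Quantize a list of floats to int8 codes using absmax scaling."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.08, 0.95]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Rounding each weight to one of 255 levels bounds the per-weight error by half a scale step; 4-bit schemes like NF4/FP4 use far fewer levels, which is why their effect on memorization and accuracy is worth measuring.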


3) Memorization Evaluation

  • test_memorization/k32_memorization_eval.py — Evaluate a fixed context length k (default: 32), measure token accuracy, exact matches, successive-correct tokens, runtime, and VRAM; results are saved as JSONL under data/mem_eval_results/{deduped|duped}/.
  • test_memorization/multi_k_memorization_eval.py — Run evaluations across multiple k values (e.g., 4..48) and support both start_of_sequence and end_of_sequence contexts.
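One plausible reading of the two context modes is sketched below: start_of_sequence prompts with the first k tokens, while end_of_sequence prompts with the k tokens immediately before the final evaluated span. The helper split_context is hypothetical; the exact slicing in multi_k_memorization_eval.py may differ.

```python
def split_context(tokens, k, eval_token_count, mode="start_of_sequence"):
    """Return (prompt, target) token lists for one memorization test.

    Assumed semantics: start_of_sequence evaluates the tokens right after
    the first k tokens; end_of_sequence evaluates the last eval_token_count
    tokens, prompted by the k tokens preceding them.
    """
    if mode == "start_of_sequence":
        prompt = tokens[:k]
        target = tokens[k:k + eval_token_count]
    elif mode == "end_of_sequence":
        target = tokens[-eval_token_count:]
        prompt = tokens[-(k + eval_token_count):-eval_token_count]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return prompt, target

seq = list(range(64))  # stand-in for a tokenized training sequence
p1, t1 = split_context(seq, k=32, eval_token_count=16, mode="start_of_sequence")
p2, t2 = split_context(seq, k=32, eval_token_count=16, mode="end_of_sequence")
```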

Important arguments (examples):

--model_list <model1> <model2> ...
--k / --start_k / --end_k
--device cuda:0
--number_of_tests N
--eval_token_count M
--save_results

Why? To analyze how the number of context tokens (k) influences the probability that a model reproduces target tokens exactly.
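The three per-test metrics named above can be sketched in a few lines. This is a simplified stand-in, not the repository's implementation; memorization_metrics is a hypothetical name.

```python
def memorization_metrics(generated, reference):
    """Compare greedily generated tokens against the reference continuation.

    Returns (token_accuracy, exact_match, successive_correct):
    - token_accuracy: fraction of positions where tokens match
    - exact_match: True if the whole continuation was reproduced
    - successive_correct: length of the leading run of correct tokens
    """
    matches = [g == r for g, r in zip(generated, reference)]
    token_accuracy = sum(matches) / len(reference)
    exact_match = len(generated) >= len(reference) and all(matches)
    successive_correct = 0
    for m in matches:          # stop at the first wrong token
        if not m:
            break
        successive_correct += 1
    return token_accuracy, exact_match, successive_correct

# Toy example: third generated token diverges from the reference.
acc, exact, streak = memorization_metrics([5, 7, 9, 2], [5, 7, 1, 2])
```

Under greedy decoding, a long successive-correct run with a large k is the strongest signal that the sequence was memorized rather than guessed.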


4) Performance Evaluation

  • test_performance/run_eval.py — Wrapper around lm_eval (lm-evaluation-harness). Runs benchmarks across tasks (ARC, MMLU, GSM8K, etc.) and stores outputs in data/perf_eval_results/.

Note: Batch size is auto-selected, but for 8-bit models it is conservatively set to 7 to avoid OOM.
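That batch-size rule can be paraphrased as follows; pick_batch_size is an illustrative helper, not a function name from run_eval.py.

```python
def pick_batch_size(quant_mode):
    """Return the lm_eval batch-size argument for a quantization mode.

    8-bit models get a fixed, conservative batch to avoid OOM (per the
    note above); all other modes let lm-evaluation-harness auto-detect
    the largest batch that fits.
    """
    return 7 if quant_mode == "8bit" else "auto"
```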


5) Plotting & Visualization

  • plots/plotting_scripts/visualize_memorization_experiment.py — Generates performance, efficiency, variability, and relative-difference plots from JSONL results.
  • plots/plotting_scripts/k32_6000_mem_eval.py — Focused visualization for K=32 with 6000 tests (paper-style plots).
  • plots/plotting_scripts/merged_visualize_memorization_experiment.py — Compare start- vs end-of-sequence results with overlay plots.

Why? Visualizations make it easy to identify trends across k, quantization modes, and resource metrics.


Setup

Using Conda (recommended)

This project requires Python 3.10. To set up your environment with conda:

conda create -n bbq python=3.10
conda activate bbq
pip install -r requirements.txt

Recommended workflow

  1. Check system:
python3 test_scripts/test_versions.py
  2. Install dependencies and set up tools (downloads the dataset and installs lm-evaluation-harness):
bash main_scripts/run_all_installs.sh
  3. Download models, quantize (required), and run evaluations:
  • Full end-to-end run (downloads models, creates quantized variants, runs memorization and performance evaluations for both "duped" and "deduped"):
bash main_scripts/run_all_evals.sh
  • Run memorization evaluations only:
bash main_scripts/run_all_mem_evals.sh
  • Run performance evaluations only:
bash main_scripts/run_all_perf_evals.sh
  • Generate all standard plots:
bash main_scripts/run_all_plotting.sh

Notes:

  • Quantization is required by the experimental protocol; the main scripts perform quantization as part of their flow.
  • Use the individual Python scripts (in test_memorization/, test_performance/, or plots/plotting_scripts/) directly only if you need to customize arguments. Use --help for details.

Results & output

  • Memorization results are saved as JSONL in data/mem_eval_results/{deduped|duped}/.
  • Performance results are saved in data/perf_eval_results/{deduped|duped}/ (one timestamped folder per model).
  • Plots are written to plots/mem_eval/{deduped|duped}/ (created automatically).
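The JSONL result files can be loaded for custom analysis with a few lines of standard-library Python. The record fields shown here (model, k, token_accuracy) are illustrative assumptions; inspect a real file under data/mem_eval_results/ for the actual schema.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Demo: write two toy records (field names are assumptions) and read them back.
tmp = Path("toy_results.jsonl")
records = [{"model": "pythia-12b", "k": 32, "token_accuracy": 0.41},
           {"model": "pythia-12b-nf4bit", "k": 32, "token_accuracy": 0.39}]
tmp.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")
loaded = load_jsonl(tmp)
tmp.unlink()
```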

Notes & best practices

  • Use a GPU (cuda:0) for realistic runtimes; scripts default to cuda:0.
  • Watch VRAM when loading FP16 models; reduce batch sizes or use quantized variants if you encounter OOM.
  • Scripts use greedy generation (do_sample=False) for determinism where possible, but results may still depend on environment and hardware.
  • For reproducibility, use the --random_seed options in the memorization scripts.

Requirements

  • See requirements.txt for required Python packages.
  • Recommended: Python 3.10+, PyTorch with CUDA support, transformers, datasets, bitsandbytes (for quantization), and lm-evaluation-harness.

Contact & further information

If you have questions about using or extending the experiments, adding new models, or testing other quantization methods, please open an issue or submit a pull request.

Good luck reproducing and analyzing the experiments.
