qflen/tinycompress

tinycompress

A personal study of inference-time compression and optimization for small open-weight language models, run entirely on a single Apple M5 MacBook with 32 GB unified memory. Not a benchmark suite, not a paper, not a production system. Every number comes from a script in this repo that actually ran on this machine.

Speculative decoding animation

Speculative decoding on this hardware, visualized. Qwen2.5-0.5B drafts four tokens; Qwen2.5-3B verifies them in one forward pass. Yellow = proposed, green = accepted, red = rejected, blue = bonus correction from the target. The draft agrees often enough that multiple tokens fall out per target step, which is where the speedup comes from. Aggregate accept rates across four prompts are in docs/RESULTS.md; the GIF itself is regenerated by scripts/make_specdec_gif.py.
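
The accept/reject loop the animation shows can be sketched in a few lines. This is a toy, not the repo's implementation: the "models" are deterministic next-token functions over integer tokens, standing in for Qwen2.5-0.5B (draft) and Qwen2.5-3B (target), and verification here calls the target once per token for clarity, where the real speedup comes from verifying all k drafted tokens in one batched forward pass.

```python
def draft_next(ctx):
    # Hypothetical small model: usually agrees with the target.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Hypothetical large model, treated as ground truth;
    # disagrees with the draft whenever the last token is 7.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens, verify them against the target, keep the longest
    agreeing prefix, then append one bonus/correction token from the target."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for tok in proposal[len(ctx):]:
        if target_next(accepted) == tok:
            accepted.append(tok)   # draft agreed: token is "free"
        else:
            break                  # first disagreement: discard the rest
    accepted.append(target_next(accepted))  # bonus token from the target
    return accepted

print(speculative_step([5], k=4))  # several tokens land in one target step
```

When the draft agrees for the first few positions, each target step yields multiple output tokens, which is exactly the effect the GIF visualizes.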

Scope

Thirteen methods across three Qwen2.5 base checkpoints:

  • Baselines: fp32 / fp16 / bf16 on CPU or MPS.
  • Quantization: dynamic int8, weight-only int8/int4, small GPTQ-style pass.
  • Compile / export: torch.compile, ONNX Runtime CPU.
  • KV + decode: KV-cache growth probe, int8 KV-cache, speculative decoding, SDPA probe.
  • Sparsity and distillation: magnitude pruning, short distillation run.
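
To make the weight-only quantization entries concrete, here is a minimal per-row symmetric int8 sketch. It is illustrative only, not the repo's code: weights are mapped offline to int8 plus one float scale per row, and dequantized back to float before the matmul at inference time.

```python
def quantize_row(w):
    """Map a row of float weights to int8 values plus one float scale."""
    scale = max(abs(x) for x in w) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

row = [0.40, -1.27, 0.03]
q, s = quantize_row(row)       # q holds int8-range ints, s one fp scale
approx = dequantize_row(q, s)  # close to row, up to rounding error
```

Storage drops from 4 bytes (fp32) to roughly 1 byte per weight plus one scale per row; the int4 variant uses the same idea with a [-7, 7] range and roughly twice the rounding error.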

Everything is driven by one harness in src/tinycompress/eval/ and lands in one JSON per (model, method) under results/raw/. CoreML, CUDA kernels (bitsandbytes / AWQ / GPTQ / FlashAttention), and frontier-scale models are out of scope.
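
The KV-cache growth probe measures empirically what a back-of-envelope formula predicts: cache size scales linearly with sequence length, and int8 KV halves it relative to fp16. A sketch of that arithmetic, with placeholder config values (illustrative, not the actual Qwen2.5 configs):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elt):
    # 2x for keys and values; one cache entry per layer per KV head per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt

# Placeholder shape, not a real Qwen2.5 config:
fp16 = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, seq_len=4096, bytes_per_elt=2)
int8 = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, seq_len=4096, bytes_per_elt=1)
print(fp16 / 2**20, int8 / 2**20)  # → 224.0 112.0 (MiB)
```

On 32 GB of unified memory shared with the OS and the weights themselves, this linear growth is what caps usable context length, which is why the probe gets its own method slot.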

Hardware and software

Hardware and software details are captured in every result JSON: Apple M5, 32 GB unified memory, macOS 26.1, Python 3.14.3, torch 2.11.0 (MPS available). Runs are sequential on a laptop, with no active throttling management.

Models

Three Qwen2.5 base checkpoints, chosen so the same tokenizer is shared across the ladder (required for the speculative-decoding experiment):

  • Qwen/Qwen2.5-0.5B - draft model for spec decoding.
  • Qwen/Qwen2.5-1.5B - spec-decoding target; fills the "~1B" slot.
  • Qwen/Qwen2.5-3B - largest that fits comfortably in 32 GB with headroom.

Results

Per-model detail: 0.5B / 1.5B / 3B.

Figures: PPL by method / latency by method / KV growth / pruning cliff.

Reproducing

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,plots]"
# exact rerun of the numbers above:
pip install -r requirements-lock.txt -e .

pytest -q                                                            # smoke tests
python -m tinycompress.hardware_info                                 # sanity check
python scripts/run_baseline.py --model qwen2_5_0_5b --method fp32_cpu  # one cell
bash scripts/run_all.sh                                              # full matrix

The per-area runners under scripts/run_*.sh are idempotent. scripts/self_audit.py is the cross-cut check and must stay green ([OK] all checks passed).
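
As a sketch of the kind of cross-cut check scripts/self_audit.py performs, the loop below walks results/raw/ and flags JSONs missing required fields. The key names here are assumptions for illustration, not the repo's actual schema or audit logic.

```python
import json
from pathlib import Path

REQUIRED = {"model", "method", "hardware"}  # hypothetical field names

def audit(raw_dir):
    """Return a list of problems found in the raw result JSONs (empty = green)."""
    problems = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        record = json.loads(path.read_text())
        missing = REQUIRED - record.keys()
        if missing:
            problems.append(f"{path.name}: missing {sorted(missing)}")
    return problems

# Usage: an empty return value corresponds to the "[OK] all checks passed" banner.
# problems = audit("results/raw")
```

Because the checks only read results/raw/ and never mutate it, the audit can run after any subset of the per-area runners.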

Repo layout

src/tinycompress/   library: loader, harness, quant / compile / kv / prune / distill
scripts/            thin entry points (run_*.py, make_tables, make_plots, self_audit)
results/raw/        one JSON per (model, method); ground truth
results/tables/     derived tables (regenerated from raw)
results/figures/    plots (regenerated from raw)
docs/               METHODS.md, RESULTS.md, LIMITATIONS.md
tests/              hermetic, CPU-only, CI-friendly

License

MIT.

About

From-scratch implementations and measured benchmarks of LLM inference compression: int4/int8 quantization, GPTQ-like calibration, int8 KV cache, pruning, distillation, speculative decoding, torch.compile, and ONNX. Every number from a logged, self-audited run.
