qflen/tinycompress

tinycompress

A personal study of inference-time compression and optimization for small open-weight language models, run entirely on a single Apple M5 MacBook with 32 GB unified memory. Not a benchmark suite, not a paper, not a production system. Every number comes from a script in this repo that actually ran on this machine.

Speculative decoding animation

Speculative decoding on this hardware, visualized. Qwen2.5-0.5B drafts four tokens; Qwen2.5-3B verifies them in one forward pass. Yellow = proposed, green = accepted, red = rejected, blue = bonus correction from the target. The draft agrees often enough that multiple tokens fall out per target step, which is where the speedup comes from. Aggregate accept rates across four prompts are in docs/RESULTS.md; the GIF itself is regenerated by scripts/make_specdec_gif.py.
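
The accept/reject loop the animation shows can be sketched in a few lines. This is a toy, not the repo's implementation: the "models" are deterministic next-token functions over integer tokens, standing in for Qwen2.5-0.5B (draft) and Qwen2.5-3B (target), and verification here calls the target once per token for clarity, where the real speedup comes from verifying all k drafted tokens in one batched forward pass.

```python
def draft_next(ctx):
    # Hypothetical small model: usually agrees with the target.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Hypothetical large model, treated as ground truth;
    # disagrees with the draft whenever the last token is 7.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens, verify them against the target, keep the longest
    agreeing prefix, then append one bonus/correction token from the target."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for tok in proposal[len(ctx):]:
        if target_next(accepted) == tok:
            accepted.append(tok)   # draft agreed: token is "free"
        else:
            break                  # first disagreement: discard the rest
    accepted.append(target_next(accepted))  # bonus token from the target
    return accepted

print(speculative_step([5], k=4))  # several tokens land in one target step
```

When the draft agrees for the first few positions, each target step yields multiple output tokens, which is exactly the effect the GIF visualizes.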

Scope

Thirteen methods across three Qwen2.5 base checkpoints:

  • Baselines: fp32 / fp16 / bf16 on CPU or MPS.
  • Quantization: dynamic int8, weight-only int8/int4, small GPTQ-style pass.
  • Compile / export: torch.compile, ONNX Runtime CPU.
  • KV + decode: KV-cache growth probe, int8 KV-cache, speculative decoding, SDPA probe.
  • Sparsity and distillation: magnitude pruning, short distillation run.
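
To make the weight-only quantization entries concrete, here is a minimal per-row symmetric int8 sketch. It is illustrative only, not the repo's code: weights are mapped offline to int8 plus one float scale per row, and dequantized back to float before the matmul at inference time.

```python
def quantize_row(w):
    """Map a row of float weights to int8 values plus one float scale."""
    scale = max(abs(x) for x in w) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

row = [0.40, -1.27, 0.03]
q, s = quantize_row(row)       # q holds int8-range ints, s one fp scale
approx = dequantize_row(q, s)  # close to row, up to rounding error
```

Storage drops from 4 bytes (fp32) to roughly 1 byte per weight plus one scale per row; the int4 variant uses the same idea with a [-7, 7] range and roughly twice the rounding error.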

Everything is driven by one harness in src/tinycompress/eval/ and lands in one JSON per (model, method) under results/raw/. CoreML, CUDA kernels (bitsandbytes / AWQ / GPTQ / FlashAttention), and frontier-scale models are out of scope.
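
The KV-cache growth probe measures empirically what a back-of-envelope formula predicts: cache size scales linearly with sequence length, and int8 KV halves it relative to fp16. A sketch of that arithmetic, with placeholder config values (illustrative, not the actual Qwen2.5 configs):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elt):
    # 2x for keys and values; one cache entry per layer per KV head per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt

# Placeholder shape, not a real Qwen2.5 config:
fp16 = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, seq_len=4096, bytes_per_elt=2)
int8 = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, seq_len=4096, bytes_per_elt=1)
print(fp16 / 2**20, int8 / 2**20)  # → 224.0 112.0 (MiB)
```

On 32 GB of unified memory shared with the OS and the weights themselves, this linear growth is what caps usable context length, which is why the probe gets its own method slot.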

Hardware and software

Hardware and software details are captured in every result JSON: Apple M5, 32 GB unified memory, macOS 26.1, Python 3.14.3, torch 2.11.0 (MPS available). Runs are sequential on a laptop, with no active throttling management.

Models

Three Qwen2.5 base checkpoints, chosen so the same tokenizer is shared across the ladder (required for the speculative-decoding experiment):

  • Qwen/Qwen2.5-0.5B - draft model for spec decoding.
  • Qwen/Qwen2.5-1.5B - spec-decoding target; fills the "~1B" slot.
  • Qwen/Qwen2.5-3B - largest that fits comfortably in 32 GB with headroom.

Results

Per-model detail: 0.5B / 1.5B / 3B.

Figures: PPL by method / latency by method / KV growth / pruning cliff.

Reproducing

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,plots]"
# exact rerun of the numbers above:
pip install -r requirements-lock.txt -e .

pytest -q                                                            # smoke tests
python -m tinycompress.hardware_info                                 # sanity check
python scripts/run_baseline.py --model qwen2_5_0_5b --method fp32_cpu  # one cell
bash scripts/run_all.sh                                              # full matrix

The per-area runners under scripts/run_*.sh are idempotent. scripts/self_audit.py is the cross-cut check and must stay green ([OK] all checks passed).
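
As a sketch of the kind of cross-cut check scripts/self_audit.py performs, the loop below walks results/raw/ and flags JSONs missing required fields. The key names here are assumptions for illustration, not the repo's actual schema or audit logic.

```python
import json
from pathlib import Path

REQUIRED = {"model", "method", "hardware"}  # hypothetical field names

def audit(raw_dir):
    """Return a list of problems found in the raw result JSONs (empty = green)."""
    problems = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        record = json.loads(path.read_text())
        missing = REQUIRED - record.keys()
        if missing:
            problems.append(f"{path.name}: missing {sorted(missing)}")
    return problems

# Usage: an empty return value corresponds to the "[OK] all checks passed" banner.
# problems = audit("results/raw")
```

Because the checks only read results/raw/ and never mutate it, the audit can run after any subset of the per-area runners.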

Repo layout

src/tinycompress/   library: loader, harness, quant / compile / kv / prune / distill
scripts/            thin entry points (run_*.py, make_tables, make_plots, self_audit)
results/raw/        one JSON per (model, method); ground truth
results/tables/     derived tables (regenerated from raw)
results/figures/    plots (regenerated from raw)
docs/               METHODS.md, RESULTS.md, LIMITATIONS.md
tests/              hermetic, CPU-only, CI-friendly

License

MIT.

About

From-scratch implementations and measured benchmarks of LLM inference compression: int4/int8 quantization, GPTQ-like calibration, int8 KV cache, pruning, distillation, speculative decoding, torch.compile, and ONNX. Every number from a logged, self-audited run.
