## Problem
Installing llama-cpp-python with a GPU backend requires setting `CMAKE_ARGS` as an environment variable at build time:

```shell
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

This creates pain across the ecosystem:
- **Not declarable in `pyproject.toml`** — every downstream project needs custom Makefiles or install scripts with GPU auto-detection logic (macOS → Metal, `nvidia-smi` → CUDA, `rocminfo` → ROCm, fallback → OpenBLAS). This logic is duplicated across hundreds of projects.
- **Cache invalidation is broken** — `pip` and `uv` cache wheels by package version, not by `CMAKE_ARGS`. A cached OpenBLAS wheel is silently reused when Metal or CUDA is requested. The only workaround is `--no-cache`, which defeats caching entirely.
- **GPU prebuilt wheels stop at Python 3.12** — the Metal wheel CI (`build-wheels-metal.yaml`) is hardcoded to `CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"`, and the CUDA wheel CI (`build-wheels-cuda.yaml`) pins its matrix to Python 3.9–3.12. CPU-only wheels include 3.13 (via the default cibuildwheel config in `build-and-release.yaml`), but the arm64 job there also pins to cp38–cp312. No workflow produces 3.14 or free-threaded (3.13t/3.14t) wheels. Python 3.13 has been stable since October 2024 and 3.14 since October 2025, and free-threaded builds are increasingly important — vLLM, llguidance, and the broader no-GIL ecosystem depend on them.
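The auto-detection logic from the first bullet typically looks like the following; a minimal sketch, assuming current ggml CMake flag names (`GGML_METAL`, `GGML_CUDA`, `GGML_HIPBLAS`, `GGML_BLAS`) — the exact flags vary by llama.cpp version:

```shell
#!/bin/sh
# Sketch of the per-project GPU detection that every downstream
# consumer currently reimplements. Flag names are illustrative.
detect_cmake_args() {
    if [ "$(uname -s)" = "Darwin" ]; then
        echo "-DGGML_METAL=on"
    elif command -v nvidia-smi >/dev/null 2>&1; then
        echo "-DGGML_CUDA=on"
    elif command -v rocminfo >/dev/null 2>&1; then
        echo "-DGGML_HIPBLAS=on"
    else
        echo "-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS"
    fi
}

CMAKE_ARGS="$(detect_cmake_args)"
echo "$CMAKE_ARGS"
# The install step is then something like:
#   CMAKE_ARGS="$CMAKE_ARGS" pip install llama-cpp-python --no-cache
# (--no-cache is needed because pip keys cached wheels by version only)
```

Every project carrying a script like this is working around the lack of a declarative mechanism.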
Current state of published wheel indexes:

| Index | cp313 | cp314 | Free-threaded |
|---|---|---|---|
| CPU (`/whl/cpu/`) | ✅ | ❌ | ❌ |
| Metal (`/whl/metal/`) | ❌ | ❌ | ❌ |
| CUDA (`/whl/cu1xx/`) | ❌ | ❌ | ❌ |
## Proposed changes
### 1. Expand the prebuilt wheel matrix (highest impact, smallest change)
Update `CIBW_BUILD` in the Metal/CUDA workflows and add free-threaded support. This is the single highest-impact change — it eliminates source builds for most users.
`build-wheels-metal.yaml`:
Upgrade cibuildwheel from v2.22.0 to v3.x (3.0 added cp314/cp314t support). In cibuildwheel 3.0, cp314t is built by default (free-threading is no longer experimental in 3.14), and cp313t requires `CIBW_ENABLE: cpython-freethreading`.
```diff
- uses: pypa/cibuildwheel@v2.22.0
+ uses: pypa/cibuildwheel@v3.0.0
  env:
-   CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"
+   CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+   CIBW_ENABLE: cpython-freethreading
```

`build-and-release.yaml` — same cibuildwheel upgrade, and update the `build_wheels_arm64` job:
```diff
- CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-*"
+ CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+ CIBW_ENABLE: cpython-freethreading
```

`build-wheels-cuda.yaml` — uses a different build system (`python -m build --wheel` driven by a PowerShell matrix). Its `pyver` matrix would need `"3.13"` and `"3.14"` added.
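For the CUDA workflow, the change would be confined to the matrix definition. A sketch, assuming a `pyver` key as described above (the real workflow's key names and quoting may differ):

```yaml
# build-wheels-cuda.yaml (sketch) — extend the Python version matrix
strategy:
  matrix:
    pyver: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
```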
With prebuilt wheels, any downstream project can use uv's declarative index support:
```toml
# pyproject.toml — zero Makefile, zero CMAKE_ARGS
[project]
dependencies = ["llama-cpp-python~=0.3"]

[tool.uv.sources]
llama-cpp-python = [
    { index = "llama-metal", marker = "sys_platform == 'darwin'" },
    { index = "llama-cpu", marker = "sys_platform == 'linux'" },
]

[[tool.uv.index]]
name = "llama-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"
explicit = true

[[tool.uv.index]]
name = "llama-cpu"
url = "https://abetlen.github.io/llama-cpp-python/whl/cpu"
explicit = true
```

### 2. Document `--config-settings` as the source-build path
Since the build backend is scikit-build-core, CMake args can be passed via the standard PEP 517 config-settings interface:

```shell
pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
# or with uv:
uv pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
```

This is cleaner than the `CMAKE_ARGS` env var — it's the standard PEP 517 mechanism, more explicit, and more discoverable. It's already supported by scikit-build-core but not documented in the README or install docs.
### 3. (Future) Adopt PEP 817 Wheel Variants
PEP 817 (draft, Dec 2025) introduces a standard mechanism for GPU/accelerator wheel variants. PyTorch 2.9 already ships experimental variant-enabled wheels. Once PEP 817 is accepted and tool support lands, llama-cpp-python could publish variant wheels that are auto-selected by the installer:

```shell
# Future: just works — the installer picks Metal/CUDA/CPU automatically
pip install llama-cpp-python
```

This is mentioned for context only — the actionable items are (1) and (2) above.
## Ecosystem context
- Quansight offered funded engineering help for free-threaded support in #2103 (via vLLM ecosystem work) — awaiting maintainer signal
- ~470K monthly PyPI downloads (pypistats) — every project using this beyond toy scripts hits this install wall
- How others solved it: PyTorch uses per-backend index URLs + PEP 817 variants; ONNX Runtime publishes separate PyPI packages per backend (`onnxruntime-gpu`, `onnxruntime-silicon`)
## Related
Wheel matrix gaps (same root cause):
- #2103 — Pre-built wheels for Python 3.14 and 3.14 free-threaded
- #2130 — Pre-built CPU-only wheel for Windows (cp313) for version 0.3.16+ (Gemma 3 support)
- #2068 — Where can I download a wheel for CUDA 12.8? (llama.cpp with ComfyUI custom nodes)
- #2091 — cu128 wheel (CUDA 12.8 wheel request)
Wheel variants / long-term packaging:
- #2092 — Add support for experimental wheel variants (wheelnext)
- #1506 — Multi-arch support for pre-built CPU wheel (by @abetlen)
- Discussion #1875 — Automating pre-building of wheels for all platforms
Downstream impact of missing wheels:
- #2118 — Installation deadlock on Hugging Face Spaces (CPU): wheels fail (musl/glibc mismatch) and source builds time out
- #2113 — No working wheels for Debian/Ubuntu
Happy to submit a PR for (1) and (2).