## Problem
Installing llama-cpp-python with a GPU backend requires setting `CMAKE_ARGS` as an environment variable at build time:

```shell
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

This creates pain across the ecosystem:
- **Not declarable in `pyproject.toml`** — every downstream project needs custom Makefiles or install scripts with GPU auto-detection logic (macOS → Metal, `nvidia-smi` → CUDA, `rocminfo` → ROCm, fallback → OpenBLAS). This logic is duplicated across hundreds of projects.
- **Cache invalidation is broken** — `pip` and `uv` cache wheels by package version, not by `CMAKE_ARGS`. A cached OpenBLAS wheel is silently reused when Metal or CUDA is requested. The only workaround is `--no-cache`, which defeats caching entirely.
- **GPU prebuilt wheels stop at Python 3.12** — the Metal wheel CI (`build-wheels-metal.yaml`) is hardcoded to `CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"`, and the CUDA wheel CI (`build-wheels-cuda.yaml`) pins its matrix to Python 3.9–3.12. CPU-only wheels include 3.13 (via the default cibuildwheel config in `build-and-release.yaml`), but the arm64 job there also pins to cp38–cp312. No workflow produces 3.14 or free-threaded (3.13t/3.14t) wheels. Python 3.13 has been stable since October 2024 and 3.14 since October 2025, and free-threaded builds are increasingly important — vLLM, llguidance, and the broader no-GIL ecosystem depend on them.
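The auto-detection logic from the first bullet typically looks like the following; a minimal sketch, assuming current ggml CMake flag names (`GGML_METAL`, `GGML_CUDA`, `GGML_HIPBLAS`, `GGML_BLAS`) — the exact flags vary by llama.cpp version:

```shell
#!/bin/sh
# Sketch of the per-project GPU detection that every downstream
# consumer currently reimplements. Flag names are illustrative.
detect_cmake_args() {
    if [ "$(uname -s)" = "Darwin" ]; then
        echo "-DGGML_METAL=on"
    elif command -v nvidia-smi >/dev/null 2>&1; then
        echo "-DGGML_CUDA=on"
    elif command -v rocminfo >/dev/null 2>&1; then
        echo "-DGGML_HIPBLAS=on"
    else
        echo "-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS"
    fi
}

CMAKE_ARGS="$(detect_cmake_args)"
echo "$CMAKE_ARGS"
# The install step is then something like:
#   CMAKE_ARGS="$CMAKE_ARGS" pip install llama-cpp-python --no-cache
# (--no-cache is needed because pip keys cached wheels by version only)
```

Every project carrying a script like this is working around the lack of a declarative mechanism.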
Current state of published wheel indexes:

| Index | cp313 | cp314 | Free-threaded |
|---|---|---|---|
| CPU (`/whl/cpu/`) | ✅ | ❌ | ❌ |
| Metal (`/whl/metal/`) | ❌ | ❌ | ❌ |
| CUDA (`/whl/cu1xx/`) | ❌ | ❌ | ❌ |
## Proposed changes
### 1. Expand the prebuilt wheel matrix (highest impact, smallest change)
Update `CIBW_BUILD` in the Metal/CUDA workflows and add free-threaded support. This is the single highest-impact change — it eliminates source builds for most users.
`build-wheels-metal.yaml`:
Upgrade cibuildwheel from v2.22.0 to v3.x (3.0 added cp314/cp314t support). In cibuildwheel 3.0, cp314t is built by default (free-threading is no longer experimental in 3.14), and cp313t requires `CIBW_ENABLE: cpython-freethreading`.
```diff
- uses: pypa/cibuildwheel@v2.22.0
+ uses: pypa/cibuildwheel@v3.0.0
  env:
-   CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"
+   CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+   CIBW_ENABLE: cpython-freethreading
```

`build-and-release.yaml` — same cibuildwheel upgrade, and update the `build_wheels_arm64` job:
```diff
- CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-*"
+ CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+ CIBW_ENABLE: cpython-freethreading
```

`build-wheels-cuda.yaml` — uses a different build system (`python -m build --wheel` driven by a PowerShell matrix). Its `pyver` matrix would need `"3.13"` and `"3.14"` added.
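For the CUDA workflow, the change would be confined to the matrix definition. A sketch, assuming a `pyver` key as described above (the real workflow's key names and quoting may differ):

```yaml
# build-wheels-cuda.yaml (sketch) — extend the Python version matrix
strategy:
  matrix:
    pyver: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
```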
With prebuilt wheels, any downstream project can use uv's declarative index support:
```toml
# pyproject.toml — zero Makefile, zero CMAKE_ARGS
[project]
dependencies = ["llama-cpp-python~=0.3"]

[tool.uv.sources]
llama-cpp-python = [
    { index = "llama-metal", marker = "sys_platform == 'darwin'" },
    { index = "llama-cpu", marker = "sys_platform == 'linux'" },
]

[[tool.uv.index]]
name = "llama-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"
explicit = true

[[tool.uv.index]]
name = "llama-cpu"
url = "https://abetlen.github.io/llama-cpp-python/whl/cpu"
explicit = true
```

### 2. Document `--config-settings` as the source-build path
Since the build backend is scikit-build-core, CMake args can be passed via the standard PEP 517 config-settings interface:

```shell
pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
# or with uv:
uv pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
```

This is cleaner than the `CMAKE_ARGS` env var — it's the standard PEP 517 mechanism, more explicit, and more discoverable. It's already supported by scikit-build-core but not documented in the README or install docs.
### 3. (Future) Adopt PEP 817 Wheel Variants
PEP 817 (draft, Dec 2025) introduces a standard mechanism for GPU/accelerator wheel variants. PyTorch 2.9 already ships experimental variant-enabled wheels. Once PEP 817 is accepted and tool support lands, llama-cpp-python could publish variant wheels that are auto-selected by the installer:

```shell
# Future: just works — the installer picks Metal/CUDA/CPU automatically
pip install llama-cpp-python
```

This is mentioned for context only — the actionable items are (1) and (2) above.
## Ecosystem context
- Quansight offered funded engineering help for free-threaded support in #2103 (via vLLM ecosystem work) — awaiting maintainer signal
- ~470K monthly PyPI downloads (pypistats) — every project using this beyond toy scripts hits this install wall
- How others solved it: PyTorch uses per-backend index URLs + PEP 817 variants; ONNX Runtime publishes separate PyPI packages per backend (`onnxruntime-gpu`, `onnxruntime-silicon`)
## Related
Wheel matrix gaps (same root cause):
- #2103 — Pre-built wheels for Python 3.14 and 3.14 free-threaded
- #2130 — Pre-built CPU-only wheel for Windows (cp313) for version 0.3.16+ (Gemma 3 support)
- #2068 — Where can I download a wheel for CUDA 12.8? (llama.cpp with ComfyUI custom nodes)
- #2091 — cu128 wheel (CUDA 12.8 wheel request)
Wheel variants / long-term packaging:
- #2092 — Add support for experimental wheel variants (wheelnext)
- #1506 — Multi-arch support for pre-built CPU wheel (by @abetlen)
- Discussion #1875 — Automating pre-building of wheels for all platforms
Downstream impact of missing wheels:
- #2118 — Installation deadlock on Hugging Face Spaces (CPU): wheels fail (musl/glibc mismatch) and source builds time out
- #2113 — No working wheels for Debian/Ubuntu
Happy to submit a PR for (1) and (2).