Run llama.cpp in a GPU-accelerated Docker container.
By default, the service requires a CUDA-capable GPU with at least 8 GB of VRAM. If you don't have an Nvidia GPU with CUDA support, the CPU version will be built and used instead.
```sh
./docker-task build
./docker-task up
```

After starting up, the chat server will be available at http://localhost:8080.
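To verify the server is responding, you can send a request to the chat completions endpoint. This is a minimal sketch; it assumes the container exposes llama.cpp's built-in OpenAI-compatible API on port 8080:

```sh
# Minimal smoke test against the chat server (assumes the OpenAI-compatible
# /v1/chat/completions endpoint provided by llama.cpp's server is reachable).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```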
Options are specified as environment variables in the docker-compose.yml file. By default, the following options are set:
- `LLAMA_ARG_CTX_SIZE`: The context size to use (default is 2048)
- `LLAMA_ARG_HF_REPO`: The repository and quantization of the HuggingFace model to use (default is `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:q5_k_m`)
- `LLAMA_ARG_N_GPU_LAYERS`: The number of layers to run on the GPU (default is 99)
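As a sketch, the defaults above might look like this in docker-compose.yml (the service name `llama` is a placeholder for whatever this repo actually uses):

```yaml
services:
  llama:
    environment:
      LLAMA_ARG_CTX_SIZE: 2048
      LLAMA_ARG_HF_REPO: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:q5_k_m
      LLAMA_ARG_N_GPU_LAYERS: 99
```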
See the llama.cpp documentation for the complete list of server options.
Use the LLAMA_ARG_HF_REPO environment variable to automatically download and use a model from HuggingFace.
The format is `<huggingface-repository><:quant>`, where `<:quant>` is optional and specifies the quantization to use. For example, to download a model from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF with no quantization, set the variable to `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`. To use the same model with q5_k_m quantization, set the variable to `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:q5_k_m`.
Models must be in the GGUF format, the format used by llama.cpp. Models quantized with q5_k_m are recommended for a good balance between speed and accuracy. To list popular models, run `./docker-task --help`.
Confused about which model to use? Below is a list of popular models, ranked by ELO rating; generally, the higher the ELO rating, the better the model. Set `LLAMA_ARG_HF_REPO` to the repository name to use a specific model (see the example after the table).
| Model | Repository | Parameters | Q5_K_M Size | ~ELO | Notes |
|---|---|---|---|---|---|
| gemma-4-31b | bartowski/google_gemma-4-31B-it-GGUF | 31B | 22.61 GB | 1452 | Google's best medium model |
| qwen3.5-27b | bartowski/Qwen_Qwen3.5-27B-GGUF | 27B | 19.95 GB | 1437 | Qwen's best medium model |
| ministral-3-14b-instruct-2512 | bartowski/mistralai_Ministral-3-14B-Instruct-2512-GGUF | 14B | 9.62 GB | 1410 | Mistral AI's best small model |
| deepseek-r1-distill-qwen-14b | bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF | 14B | 10.5 GB | 1375 | Deepseek's best small thinking model |
| llama-3.1-8b | bartowski/Meta-Llama-3.1-8B-Instruct-GGUF | 8B | 5.73 GB | 1211 | Meta's best small model |
> **Note**
> Values with + are minimum estimates from previous versions of the model due to missing data.
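For example, to switch to the Qwen model from the table at q5_k_m quantization, you could change the variable in docker-compose.yml and restart the service (a sketch; it assumes that quant is published in the repository, which the Q5_K_M size column suggests):

```yaml
environment:
  LLAMA_ARG_HF_REPO: bartowski/Qwen_Qwen3.5-27B-GGUF:q5_k_m
```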