{"slug":"llama-cpp","title":"llama.cpp","description":"Wiki for llama.cpp setup, inference, quantization, serving, and troubleshooting.","tags":["llama.cpp","inference","quantization"],"category":"inference","scope":{"covers":"llama.cpp build and setup, inference, GGUF quantization, the server, grammars and function-calling, and troubleshooting.","notCovered":"Master changes after the date below and hardware-specific benchmarks — use web search.","currentAs":"2026-05-30 (master (~2026-05))"},"lastUpdated":"2026-05-30","documentCount":63,"raw_base":"/raw/llama-cpp/","html_base":"/wiki/llama-cpp/","documents":[{"path":"README.md","title":"llama.cpp Knowledge Base","type":null,"updated":null},{"path":"wiki/index.md","title":"llama.cpp KB — Master Index","type":"index","updated":"2026-05-30"},{"path":"wiki/concepts/build-and-backends.md","title":"Build and Backends","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/cli-and-tools-reference.md","title":"CLI & Tools Reference — llama-cli, quantize, bench, imatrix","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/embeddings.md","title":"Embeddings","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/function-calling.md","title":"Function Calling (tool calls)","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/gbnf-grammars.md","title":"GBNF Grammars","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/gguf-format.md","title":"GGUF Format","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/imatrix.md","title":"Importance Matrix (imatrix)","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/kv-cache-and-context.md","title":"KV Cache and Context","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/multimodal-mtmd.md","title":"Multimodal (mtmd)","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/quantization.md","title":"Quantization","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/sampling-parameters.md","title":"Sampling Parameters","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/server-api.md","title":"Server API (llama-server REST endpoints)","type":"concept","updated":"2026-05-30"},{"path":"wiki/concepts/speculative-decoding.md","title":"Speculative Decoding","type":"concept","updated":"2026-05-30"},{"path":"wiki/entities/backend-cpu.md","title":"CPU Backend","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/backend-cuda.md","title":"CUDA Backend","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/backend-metal.md","title":"Metal Backend","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/backend-rocm.md","title":"ROCm / HIP Backend","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/backend-vulkan.md","title":"Vulkan Backend","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-imatrix.md","title":"llama-imatrix (binary)","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-llama-bench.md","title":"llama-bench","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-llama-cli.md","title":"llama-cli","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-llama-quantize.md","title":"llama-quantize (binary)","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-llama-server.md","title":"llama-server (binary)","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/binary-mtmd.md","title":"llama-mtmd-cli","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/ggml.md","title":"ggml (Tensor Library)","type":"entity","updated":"2026-05-30"},{"path":"wiki/entities/project-llama-cpp.md","title":"llama.cpp (Project)","type":"entity","updated":"2026-05-30"},{"path":"wiki/log.md","title":"Activity Log","type":"log","updated":null},{"path":"wiki/summaries/cli-readme.md","title":"llama-cli — Usage & Parameters Reference","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-artefact2-quant-table.md","title":"Artefact2's canonical GGUF quant KL-divergence / PPL / bpw table (Mistral-7B)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-arxiv-quant-eval.md","title":"Which Quantization Should I Use? Unified llama.cpp Eval on Llama-3.1-8B-Instruct","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-bartowski-quant-guide.md","title":"Bartowski's 'Which file should I choose?' quant decision guide (Qwen3-8B GGUF)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-bench-apple-silicon.md","title":"Apple Silicon llama-bench Scoreboard (M1–M5, LLaMA 7B, F16/Q8_0/Q4_0)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-bench-nvidia-cuda.md","title":"NVIDIA CUDA llama-bench Scoreboard (Llama 2 7B Q4_0, pp512 vs tg128, FA on/off)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-benches-catalog.md","title":"Community Benchmarks & Quant Guides — Catalog","type":"summary","updated":"2026-06-09"},{"path":"wiki/summaries/community-dgxspark-kv-quant.md","title":"DGX Spark KV-Quant Benchmarks (Mar 2026, build 8399): q4_0 used MORE memory + ~92% slower — use q8_0","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-gh15180-vllm-vs-llamacpp.md","title":"GH #15180: llama.cpp vs vLLM Head-to-Head (RTX 4090, Qwen2.5-3B)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-hf-gguf-usage.md","title":"HF Doc: GGUF usage with llama.cpp (-hf repo-pull, install, server /v1)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-kaitchup-gguf-guide.md","title":"Choosing a GGUF Model: K-Quants, I-Quants, Legacy Formats (Kaitchup)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-mradermacher-imatrix.md","title":"mradermacher i1/imatrix quant card (Phi-4-reasoning-plus): static vs weighted quants","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-pr1684-kquants.md","title":"k-quants PR #1684 — Origin of K-Quant Perplexity Tables (LLaMA-7B)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-redhat-vllm-vs-llamacpp.md","title":"Red Hat: vLLM or llama.cpp - Choosing the Right Inference Engine","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-smcleod-kv-quant.md","title":"KV-Cache Quantisation Quality & VRAM (Sam McLeod, Dec 2024): q8_0 near-lossless","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-steelphoenix-guide.md","title":"Community Guide: SteelPh0enix — llama.cpp from scratch (build, quantize, run)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/community-unsloth-dynamic-ggufs.md","title":"Unsloth Dynamic 2.0 GGUFs — layerwise quant + self-reported KL/MMLU benchmarks (VENDOR)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/docs-build.md","title":"Building llama.cpp Locally (Backend Matrix)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/docs-function-calling.md","title":"Function Calling in llama.cpp","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/docs-install.md","title":"Installing Pre-built llama.cpp (Package Managers)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/docs-multimodal.md","title":"Multimodal Input in llama.cpp (libmtmd) — Models & Usage","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/gguf-spec.md","title":"GGUF File Format Specification","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/grammars-readme.md","title":"GBNF Guide — Constraining Output with Grammars","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/imatrix-readme.md","title":"llama-imatrix Tool README","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/llama-bench-readme.md","title":"llama-bench — Performance Benchmarking Tool","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/llamacpp-readme.md","title":"llama.cpp Project README (Overview)","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/mtmd-readme.md","title":"Multimodal Support Directory (libmtmd) — Architecture & History","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/quantize-readme.md","title":"llama-quantize Tool README","type":"summary","updated":"2026-05-30"},{"path":"wiki/summaries/server-readme.md","title":"llama-server: HTTP Server README","type":"summary","updated":"2026-05-30"},{"path":"wiki/syntheses/customization-and-tuning.md","title":"Customization & Tuning — The llama.cpp Knobs","type":"synthesis","updated":"2026-05-30"},{"path":"wiki/syntheses/llamacpp-vs-ollama.md","title":"llama.cpp vs Ollama","type":"synthesis","updated":"2026-05-30"},{"path":"wiki/syntheses/llamacpp-vs-vllm.md","title":"llama.cpp vs vLLM","type":"synthesis","updated":"2026-05-30"},{"path":"wiki/syntheses/quant-types-compared.md","title":"Quant Types Compared — Which GGUF Quant Should You Pick?","type":"synthesis","updated":"2026-05-30"},{"path":"wiki/syntheses/server-deployment.md","title":"Deploying llama-server as an OpenAI-Compatible API","type":"synthesis","updated":"2026-05-30"}]}