# llama.cpp

> Wiki for llama.cpp setup, inference, quantization, serving, and troubleshooting.
>
> Covers: llama.cpp build and setup, inference, GGUF quantization, the server, grammars and function-calling, and troubleshooting.
>
> Not covered: Master changes after the date below and hardware-specific benchmarks; use web search for those.
>
> Current as of: 2026-05-30 (master, ~2026-05)

- [llama.cpp Knowledge Base](/raw/llama-cpp/README.md)
- [llama.cpp KB — Master Index](/raw/llama-cpp/wiki/index.md)
- [Build and Backends](/raw/llama-cpp/wiki/concepts/build-and-backends.md)
- [CLI & Tools Reference — llama-cli, quantize, bench, imatrix](/raw/llama-cpp/wiki/concepts/cli-and-tools-reference.md)
- [Embeddings](/raw/llama-cpp/wiki/concepts/embeddings.md)
- [Function Calling (tool calls)](/raw/llama-cpp/wiki/concepts/function-calling.md)
- [GBNF Grammars](/raw/llama-cpp/wiki/concepts/gbnf-grammars.md)
- [GGUF Format](/raw/llama-cpp/wiki/concepts/gguf-format.md)
- [Importance Matrix (imatrix)](/raw/llama-cpp/wiki/concepts/imatrix.md)
- [KV Cache and Context](/raw/llama-cpp/wiki/concepts/kv-cache-and-context.md)
- [Multimodal (mtmd)](/raw/llama-cpp/wiki/concepts/multimodal-mtmd.md)
- [Quantization](/raw/llama-cpp/wiki/concepts/quantization.md)
- [Sampling Parameters](/raw/llama-cpp/wiki/concepts/sampling-parameters.md)
- [Server API (llama-server REST endpoints)](/raw/llama-cpp/wiki/concepts/server-api.md)
- [Speculative Decoding](/raw/llama-cpp/wiki/concepts/speculative-decoding.md)
- [CPU Backend](/raw/llama-cpp/wiki/entities/backend-cpu.md)
- [CUDA Backend](/raw/llama-cpp/wiki/entities/backend-cuda.md)
- [Metal Backend](/raw/llama-cpp/wiki/entities/backend-metal.md)
- [ROCm / HIP Backend](/raw/llama-cpp/wiki/entities/backend-rocm.md)
- [Vulkan Backend](/raw/llama-cpp/wiki/entities/backend-vulkan.md)
- [llama-imatrix (binary)](/raw/llama-cpp/wiki/entities/binary-imatrix.md)
- [llama-bench](/raw/llama-cpp/wiki/entities/binary-llama-bench.md)
- [llama-cli](/raw/llama-cpp/wiki/entities/binary-llama-cli.md)
- [llama-quantize (binary)](/raw/llama-cpp/wiki/entities/binary-llama-quantize.md)
- [llama-server (binary)](/raw/llama-cpp/wiki/entities/binary-llama-server.md)
- [llama-mtmd-cli](/raw/llama-cpp/wiki/entities/binary-mtmd.md)
- [ggml (Tensor Library)](/raw/llama-cpp/wiki/entities/ggml.md)
- [llama.cpp (Project)](/raw/llama-cpp/wiki/entities/project-llama-cpp.md)
- [Activity Log](/raw/llama-cpp/wiki/log.md)
- [llama-cli — Usage & Parameters Reference](/raw/llama-cpp/wiki/summaries/cli-readme.md)
- [Artefact2's canonical GGUF quant KL-divergence / PPL / bpw table (Mistral-7B)](/raw/llama-cpp/wiki/summaries/community-artefact2-quant-table.md)
- [Which Quantization Should I Use? Unified llama.cpp Eval on Llama-3.1-8B-Instruct](/raw/llama-cpp/wiki/summaries/community-arxiv-quant-eval.md)
- [Bartowski's 'Which file should I choose?' quant decision guide (Qwen3-8B GGUF)](/raw/llama-cpp/wiki/summaries/community-bartowski-quant-guide.md)
- [Apple Silicon llama-bench Scoreboard (M1–M5, LLaMA 7B, F16/Q8_0/Q4_0)](/raw/llama-cpp/wiki/summaries/community-bench-apple-silicon.md)
- [NVIDIA CUDA llama-bench Scoreboard (Llama 2 7B Q4_0, pp512 vs tg128, FA on/off)](/raw/llama-cpp/wiki/summaries/community-bench-nvidia-cuda.md)
- [Community Benchmarks & Quant Guides — Catalog](/raw/llama-cpp/wiki/summaries/community-benches-catalog.md)
- [DGX Spark KV-Quant Benchmarks (Mar 2026, build 8399): q4_0 used MORE memory + ~92% slower — use q8_0](/raw/llama-cpp/wiki/summaries/community-dgxspark-kv-quant.md)
- [GH #15180: llama.cpp vs vLLM Head-to-Head (RTX 4090, Qwen2.5-3B)](/raw/llama-cpp/wiki/summaries/community-gh15180-vllm-vs-llamacpp.md)
- [HF Doc: GGUF usage with llama.cpp (-hf repo-pull, install, server /v1)](/raw/llama-cpp/wiki/summaries/community-hf-gguf-usage.md)
- [Choosing a GGUF Model: K-Quants, I-Quants, Legacy Formats (Kaitchup)](/raw/llama-cpp/wiki/summaries/community-kaitchup-gguf-guide.md)
- [mradermacher i1/imatrix quant card (Phi-4-reasoning-plus): static vs weighted quants](/raw/llama-cpp/wiki/summaries/community-mradermacher-imatrix.md)
- [k-quants PR #1684 — Origin of K-Quant Perplexity Tables (LLaMA-7B)](/raw/llama-cpp/wiki/summaries/community-pr1684-kquants.md)
- [Red Hat: vLLM or llama.cpp - Choosing the Right Inference Engine](/raw/llama-cpp/wiki/summaries/community-redhat-vllm-vs-llamacpp.md)
- [KV-Cache Quantisation Quality & VRAM (Sam McLeod, Dec 2024): q8_0 near-lossless](/raw/llama-cpp/wiki/summaries/community-smcleod-kv-quant.md)
- [Community Guide: SteelPh0enix — llama.cpp from scratch (build, quantize, run)](/raw/llama-cpp/wiki/summaries/community-steelphoenix-guide.md)
- [Unsloth Dynamic 2.0 GGUFs — layerwise quant + self-reported KL/MMLU benchmarks (VENDOR)](/raw/llama-cpp/wiki/summaries/community-unsloth-dynamic-ggufs.md)
- [Building llama.cpp Locally (Backend Matrix)](/raw/llama-cpp/wiki/summaries/docs-build.md)
- [Function Calling in llama.cpp](/raw/llama-cpp/wiki/summaries/docs-function-calling.md)
- [Installing Pre-built llama.cpp (Package Managers)](/raw/llama-cpp/wiki/summaries/docs-install.md)
- [Multimodal Input in llama.cpp (libmtmd) — Models & Usage](/raw/llama-cpp/wiki/summaries/docs-multimodal.md)
- [GGUF File Format Specification](/raw/llama-cpp/wiki/summaries/gguf-spec.md)
- [GBNF Guide — Constraining Output with Grammars](/raw/llama-cpp/wiki/summaries/grammars-readme.md)
- [llama-imatrix Tool README](/raw/llama-cpp/wiki/summaries/imatrix-readme.md)
- [llama-bench — Performance Benchmarking Tool](/raw/llama-cpp/wiki/summaries/llama-bench-readme.md)
- [llama.cpp Project README (Overview)](/raw/llama-cpp/wiki/summaries/llamacpp-readme.md)
- [Multimodal Support Directory (libmtmd) — Architecture & History](/raw/llama-cpp/wiki/summaries/mtmd-readme.md)
- [llama-quantize Tool README](/raw/llama-cpp/wiki/summaries/quantize-readme.md)
- [llama-server: HTTP Server README](/raw/llama-cpp/wiki/summaries/server-readme.md)
- [Customization & Tuning — The llama.cpp Knobs](/raw/llama-cpp/wiki/syntheses/customization-and-tuning.md)
- [llama.cpp vs Ollama](/raw/llama-cpp/wiki/syntheses/llamacpp-vs-ollama.md)
- [llama.cpp vs vLLM](/raw/llama-cpp/wiki/syntheses/llamacpp-vs-vllm.md)
- [Quant Types Compared — Which GGUF Quant Should You Pick?](/raw/llama-cpp/wiki/syntheses/quant-types-compared.md)
- [Deploying llama-server as an OpenAI-Compatible API](/raw/llama-cpp/wiki/syntheses/server-deployment.md)