# llama.cpp — full corpus

# llama.cpp Knowledge Base

LLM-maintained research KB on [llama.cpp](https://github.com/ggml-org/llama.cpp) — the C/C++ engine for running LLMs locally. Used as the research backbone for YouTube videos (tutorials, benchmarks, deep dives).

## Structure

- `raw/` — immutable source documents (doc mirrors, transcripts, discussion/PR dumps, benchmark logs)
- `wiki/` — synthesized knowledge (summaries, concepts, entities, syntheses)

Schema and maintenance rules: see `CLAUDE.md`.

## Usage

- **Add new sources:** drop them into `raw/` and ask the LLM to "ingest" them
- **Ask questions:** the LLM reads the wiki to synthesize answers with links
- **Draft video modules:** ask "draft module on " to produce slide + script outlines sourced from the wiki

## Version tracking

llama.cpp ships rolling builds (`b####`) rather than semver releases. Each wiki page records the latest build tag it was verified against in its `llama_build` frontmatter field.

Latest verified llama.cpp build: **(none yet — scaffold)**

Based on the [llm-wiki](https://github.com/tonbistudio/llm-wiki) template / Karpathy's "LLM Wiki" pattern.

---
title: "llama.cpp KB — Master Index"
type: index
updated: 2026-05-30
llama_build: "master (~2026-05)"
---

# llama.cpp KB — Master Index

**Latest verified llama.cpp build:** master snapshot ~2026-05 (official sources pulled from `master`, not pinned to a `b####` tag)

**KB pages:** 58 (28 summaries + 12 concepts + 13 entities + 5 syntheses)

Two source tiers: **official** first-party docs (`ggml-org`, `confidence: high`) and **community** sources (blogs, HF cards, GitHub discussions, benchmarks — `confidence: medium`, dated, stored as markdown-converted mirrors under `raw/community/`).

## Summaries ✅ (28/28)

### Official docs (13)

- [[summaries/llamacpp-readme]] — main repo README: purpose, features, model & backend support
- [[summaries/docs-build]] — full build guide + backend matrix
- [[summaries/docs-install]] — package-manager install methods
- [[summaries/server-readme]] — `llama-server` + OpenAI/Anthropic-compatible HTTP API
- [[summaries/cli-readme]] — `llama-cli` flags and modes
- [[summaries/quantize-readme]] — `llama-quantize` + quant type table
- [[summaries/imatrix-readme]] — importance-matrix generation
- [[summaries/llama-bench-readme]] — `llama-bench` throughput benchmarking
- [[summaries/grammars-readme]] — GBNF grammar syntax + structured output
- [[summaries/docs-function-calling]] — tool/function calling
- [[summaries/docs-multimodal]] — multimodal overview
- [[summaries/mtmd-readme]] — `llama-mtmd-cli` / libmtmd multimodal
- [[summaries/gguf-spec]] — GGUF file-format specification

### Community — quantization (7)

- [[summaries/community-pr1684-kquants]] — k-quants PR #1684 (ikawrakow): origin perplexity tables, "Q6_K ≈ lossless"
- [[summaries/community-artefact2-quant-table]] — the canonical KL-divergence-per-quant table (Mistral-7B)
- [[summaries/community-arxiv-quant-eval]] — arXiv 2601.14277: unified quant eval on Llama-3.1-8B
- [[summaries/community-kaitchup-gguf-guide]] — "Choosing a GGUF Model" taxonomy (Oct 2025)
- [[summaries/community-bartowski-quant-guide]] — bartowski's "Which file should I choose?"
decision tree
- [[summaries/community-mradermacher-imatrix]] — static vs imatrix (i1) quant framing
- [[summaries/community-unsloth-dynamic-ggufs]] — Unsloth Dynamic 2.0 (vendor; self-reported)

### Community — benchmarks (4)

- [[summaries/community-bench-apple-silicon]] — Apple Silicon llama-bench table (GH #4167)
- [[summaries/community-bench-nvidia-cuda]] — NVIDIA CUDA llama-bench table (GH #15013)
- [[summaries/community-smcleod-kv-quant]] — KV-cache quantization quality (smcleod)
- [[summaries/community-dgxspark-kv-quant]] — DGX Spark KV-quant benchmark (conflicting finding)

### Community — comparisons & guides (4)

- [[summaries/community-redhat-vllm-vs-llamacpp]] — vLLM vs llama.cpp (Red Hat; vendor-biased)
- [[summaries/community-gh15180-vllm-vs-llamacpp]] — vLLM vs llama.cpp (fair single-GPU test)
- [[summaries/community-steelphoenix-guide]] — long-form build/quantize/run guide (HN front page)
- [[summaries/community-hf-gguf-usage]] — HF official "GGUF usage with llama.cpp"

## Concepts ✅ (12/12)

### Model format & quantization

- [[concepts/gguf-format]] — the GGUF single-file model container + metadata spec
- [[concepts/quantization]] — reduced-precision weight formats (Q4_K_M, IQ-series, etc.)
- [[concepts/imatrix]] — importance matrix for quality-preserving low-bit quant

### Inference & generation

- [[concepts/sampling-parameters]] — the sampler chain and its defaults
- [[concepts/kv-cache-and-context]] — context length, KV cache types, RoPE/YaRN
- [[concepts/speculative-decoding]] — draft-model / n-gram / Eagle3 / MTP speedups
- [[concepts/gbnf-grammars]] — GBNF constrained / structured output
- [[concepts/function-calling]] — OpenAI-style tool calling
- [[concepts/embeddings]] — embedding & reranking serving
- [[concepts/multimodal-mtmd]] — image/audio input via the mtmd library

### Serving & build

- [[concepts/server-api]] — `llama-server` REST surface (native + OpenAI + Anthropic)
- [[concepts/build-and-backends]] — CMake build flow + the full backend matrix

## Entities ✅ (13/13)

### The project

- [[entities/project-llama-cpp]] — llama.cpp itself (ggml-org / Georgi Gerganov)
- [[entities/ggml]] — the C tensor library llama.cpp is built on (the "engine under the engine"; GGUF is its format)

### Binaries / tools

- [[entities/binary-llama-cli]] — interactive / one-shot inference CLI
- [[entities/binary-llama-server]] — HTTP server with OpenAI-compatible API
- [[entities/binary-llama-quantize]] — GGUF quantizer
- [[entities/binary-imatrix]] — importance-matrix generator
- [[entities/binary-llama-bench]] — throughput benchmark
- [[entities/binary-mtmd]] — `llama-mtmd-cli` multimodal tool

### Backends

- [[entities/backend-cpu]] — x86 / ARM / RISC-V (+ BLAS, Accelerate, KleidiAI, ZenDNN)
- [[entities/backend-cuda]] — NVIDIA GPUs (`-DGGML_CUDA=ON`)
- [[entities/backend-metal]] — Apple Silicon (default on macOS)
- [[entities/backend-vulkan]] — cross-vendor GPU (`-DGGML_VULKAN=ON`)
- [[entities/backend-rocm]] — AMD GPUs via HIP (`-DGGML_HIP=ON`)

_Backends present in source but not yet given pages: SYCL (Intel), MUSA (Moore Threads), CANN (Ascend), OpenCL (Adreno), RPC, ZenDNN, WebGPU, BLIS, OpenVINO, IBM zDNN, Hexagon.
Covered in the matrix on [[concepts/build-and-backends]]._

## Syntheses ✅ (5/5)

- [[syntheses/quant-types-compared]] — which GGUF quant to pick: bits/weight, KL/perplexity, decision tree
- [[syntheses/llamacpp-vs-ollama]] — the wrapper relationship (Ollama runs llama.cpp) + when to use which
- [[syntheses/llamacpp-vs-vllm]] — batched GPU serving vs portable quantized inference (not apples-to-apples)
- [[syntheses/server-deployment]] — `llama-server` as a drop-in OpenAI API (local / source / Docker)
- [[syntheses/customization-and-tuning]] — every llama.cpp knob in one map (sampling, KV, grammars, spec-decoding, backends)

_Still planned:_ `backend-selection-guide` / a dedicated benchmark synthesis (benchmark summaries are ingested; the user deprioritized a standalone perf synthesis for now).

## Presentations ✅ (4)

- [[presentations/standalone-llamacpp-explainer-outline]] — standalone "what is llama.cpp" video (14–20 min): what it is → vs other servers → customization/flags → ways to serve → head-to-head test vs Ollama (memory/speed). Includes a fair-test methodology + pre-record checklist.
- `presentations/standalone-llamacpp-explainer-slides.html` — self-contained slide deck for the informational beats of that outline (22 slides, themed to the llama.cpp README palette: charcoal `#1e2228` + orange `#f0883e`). The customization segment is expanded into a per-family deep-dive (map → sampling → structured output → tool calling → context/memory → speed → hardware → "measure" closer). Open in a browser; ← → / Space to navigate, F for fullscreen. Terminal/install demos are intentionally left to live screen capture.
- [[presentations/standalone-llamacpp-explainer-script-slides]] — voiceover script for the 22 deck slides (record in one pass).
- [[presentations/standalone-llamacpp-explainer-script-demos]] — runbook for the 4 live terminal segments (install, customization, serving, the Ollama head-to-head), with exact commands + talk track + fallbacks.
Both scripts share one running order.

Ask "draft module on " to generate more.

## Known caveats (read before citing)

- **Not pinned to a build.** Sources are from `master`; fast-moving specifics (CLI flag names, server API fields, sampler defaults, multimodal model roster) should be re-verified against a tagged `b####` build before going on camera. Each page carries `llama_build: "master (~2026-05)"`.
- **Docs can lag code.** Noted inline where spotted — e.g. the GGUF spec's `general.file_type` enum is stale; `function-calling` doc has a self-noted TODO; multimodal is "under heavy development."
- **One known doc inconsistency:** server `/completion` default `repeat_penalty` is `1.1` while CLI `--repeat-penalty` default is `1.00` (see [[concepts/sampling-parameters]]).

### Community-source caveats

- **Benchmark numbers are point-in-time + hardware-specific** — always cited with their hardware/build/quant/context. Don't generalize a single number.
- **A real, documented conflict:** smcleod (Dec 2024) says q4_0 KV cache saves ~66% VRAM; the DGX Spark forum (Mar 2026, build 8399) found q4_0 KV used *more* memory than f16 at 64K context. Reconciled in [[syntheses/customization-and-tuning]] — it's platform/build-specific; **q8_0 is the safe default** both endorse.
- **Vendor bias flagged inline:** Unsloth (Dynamic-2.0 quant wins) and Red Hat (pro-vLLM) are self-interested — claims marked as such.
- **Reddit (r/LocalLLaMA) could not be fetched** by the tooling; no threads were fabricated. Paste specific threads if you want them ingested.
- **No dedicated Ollama source was mirrored** — [[syntheses/llamacpp-vs-ollama]] rests on the (well-established) wrapper relationship + community consensus, not a benchmarked source. Worth adding one later.
- **Two stale-but-canonical references:** the Artefact2 KL table (Feb 2024, Mistral-7B) and the SteelPh0enix guide (late 2024) are widely cited but predate newer quants/flags — verify specifics against current sources.

## Operator ring (added 2026-06-09)

- [[concepts/cli-and-tools-reference]] — llama-cli, llama-quantize (ppl/kld), llama-bench, llama-imatrix
- [[summaries/community-benches-catalog]] — map of the 15 community benchmarks/quant guides

---
title: "Build and Backends"
type: concept
tags: [build, cpu, cuda, metal, vulkan, rocm, sycl, musa, cann, blas, foundational, developer]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build]
confidence: high
llama_build: "master (~2026-05)"
---

# Build and Backends

## Definition

llama.cpp is built with **CMake**, a cross-platform build-system generator. The same C/C++ source compiles for many different hardware **backends** (the code paths that run inference on a specific kind of processor, such as a CPU or a particular vendor's GPU). A key property of the project is that you can compile several backends into a single binary and then pick which one to use at runtime.

## How It Works

The canonical build flow is two commands:

```
cmake -B build
cmake --build build --config Release
```

The first command configures the build in a `build/` directory; the second compiles it. For faster builds you can add `-j 8` (parallel jobs) or use the Ninja generator, and `ccache` speeds up rebuilds.

Build-type notes:

- **Debug build:** `-DCMAKE_BUILD_TYPE=Debug` for single-config generators, or `--config Debug` for multi-config generators.
- **Static libraries:** `-DBUILD_SHARED_LIBS=OFF`.

You enable a backend by passing its CMake flag at configure time. Several backends can be combined in one build, for example `-DGGML_CUDA=ON -DGGML_VULKAN=ON`.

At runtime you select the device:

- `--device` / `--list-devices` to choose or list devices.
- `--device none` or `-ngl 0` to disable the GPU.
- `GGML_BACKEND_DL` (a build-time CMake option, not a runtime flag) enables dynamic backend loading (loading backends as plugins).

> Note: **ROCm** is AMD's SDK, while **HIP** is the build flag. The older `-DGGML_HIPBLAS` flag is superseded by `-DGGML_HIP=ON`.
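
To make the flag-combining step concrete, the sketch below assembles the two-step CMake invocation for a chosen set of backends. It is a hypothetical helper, not part of llama.cpp: the names `BACKEND_FLAGS` and `cmake_commands` and the default of 8 parallel jobs are assumptions made for illustration. Only the flags themselves come from this page. It builds the command lines without running them.

```python
import shlex

# Configure-time flags as given on this page; the table itself is illustrative.
BACKEND_FLAGS = {
    "cuda": "-DGGML_CUDA=ON",
    "vulkan": "-DGGML_VULKAN=ON",
    "hip": "-DGGML_HIP=ON",
}


def cmake_commands(backends, build_dir="build", jobs=8):
    """Return (configure, build) argument lists for the requested backends.

    Hypothetical helper: it only composes the commands, it does not execute them.
    """
    unknown = [b for b in backends if b not in BACKEND_FLAGS]
    if unknown:
        raise ValueError(f"no flag known for: {unknown}")
    # Step 1: configure into build_dir with one flag per backend.
    configure = ["cmake", "-B", build_dir] + [BACKEND_FLAGS[b] for b in backends]
    # Step 2: compile a Release build with parallel jobs.
    build = ["cmake", "--build", build_dir, "--config", "Release", "-j", str(jobs)]
    return configure, build


if __name__ == "__main__":
    for cmd in cmake_commands(["cuda", "vulkan"]):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists (rather than one shell string) makes them safe to pass to `subprocess.run` later without quoting issues.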

## Key Parameters

The backend matrix below shows the primary backends and their enable flags. SYCL, MUSA, and CANN are present with the flags shown, but their detailed documentation lives in `docs/backend/*.md` files that are not yet ingested here (confidence high for the flag, medium for deeper detail).

| Backend | Hardware | CMake flag |
|---|---|---|
| CPU | x86 (AVX/AVX2/AVX512/AMX), ARM (NEON), RISC-V (RVV) | default (no flag) |
| CUDA | NVIDIA GPU | `-DGGML_CUDA=ON` |
| Metal | Apple Silicon GPU | default-on macOS (`-DGGML_METAL=OFF` to disable) |
| Vulkan | Cross-vendor GPU (AMD/NVIDIA/Intel) | `-DGGML_VULKAN=ON` |
| ROCm / HIP | AMD GPU | `-DGGML_HIP=ON` |
| SYCL | Intel GPU | `-DGGML_SYCL=ON` |
| MUSA | Moore Threads GPU | `-DGGML_MUSA=ON` |
| CANN | Ascend NPU | `-DGGML_CANN=on` |
| BLAS | CPU acceleration | `-DGGML_BLAS=ON` (with `-DGGML_BLAS_VENDOR=...`) |
| ZenDNN | AMD CPU | `-DGGML_ZENDNN=ON` |
| KleidiAI | Arm CPU | `-DGGML_CPU_KLEIDIAI=ON` |
| OpenCL | Adreno GPU | `-DGGML_OPENCL=ON` |
| WebGPU | WebGPU | `-DGGML_WEBGPU=ON` |
| RPC | Remote backends | `-DGGML_RPC=ON` |

The full backend list that llama.cpp supports (from the README table) also includes: BLIS, OpenVINO [in progress] (Intel CPU/GPU/NPU), IBM zDNN (IBM Z), Hexagon [in progress] (Snapdragon), and VirtGPU.

## When To Use

Build from source whenever you need a backend or option not provided by a prebuilt install, when you want to combine multiple backends in one binary, or when you need to tune for specific hardware. For prebuilt convenience, install methods exist instead (see [[entities/project-llama-cpp]]).

## Risks & Pitfalls

- Forgetting `--config Release` (on multi-config generators) yields slow debug builds.
- Mixing up ROCm (the SDK) and HIP (the flag); use `-DGGML_HIP=ON`, not the deprecated `-DGGML_HIPBLAS`.
- Combining GPU backends on platforms with conflicts (for example Vulkan on macOS requires `-DGGML_METAL=OFF`).

## Related Concepts

- [[entities/project-llama-cpp]]
- [[entities/backend-cpu]]
- [[entities/backend-cuda]]
- [[entities/backend-metal]]
- [[entities/backend-vulkan]]
- [[entities/backend-rocm]]

## Sources

- docs-build

---
title: "CLI & Tools Reference — llama-cli, quantize, bench, imatrix"
type: concept
tags: [cli, tools, quantize, bench, imatrix]
updated: 2026-06-09
confidence: high
sources: [raw/cli-readme.md, raw/quantize-readme.md, raw/llama-bench-readme.md, raw/imatrix-readme.md]
---

# CLI & Tools Reference

The four core command-line tools (each tool's full auto-generated flag table lives in its cited raw README).

## llama-cli

The main inference CLI. Common params include `-h/--help/--usage`, `--version` (build info), `-cl/--cache-list` (models in cache) — plus the full sampling/context/model parameter set (auto-generated table in `raw/cli-readme.md`).

## llama-quantize

Converts a high-precision GGUF (F32/BF16) to a quantized format:

```
./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]
```

Accuracy loss is measured in **perplexity (ppl)** and/or **KL-divergence (kld)** — and can be minimized with a suitable **imatrix** file (below). No-setup alternative: the GGUF-my-repo HF space (synced from llama.cpp main every 6 hours).

## llama-bench

The performance-testing tool: text generation vs prompt processing across models, batch sizes, thread counts, and GPU offload layer counts; multiple output formats for comparisons. See [[summaries/community-benches-catalog]] for measured results on common hardware.

## llama-imatrix

Computes an **importance matrix** from a model + text dataset, used during quantization to improve quantized quality (PR #4861):

```
./llama-imatrix -m model.gguf -f some-text.txt [-o imatrix.gguf] [--output-format {gguf,dat}] ...
```

Supports chunking, merging previous matrices (`--in-file ...`), and `--show-statistics`. Pairs with [[concepts/imatrix]] (the concept) and [[concepts/quantization]].
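
The two tools above are typically chained: compute the matrix first, then hand it to the quantizer. The sketch below only composes those two command lines from the usage strings on this page, plus the `--imatrix` option of `llama-quantize` described on the imatrix concept page. The function name, the default file names, and the choice of `Q4_K_M` are illustrative assumptions, and nothing is executed.

```python
import shlex


def imatrix_then_quantize(model, calibration_text, quant_type,
                          imatrix_path="imatrix.gguf", output=None):
    """Compose (imatrix_cmd, quantize_cmd) as argument lists.

    Hypothetical helper; paths and defaults are placeholders.
    """
    output = output or f"model-{quant_type}.gguf"
    # Step 1: gather importance statistics over the calibration text.
    imatrix_cmd = ["./llama-imatrix", "-m", model,
                   "-f", calibration_text, "-o", imatrix_path]
    # Step 2: quantize, passing the matrix as an option before the positionals
    # (input, output, type), matching the usage line above.
    quantize_cmd = ["./llama-quantize", "--imatrix", imatrix_path,
                    model, output, quant_type]
    return imatrix_cmd, quantize_cmd


if __name__ == "__main__":
    for cmd in imatrix_then_quantize("model-f16.gguf", "calibration.txt", "Q4_K_M"):
        print(shlex.join(cmd))
```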

---
title: "Embeddings"
type: concept
tags: [embeddings, server-api, api, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Embeddings

## Definition

**Embeddings** are fixed-length numeric vectors that represent a piece of text. Texts with similar meaning produce vectors that are close together, which makes embeddings the foundation for **retrieval**, **RAG** (retrieval-augmented generation, where relevant documents are fetched and added to a prompt), and **similarity** search. [[binary-llama-server|llama-server]] can produce embeddings when loaded with an embedding model.

## How It Works

Run the server in embedding mode and it loads an embedding model instead of (or as) a generative one, then returns a vector for each input. A transformer produces one vector per token; a **pooling** step combines those per-token vectors into a single fixed-length vector for the whole text.

The pooling strategy is chosen with `--pooling`:

- `none` — no pooling; return per-token vectors
- `mean` — average of token vectors
- `cls` — use the special classification token's vector
- `last` — use the last token's vector
- `rank` — used for reranking (see below)

Vectors are normalized according to `--embd-normalize N` (default `2`, meaning L2 normalization — scaling the vector to unit length so similarity comparisons are consistent).

Two endpoints serve embeddings:

| Endpoint | Family | Pooling requirement | Output |
|---|---|---|---|
| `/v1/embeddings` | OpenAI-compatible | pooling != `none` | L2-normalized vectors |
| `/embeddings` | Native | supports `--pooling none` | per-token, unnormalized when pooling is `none` |

Request fields include `input` / `content`, `embd_normalize`, and `encoding_format`.

### Reranking

**Reranking** scores how relevant each candidate document is to a query (used to reorder search results).
Run with `--pooling rank` and post to `/reranking` (also reachable as `/v1/rerank`).

## Key Parameters

| Flag / Field | Default | Meaning |
|---|---|---|
| `--embedding` / `--embeddings` | — | Put the server in embed-only mode |
| `--pooling {none,mean,cls,last,rank}` | — | How per-token vectors are combined |
| `--embd-normalize N` | `2` (L2) | Vector normalization mode |
| `input` / `content` | — | Text to embed (request body) |
| `embd_normalize` | — | Per-request normalization override |
| `encoding_format` | — | Output encoding of the vectors |

## When To Use

- Building a RAG pipeline or semantic search index.
- Measuring similarity or clustering text.
- Reranking retrieved documents against a query (`--pooling rank`).
- Use `/v1/embeddings` for OpenAI SDK compatibility; use native `/embeddings` with `--pooling none` when you need raw per-token vectors.

## Risks & Pitfalls

- `/v1/embeddings` requires a pooling mode other than `none` — requesting per-token output there will not work; use the native `/embeddings` endpoint instead.
- Mixing normalized and unnormalized vectors in the same index breaks similarity comparisons; keep `--embd-normalize` consistent.
- An embedding model is not a chat model; embed-only mode does not generate text.

## Related Concepts

- [[server-api]] — the endpoints that serve embeddings and reranking
- [[binary-llama-server]] — the binary that runs them
- [[gguf-format]] — the format the embedding model is loaded from

## Sources

- server-readme

---
title: "Function Calling (tool calls)"
type: concept
tags: [function-calling, server-api, api, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/docs-function-calling.md, raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Function Calling (tool calls)

## Definition

**Function calling** (also called **tool calling**) lets a model emit structured `tool_calls` instead of, or in addition to, plain text.
The client passes an OpenAI-style `tools` array describing functions the model may invoke; when the model decides to call one, it returns the function name and a JSON arguments object that the client executes and feeds back. In llama.cpp this is implemented in `common/chat.h` (introduced in PR #9639) and is **enabled by the `--jinja` flag** on [[binary-llama-server|llama-server]].

## How It Works

With `--jinja` active, the server applies the model's chat template (Jinja is the templating language used to format prompts) and parses the model's output back into structured tool calls. llama.cpp ships **native handlers** for the tool-calling conventions of several model families:

- Llama 3.x
- Functionary v3.1 / v3.2
- Hermes 2 / 3
- Qwen 2.5 / Qwen2.5-Coder
- Mistral Nemo
- Firefunction v2
- Command R7B
- DeepSeek R1

When no native handler matches, a **Generic fallback** is used (server logs show `Chat format: Generic`).

A typical start command:

```sh
llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
```

### Request and response shape

The request uses the standard OpenAI `tools` schema. Each tool's `function` has `name`, `description`, and `parameters` (a JSON Schema object: `{type:object, properties, required}`).
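
As a minimal sketch of that request shape, the snippet below builds a chat request body carrying one tool definition. The tool `get_weather`, its `city` parameter, and the helper names are hypothetical and exist only for this example. The `"type": "function"` wrapper follows the OpenAI tools schema. The payload is only constructed and printed here; in practice it would be POSTed to the server's `/v1/chat/completions` endpoint.

```python
import json


def make_tool(name, description, properties, required):
    """Wrap a function description in the OpenAI-style tools entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # JSON Schema object describing the arguments.
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def chat_request(user_message, tools):
    """Build a chat-completions request body that offers the given tools."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
    }


# Hypothetical tool used purely for illustration.
weather_tool = make_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"city": {"type": "string"}},
    ["city"],
)
payload = chat_request("What is the weather in Paris?", [weather_tool])
print(json.dumps(payload, indent=2))
```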

When the model invokes a tool, the response has:

- `finish_reason`: `"tool"`
- `message.tool_calls`: a list of `{name, arguments}` where `arguments` is a **JSON string**
- `message.content`: `null`

## Key Parameters

| Field / Flag | Where | Effect |
|---|---|---|
| `--jinja` | server flag | Required; enables templating and tool calling |
| `tools` | request body | OpenAI tools array (function definitions) |
| `parallel_tool_calls` | request body | `true` allows multiple tool calls at once (off by default) |
| `--chat-template-file` | server flag | Override a buggy or missing built-in template |
| `chat_template` / `chat_template_tool_use` | `/props` response | Inspect to confirm the loaded template supports tools |

To check whether the active template supports tool use, query `GET /props` and inspect `chat_template` and `chat_template_tool_use`. For models with a broken or missing template, supply your own with `--chat-template-file` (for example, DeepSeek R1 uses `models/templates/llama-cpp-deepseek-r1.jinja`).

## When To Use

- You want the model to call APIs, run code, query databases, or otherwise produce machine-actionable output.
- You are building an agent or assistant that orchestrates external tools.
- You need multiple simultaneous calls in one turn — set `parallel_tool_calls:true`.

## Risks & Pitfalls

- Extreme KV cache quantization (`-ctk q4_0`) degrades tool-calling quality.
- DeepSeek R1 native tool calling is a work in progress and the model can be reluctant to emit tool calls.
- The function-calling doc carries a TODO noting that the `minja` dependency was removed, so its model/template mapping table may be stale — verify support via `/props` rather than trusting the table.
- Tool calling only works with `--jinja`; without it, the server will not parse tool calls.
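
Because `arguments` arrives as a JSON string rather than an object, client code has to decode it before dispatching. The sketch below shows one way to do that against the response shape described on this page. The sample response is hand-written for illustration, and the fallback to a nested `function` object is an assumption added to also tolerate OpenAI-style responses.

```python
import json


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from the first choice of a response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        # This page describes flat {name, arguments} entries; OpenAI-style
        # responses nest them under "function", so accept either (assumption).
        fn = call.get("function", call)
        # "arguments" is a JSON string and must be decoded before use.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written sample mirroring the documented shape; not captured server output.
sample = {
    "choices": [{
        "finish_reason": "tool",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {"name": "get_weather", "arguments": '{"city": "Paris"}'}
            ],
        },
    }]
}

for name, args in extract_tool_calls(sample):
    print(name, args)
```

Checking for the presence of `tool_calls` rather than a specific `finish_reason` value keeps the client robust if that string differs between builds.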

## Related Concepts

- [[server-api]] — the chat endpoint that accepts `tools`
- [[binary-llama-server]] — the binary to run with `--jinja`
- [[gbnf-grammars]] — related approach to constraining structured output
- [[sampling-parameters]] — controls generation that produces the calls

## Sources

- docs-function-calling
- server-readme

---
title: "GBNF Grammars"
type: concept
tags: [grammars, structured-output, sampling, llama-server, llama-cli, well-established, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/grammars-readme.md, raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# GBNF Grammars

## Definition

GBNF (short for "GGML BNF") is a grammar format used by llama.cpp to constrain model output. It is a variant of BNF (Backus-Naur Form, a classic notation for describing the syntax of a language). When you attach a GBNF grammar to a generation request, every token the model produces must keep the running text matching the grammar. This guarantees structurally valid output — for example, well-formed JSON — rather than merely asking the model to produce it and hoping it complies.

## How It Works

A grammar is a set of **production rules** of the form `nonterminal ::= sequence`. The rule named `root` is the start symbol where matching begins. Non-terminals are written as dashed-lowercase names (for example `move` or `key-facts-kv`).

The right-hand side of a rule can contain:

- **Terminals**: literal strings such as `"O-O"`, or character classes / ranges such as `[1-9]`, `[NBKQR]`, and negated classes such as `[^\n]` (any character except newline).
- **Alternatives** with `|` (one of several acceptable sequences).
- **Grouping** with `()`.
- **Repetition**: `*` (zero or more), `+` (one or more), `?` (zero or one), and exact counts `{m}`, `{m,}`, `{m,n}`, `{0,n}`.
- **Unicode escapes**: `\xXX`, `\uXXXX`, `\UXXXXXXXX`.
- **Comments** introduced with `#`.
- **Token-level matching**: `<[id]>` matches a specific token by its numeric token ID, `` matches a token by its string (which must be a single vocabulary token), and a `!` prefix negates a token match.

Here is a verbatim sample grammar (a simplified chess move list) from the docs:

```gbnf
root ::= (
    "1. " move " " move "\n"
    ([1-9] [0-9]? ". " move " " move "\n")+
)
move ::= (pawn | nonpawn | castle) [+#]?
```

And a verbatim minimal "bulleted list" grammar:

```gbnf
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```

A key behavior to understand: the grammar (or JSON schema) is **not injected into the prompt**; it only constrains sampling. The exception is tool-calling, where the schema does become part of the prompt. The model is never told what the grammar is — it is simply prevented from emitting tokens that would violate it.

## Key Parameters

On [[binary-llama-cli]] and [[binary-llama-server]]:

- `--grammar` — pass an inline grammar string.
- `--grammar-file FNAME` — load a grammar from a file. Ready-made grammars ship in the `grammars/` directory. Verbatim invocation:

  ```
  ./llama-cli -m --grammar-file grammars/some-grammar.gbnf -p 'Some prompt'
  ```

- `-j` / `--json-schema` (CLI) — supply a JSON Schema instead of raw GBNF; it is converted to a grammar automatically.

On the server / [[server-api]]:

- `grammar` request field — raw GBNF.
- `json_schema` request field — a JSON Schema.
- `response_format` on `/v1/chat/completions` — `{"type":"json_object"}` or `{"type":"json_schema","schema":{...}}`.

There is also an ahead-of-time converter, `examples/json_schema_to_grammar.py`, for turning a JSON Schema into GBNF outside the request path.

## When To Use

Use GBNF grammars whenever you need output that is guaranteed to parse: emitting JSON for a downstream program, restricting answers to a fixed enumerated set, or forcing a domain-specific notation (chess moves, dates, ID formats).
For JSON specifically, prefer the JSON-Schema path (`-j` / `json_schema` / `response_format`) so you can describe the shape declaratively.

Grammars are also the foundation underneath [[function-calling]], where the tool schema constrains the model's arguments. They work alongside ordinary [[sampling-parameters]] — the grammar restricts which tokens are *allowed*, and sampling then chooses among them.

## Risks & Pitfalls

- **Performance with many optional repetitions.** Patterns like `x? x? x?...` are slow; prefer a bounded count such as `x{0,N}`. This is a known performance gotcha tracked in issue #4218.
- **JSON-Schema feature gaps.** Many schema features are unsupported or only partially work: `patternProperties` is not supported; `prefixItems` and nested `$ref` are partially broken; numeric bounds are integer-only.
- **`additionalProperties` defaults to `false`.** The JSON-schema converter sets `additionalProperties: false` by default for faster grammars and reduced hallucination. Setting it to `true` may produce keys containing unescaped newlines.
- **Recent token-matching syntax.** The token-level matching constructs (`<[id]>`, ``, the `!` negation prefix) are a relatively recent addition — verify they exist in your target build before relying on them.
- The grammar does not steer *meaning*, only *form*: it cannot make the model's content correct, only its structure valid.

## Related Concepts

- [[function-calling]]
- [[server-api]]
- [[sampling-parameters]]
- [[binary-llama-cli]]

## Sources

- grammars-readme
- server-readme

---
title: "GGUF Format"
type: concept
tags: [gguf, ggml, foundational, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/gguf-spec.md]
confidence: high
llama_build: "master (~2026-05)"
---

# GGUF Format

## Definition

GGUF (GGML Universal File) is a single-file binary container for GGML models.
A single file holds everything needed to load and run a model: a header, typed key-value metadata, tensor descriptions, and the tensor data itself. It is the successor to the older GGML and GGJT formats, and it is designed for three things: fast loading via memory mapping (mmap), extensibility (new metadata can be added without breaking old readers), and simple single-file distribution.

## How It Works

A GGUF file is laid out in a fixed order:

```
header -> gguf_tensor_info_t[] -> padding -> tensor_data[]
```

The header (`gguf_header_t`) begins with the magic bytes `0x47 0x47 0x55 0x46` (the ASCII text "GGUF"), followed by a `uint32` version, a `uint64` tensor count, a `uint64` metadata key-value count, and then the array of key-value pairs. The current spec version is 3.

**Metadata** is a list of typed key-value pairs. Value types are given by an enum: `UINT8=0`, `INT8`, `UINT16`, `INT16`, `UINT32`, `INT32=5`, `FLOAT32=6`, `BOOL=7`, `STRING=8`, `ARRAY=9`, `UINT64=10`, `INT64=11`, `FLOAT64=12`. Strings are UTF-8 with a `uint64` length prefix — they are NOT null-terminated. Keys use ASCII `lower_snake_case` with a hierarchical dotted style (for example `general.architecture`), and may be up to 65535 bytes long.

**Alignment**: tensor data is aligned to a boundary set by `general.alignment` (a `uint32`, always a multiple of 8, default 32). The padding before the tensor data block satisfies `align_offset(offset) = offset + (ALIGN - (offset % ALIGN)) % ALIGN`.

**Tensors** are named (up to 64 bytes) and may have up to 4 dimensions. Standard tensor names include `token_embd`, `output`, and `output_norm`; per-block tensors such as `blk.N.attn_q` / `attn_k` / `attn_v` (or a fused `attn_qkv`), `attn_output`, `attn_norm`, `ffn_up` / `ffn_gate` / `ffn_down`, and `ffn_norm`. Mixture-of-Experts (MoE) models add `ffn_gate_inp` and `ffn_*_exp`; state-space (SSM) models add `ssm_in` / `ssm_conv1d` / `ssm_x` / `ssm_a` / `ssm_d` / `ssm_dt` / `ssm_out`.
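
The header layout, the length-prefixed strings, and the alignment formula above can be sketched with Python's `struct` module. This is an illustrative reader for the fixed header prefix only, assuming a little-endian file. The function names are made up for this example and it is not the reference parser. The demo bytes are synthesized in memory, not taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the bytes 0x47 0x47 0x55 0x46


def parse_header(data):
    """Parse magic, version, tensor count and KV count (24 bytes, little-endian)."""
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": kv_count}


def read_string(data, offset):
    """Read a GGUF string: uint64 length, then UTF-8 bytes, no NUL terminator.

    Returns (text, offset_after_string).
    """
    (length,) = struct.unpack_from("<Q", data, offset)
    start = offset + 8
    return data[start:start + length].decode("utf-8"), start + length


def align_offset(offset, alignment=32):
    """Round offset up to the next multiple of alignment (formula from the spec)."""
    return offset + (alignment - (offset % alignment)) % alignment


if __name__ == "__main__":
    # Synthetic bytes: a v3 header followed by one key string.
    key = "general.architecture".encode("utf-8")
    blob = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 2, 1) + struct.pack("<Q", len(key)) + key
    print(parse_header(blob))
    text, end = read_string(blob, 24)
    print(text, end, align_offset(end))
```

The outer `% alignment` in `align_offset` is what keeps an already-aligned offset unchanged instead of bumping it by a full block.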

## Key Parameters

**Required / common metadata keys:**

| Key | Notes |
| --- | --- |
| `general.architecture` | Required |
| `general.quantization_version` | Required if the model is quantized |
| `general.alignment` | Optional, default 32 |
| `general.name`, `general.file_type` | Common |
| `[llm].context_length`, `[llm].embedding_length`, `[llm].block_count` | Common architecture params |
| `[llm].attention.head_count` | Common |
| `tokenizer.ggml.*` | model, tokens, scores, token_type, merges, bos/eos/etc ids |

**Tokenizer `token_type` codes:** `1=normal`, `2=unknown`, `3=control`, `4=user-defined`, `5=unused`, `6=byte`.

**Filename convention:** `[].gguf`. Sidecars include `mmproj` and `mtp`; the Type field can be `LoRA` or `vocab`; shards are numbered `00001-of-NNNNN` (5 digits, starting at 1).

## When To Use

GGUF is the native model format for the llama.cpp toolchain. You produce a GGUF when converting a model from another format (for example with `convert_hf_to_gguf.py`) and consume GGUF files in every runtime and tool. Quantizing a model means producing a new GGUF from an existing one — see [[concepts/quantization]] and [[entities/binary-llama-quantize]].

## Risks & Pitfalls

- **Endianness gap:** version v1 was the initial format, v2 widened `uint32` length fields to `uint64`, and v3 added big-endian support. Little-endian is the default, but there is currently NO in-file flag to detect endianness — a known gap.
- **Stale `general.file_type` enum (spec doc lags code):** the spec's `general.file_type` enum is out of date. It stops at `MOSTLY_Q6_K=18` and omits the IQ-series, `Q8_K`, `BF16`, `TQ`, and `MXFP4` types that exist in the live `ggml_type` enum. Rely on the code, not the spec table, for the full type list.
- **TODO sections (spec doc lags code):** several spec sections — LoRA metadata, computation graph, and prompting — are marked TODO and are not yet authoritative.
## Related Concepts

- [[concepts/quantization]] — quantized models are stored as GGUF and record `general.quantization_version`.
- [[concepts/imatrix]] — importance data that improves quantized GGUF quality.

## Sources

- [[summaries/gguf-spec]]

---
title: "Importance Matrix (imatrix)"
type: concept
tags: [imatrix, quantization, accuracy, advanced]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/imatrix-readme.md, raw/quantize-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Importance Matrix (imatrix)

## Definition

An importance matrix (imatrix) is a set of per-weight importance statistics. It is gathered by running the full-precision (f16) model over a body of calibration text and recording which weights matter most. During [[concepts/quantization]], [[entities/binary-llama-quantize]] uses these statistics to preserve the most important weights, which markedly improves quality at low bit widths — especially for the IQ (i-quant) types.

## How It Works

The matrix is produced by [[entities/binary-imatrix]] running over a calibration corpus, and is later consumed by [[entities/binary-llama-quantize]] via `--imatrix`. Statistics are computed on **squared activations**. Reported quantities (available via `--show-statistics`) include:

- **Sum(Act^2)** — sum of squared activations.
- **%Active** — fraction of activations above a `1e-5` threshold.
- **Entropy / E(norm)** — activation entropy.
- **ZD Score** — see arXiv 2406.17415.
- **CosSim** — cosine similarity versus the prior layer.

The default output format is GGUF. A legacy `dat` format is available via `--output-format dat` or by using a non-`.gguf` extension, and conversion is bidirectional. Multiple matrices can be merged by passing `--in-file` repeatedly.

## Key Parameters

- **`--imatrix FILE`** (on `llama-quantize`) — the imatrix file that `llama-quantize` reads and applies during quantization.
- **`--process-output`** (default `false`) — whether to apply the imatrix to `output.weight`.
It is usually better NOT to, hence the default.
- **`--output-format {gguf,dat}`** — output format selection; GGUF is the default.
- **`--in-file`** — repeatable, merges multiple matrices.

## When To Use

Compute an imatrix before quantizing to a low-bit type — it is effectively required for good IQ results and helps minimize both Perplexity (ppl) and KL-Divergence (kld). Larger, more representative calibration data yields a better matrix; a few hundred KB of varied text is a common choice.

## Risks & Pitfalls

- A small or unrepresentative calibration corpus produces a weaker matrix — use varied text.
- Applying the imatrix to `output.weight` is usually counterproductive; leave `--process-output` at its default of `false` unless you have a reason not to.

## Related Concepts

- [[concepts/quantization]] — the process that consumes the imatrix.
- [[entities/binary-imatrix]] — the tool that produces the matrix.
- [[entities/binary-llama-quantize]] — the tool that applies it via `--imatrix`.

## Sources

- [[summaries/imatrix-readme]]
- [[summaries/quantize-readme]]

---
title: "KV Cache and Context"
type: concept
tags: [kv-cache, context, rope, well-established, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/cli-readme.md, raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# KV Cache and Context

## Definition

The **KV cache** (key/value cache) stores the key and value tensors computed for past tokens so that text generation can proceed incrementally — each new token reuses cached state instead of recomputing the whole sequence. **Context length** (`n_ctx`) is the maximum number of tokens the model can attend to at once. The size of the KV cache scales with the context length and the model.

## How It Works

As the model processes a prompt and then generates tokens one at a time, it appends each token's key/value tensors to the KV cache. Future tokens attend to this cached history, which is why generation is incremental and fast.
When the cache fills to the context limit, you must either stop, shift the window, or use a context-extension technique.

**RoPE (rotary position embeddings)** encode token positions inside attention. **YaRN** is a scaling method that extends usable context beyond the model's original training length. GGUF model files store `rope.*` metadata (such as `freq_base`, `scaling.type`/`factor`, and `original_context_length`) that informs these settings.

## Key Parameters

**Context size and generation**

- `-c` / `--ctx-size` — context length; `0` takes it from the model's training context.
- `-n` / `--predict` — number of tokens to generate; `-1` = infinite.

**KV cache data type and placement**

- `-ctk` / `--cache-type-k` and `-ctv` / `--cache-type-v` — KV cache data type, default `f16`. Allowed: `f32`, `f16`, `bf16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, `q5_1`. Quantizing the cache (e.g. `q8_0`) cuts memory at some quality cost.
- `-kvo` / `--kv-offload` vs `-nkvo` / `--no-kv-offload` — default is to offload the cache to the GPU.
- `-cram` / `--cache-ram` — default `8192` MiB.
- `-kvu` / `--kv-unified`.

**Context management**

- `--context-shift` / `--no-context-shift` — slide the window when the context is full; **disabled by default**.
- `--swa-full` and `-ctxcp` / `--ctx-checkpoints` (`32`) — for sliding-window-attention (SWA) models.
- `--cache-prompt` + `--cache-reuse N` — prompt caching (server).

**RoPE / context extension**

- `--rope-scaling {none,linear,yarn}`, `--rope-scale`, `--rope-freq-base`, `--rope-freq-scale`.
- `--yarn-orig-ctx`, `--yarn-ext-factor` (`-1.00`), `--yarn-attn-factor`, `--yarn-beta-slow`, `--yarn-beta-fast`.

## When To Use

- Lower the KV cache precision (`-ctk`/`-ctv q8_0`) to fit longer contexts or larger models in limited VRAM/RAM.
- Enable `--context-shift` for endless generation that should keep going past the context limit by sliding the window.
- Use the RoPE / YaRN flags to run a model beyond its trained context length.
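
To make the memory trade-off behind `-c` and `-ctk` / `-ctv` concrete, the sketch below estimates KV cache size from the usual accounting of one key and one value vector per token, per layer, per KV head. It is a rough estimate under stated assumptions, not llama.cpp's own calculation: the function name is hypothetical, the model dimensions in the example are illustrative, sliding-window layers and allocator overhead are ignored, and the per-element sizes for `q8_0` and `q4_0` assume ggml's 32-value blocks of 34 and 18 bytes.

```python
# Approximate storage cost per cached element, in bytes.
# Assumption: q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size: one K and one V vector per token, layer and KV head."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    return int(elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]))


# Illustrative dimensions (assumed): 32 layers, 8 KV heads, head size 128, 8192-token context.
f16_size = kv_cache_bytes(8192, 32, 8, 128)
q8_size = kv_cache_bytes(8192, 32, 8, 128, type_k="q8_0", type_v="q8_0")
print(f"f16: {f16_size / 2**20:.0f} MiB, q8_0: {q8_size / 2**20:.0f} MiB")
```

Under these assumptions, switching both cache types from `f16` to `q8_0` roughly halves the cache, and doubling `-c` doubles it.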
## Risks & Pitfalls

- Quantizing the KV cache reduces memory but costs some quality.
- `--context-shift` is **off by default**; long runs will otherwise stop or error when the context fills.
- `-dt` / `--defrag-thold` is **deprecated**.
- Pushing context far past training length via RoPE/YaRN can degrade quality if scaling parameters are mis-set.

## Related Concepts

- [[sampling-parameters]] — operates on the tokens within this context.
- [[speculative-decoding]] — also accelerates generation, complementary to KV caching.
- [[gguf-format]] — stores the `rope.*` metadata that drives context extension.
- [[binary-llama-cli]] — sets these flags in practice.

## Sources

- cli-readme
- server-readme

---
title: "Multimodal (mtmd)"
type: concept
tags: [multimodal, vision, audio, llama-server, llama-cli, experimental, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/docs-multimodal.md, raw/mtmd-readme.md]
confidence: medium
llama_build: "master (~2026-05)"
---

# Multimodal (mtmd)

## Definition

**mtmd** (the `libmtmd` library) is llama.cpp's modern multimodal support layer — the part of the project that lets a language model accept images and audio in addition to text. It replaces the older `llava.cpp` / `clip.cpp` example stack and consolidates the previous per-model command-line tools (`llava-cli`, `qwen2vl-cli`, `minicpmv-cli`, `gemma3-cli`) into a single `mtmd-cli` (introduced in PRs #12849 and #13012). It is built on top of `clip.cpp`, and its API is inspired by the HuggingFace Transformers `Processor` abstraction. Image input is **stable**; audio input is **experimental**.

> **Caveat:** multimodal support in llama.cpp is under very heavy development. Breaking changes are expected, and the binary names, flags, and supported-model roster described here are point-in-time for master (~2026-05).

## How It Works

A multimodal setup requires **two GGUF files** (see [[gguf-format]]):

1. The **language model** itself.
2. An **mmproj** file — a *multimodal projector* that encodes images or audio into embeddings the language model can consume.

The projector turns pixels (or audio) into the same kind of vectors the LLM normally receives from text tokens, so the model can "read" the media as part of its context. You can let llama.cpp fetch both pieces automatically from a model repo, or point at each file explicitly.

## Key Parameters

Run a multimodal model in one of two ways:

- `-hf ` — auto-downloads the model **and** a matching mmproj projector.
- `-m model.gguf --mmproj projector.gguf` — supply both files explicitly.

Relevant flags:

- `--mmproj FILE` — path to the projector file.
- `--mmproj-url` — fetch the projector from a URL.
- `--no-mmproj` — disable multimodal on an `-hf` model.
- `--no-mmproj-offload` — keep the projector on the CPU (by default it is offloaded to the GPU).
- `-c` — some models need a large context, e.g. `-c 8192`.

To produce an mmproj file yourself, convert with `convert_hf_to_gguf.py --mmproj`.

The tools that support multimodal input are [[binary-mtmd]] (`llama-mtmd-cli`) and [[binary-llama-server]].

## When To Use

Use mtmd whenever you need a vision- or audio-capable model: describing or answering questions about images, OCR-style tasks, or speech/audio understanding. For interactive one-off use on the command line, reach for `llama-mtmd-cli`; for serving multimodal requests over HTTP, use `llama-server`.

On the [[server-api]], send images through `/v1/chat/completions` using `image_url` content parts (base64 or URL), or use the native `/completion` endpoint with `{"prompt_string", "multimodal_data":["base64"]}` plus a media marker in the prompt. Check the `modalities` / `multimodal` capability in `/props` or `/v1/models` to confirm a running server actually has a projector loaded.

## Supported Multimodal Families

Confidence on this roster is **medium** — it churns frequently.

| Modality | Model families |
|---|---|
| Vision | Gemma 3 (4b/12b/27b) & Gemma 4, SmolVLM / SmolVLM2, Pixtral 12B, Qwen2-VL / Qwen2.5-VL, Mistral Small 3.1, InternVL 2.5/3, Llama 4 Scout, Moondream2, MiniCPM-V/o, MobileVLM, GLM-Edge, Granite Vision |
| Audio | Ultravox 0.5, Voxtral, Qwen3-ASR |
| Mixed audio + vision | Qwen2.5-Omni, Qwen3-Omni |

## Risks & Pitfalls

- **Heavy development.** Expect breaking changes; pin to a known build for production use.
- **Two-file requirement.** Forgetting the mmproj, or pairing a mismatched projector with a model, will fail or misbehave — prefer `-hf` so the matching projector is fetched automatically.
- **Audio is experimental.** Treat audio support as unstable.
- **Context size.** Some models need a noticeably larger `-c` (e.g. 8192) than their text-only counterparts.
- **Capability detection.** Always confirm via `/props` or `/v1/models` that the server reports multimodal capability before sending media.

## Related Concepts

- [[binary-mtmd]]
- [[server-api]]
- [[gguf-format]]

## Sources

- docs-multimodal
- mtmd-readme

---
title: "Quantization"
type: concept
tags: [quantization, ggml, memory, accuracy, well-established, llama-quantize]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/quantize-readme.md, raw/gguf-spec.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Quantization

## Definition

Quantization is the practice of storing a model's weights at a reduced number of bits per weight. This shrinks the model on disk and in memory and can speed up inference, at the cost of some accuracy. In llama.cpp the set of available quantization types is defined by the `ggml_type` enum, and quantized models are distributed in the [[concepts/gguf-format]] container.

## How It Works

A model usually starts in full precision (16-bit float). Quantization replaces those weights with lower-bit representations, grouped into families with different size/quality trade-offs:

- **Legacy:** `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`.
- **k-quants:** `Q2_K`, `Q3_K_S/M/L`, `Q4_K_S/M`, `Q5_K_S/M`, `Q6_K`, `Q8_K`. These are mixed-precision, per-block "K" quants; the `_S` / `_M` / `_L` suffixes mean small / medium / large mixes.
- **i-quants (IQ):** `IQ1_S`, `IQ1_M`, `IQ2_XXS/XS/S/M`, `IQ3_XXS/XS/S/M`, `IQ4_XS`, `IQ4_NL`. These use codebooks and are the best option at very low bit widths, but they need an importance matrix to work well.
- **Other:** `BF16`, `F16`, `TQ1_0` / `TQ2_0` (ternary), and `MXFP4` (used by gpt-oss).
- **Removed legacy types:** `Q4_2`, `Q4_3`, and the repacked `*_4_4` / `*_4_8` / `*_8_8` variants.

The full workflow is: convert an HF model to an f16/bf16 GGUF (`convert_hf_to_gguf.py`), optionally compute an importance matrix, then run [[entities/binary-llama-quantize]] to produce the target type. GPU/CPU support for each type depends on the backend — see [[concepts/build-and-backends]].

## Key Parameters

- **Recommended default:** `Q4_K_M` gives the best size/quality balance in the official examples.
- **`--pure`** disables the k-quant mixtures and uses a single uniform type instead.
- **Importance matrix:** low-bit (especially IQ) quality is markedly improved by an importance matrix — see [[concepts/imatrix]].

**Bits/weight (Llama-3.1-8B):**

| Type | Bits/weight |
| --- | --- |
| IQ1_S | 2.00 |
| Q2_K | ~3.35 |
| IQ4_NL | 4.68 |
| Q4_K_M | 4.89 |
| Q5_K_M | 5.70 |
| Q6_K | 6.56 |
| Q8_0 | 8.50 |
| F16 | 16.00 |

**Size mapping (Llama 3.1, f16 -> Q4_K_M):**

| Model | f16 | Q4_K_M |
| --- | --- | --- |
| 8B | 32.1 GB | 4.9 GB |
| 70B | 280.9 GB | 43.1 GB |
| 405B | 1625.1 GB | 249.1 GB |

## When To Use

Quantize whenever a full-precision model is too large to fit in available memory or too slow, and some accuracy loss is acceptable. `Q4_K_M` is the usual starting point; drop to IQ types (with an imatrix) only when you need to fit into very tight memory budgets.

## Risks & Pitfalls

- Quantization always trades away some accuracy.
Measure the loss with **Perplexity (ppl)** and **KL-Divergence (kld)**; both are minimized by using an [[concepts/imatrix]].
- The lowest-bit IQ types degrade badly without an importance matrix — do not ship them blind.
- `--pure` removes the per-tensor precision mixing that makes k-quants robust, so use it deliberately.

## Related Concepts

- [[concepts/gguf-format]] — the container quantized models live in.
- [[concepts/imatrix]] — importance data that preserves accuracy at low bit widths.
- [[concepts/build-and-backends]] — backend support determines which types run where.
- [[entities/binary-llama-quantize]] — the tool that performs quantization.

## Sources

- [[summaries/quantize-readme]]
- [[summaries/gguf-spec]]

---
title: "Sampling Parameters"
type: concept
tags: [sampling, llama-cli, well-established, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/cli-readme.md, raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Sampling Parameters

## Definition

Sampling is how llama.cpp chooses the next token from the model's probability distribution over its vocabulary. Rather than always picking the single most likely token, sampling lets you trade off determinism against diversity. In llama.cpp this is done by applying a configurable *chain* of sampler stages, one after another, where each stage filters or reshapes the candidate token set before the final token is drawn.

## How It Works

The model produces a raw score (a "logit") for every token in its vocabulary. The sampler chain transforms those logits/probabilities step by step in a fixed order, and the final stage picks a token.

The default sampler chain (set via `--samplers`) is:

```
penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature
```

The same order can be expressed in short form with `--sampling-seq` as `edskypmxt`. You can reorder or remove stages to change behavior.
Because the order matters, moving `temperature` earlier or later, for example, changes the result.

Brief description of each stage:

- **temperature** scales the logits; higher values make the distribution flatter and output more random.
- **top-k** keeps only the K most likely tokens.
- **top-p (nucleus)** keeps the smallest set of tokens whose probabilities sum to at least p.
- **min-p** keeps tokens whose probability is at least `min-p * (top token probability)`.
- **typical-p** performs locally-typical sampling.
- **top-n-sigma** keeps tokens within n standard deviations of the distribution.
- **repeat / frequency / presence penalties** discourage repetition of tokens already seen.
- **DRY** ("Don't Repeat Yourself") applies an n-gram repetition penalty.
- **XTC** (exclude-top-choices) drops the most probable tokens with some probability to increase diversity.
- **mirostat** dynamically targets a perplexity setpoint.

## Key Parameters

| Flag | Default | Notes |
|------|---------|-------|
| `--temp` | `0.80` | Logit scaling; higher = more random |
| `--top-k` | `40` | `0` = off |
| `--top-p` | `0.95` | `1.0` = off |
| `--min-p` | `0.05` | `0.0` = off |
| `--typical` / `--typical-p` | `1.00` | Off |
| `--top-n-sigma` / `--top-nsigma` | `-1.00` | Off |
| `--xtc-probability` | `0.00` | |
| `--xtc-threshold` | `0.10` | |
| `--repeat-last-n` | `64` | Window for repetition penalty |
| `--repeat-penalty` | `1.00` | Off in CLI (see pitfall below) |
| `--presence-penalty` | `0.00` | |
| `--frequency-penalty` | `0.00` | |
| `--dry-multiplier` | `0.00` | DRY off when 0 |
| `--dry-base` | `1.75` | |
| `--dry-allowed-length` | `2` | |
| `--dry-penalty-last-n` | `-1` | Default breakers: `\n : " *` |
| `--mirostat` | `0` | `1` = v1, `2` = v2 |
| `--mirostat-lr` | `0.10` | Learning rate |
| `--mirostat-ent` | `5.00` | Target entropy |
| `--dynatemp-range` | `0.00` | Dynamic temperature range |
| `--dynatemp-exp` | `1.00` | Dynamic temperature exponent |
| `-s` / `--seed` | `-1` | `-1` = random |

There is also a recent **adaptive-p** sampler (PR #17927): `--adaptive-target` (`-1.00` = off) and `--adaptive-decay` (`0.90`).

Additional controls include `-l` / `--logit-bias`, `--ignore-eos`, and the experimental `-bs` / `--backend-sampling`.

## When To Use

- For more focused, deterministic output, lower `--temp` and/or tighten `--top-p` / `--top-k` / `--min-p`.
- For more creative or varied output, raise `--temp` or enable XTC.
- To suppress repetition or looping, enable DRY (`--dry-multiplier`) or the repetition penalties.
- To target a stable perplexity automatically, enable mirostat.

## Risks & Pitfalls

- **CLI vs server default mismatch:** the server `/completion` request defaults `repeat_penalty` to `1.1`, while the CLI `--repeat-penalty` default is `1.00` (off). The same prompt can therefore behave differently between [[binary-llama-cli]] and the server unless you set the value explicitly.
- Sampler *order* matters; reordering the chain changes output.
- `--backend-sampling` is experimental.

## Related Concepts

- [[server-api]] — exposes these same samplers as request fields (and is where the `repeat_penalty` default differs).
- [[gbnf-grammars]] — constrains *which* tokens are allowed, complementary to sampling.
- [[kv-cache-and-context]] — governs the context the sampler operates within.
- [[binary-llama-cli]] — the primary tool for setting these flags.

## Sources

- cli-readme
- server-readme

---
title: "Server API (llama-server REST endpoints)"
type: concept
tags: [server-api, api, llama-server, deployment, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Server API (llama-server REST endpoints)

## Definition

The **Server API** is the set of HTTP/REST endpoints exposed by [[binary-llama-server|llama-server]], the pure C/C++ web server that wraps llama.cpp inference.
It serves three families of endpoints from a single process:

- **Native** endpoints, specific to llama.cpp.
- **OpenAI-compatible** endpoints (drop-in replacements for the OpenAI REST API).
- **Anthropic-compatible** endpoints.

This lets existing OpenAI or Anthropic client libraries talk to a local model with little or no code change, while the native endpoints expose lower-level features (raw completion, tokenization, embeddings, slot management). By default the server listens on `127.0.0.1:8080`. Responses use [nlohmann::json](https://github.com/nlohmann/json), and the server itself is built on [cpp-httplib](https://github.com/yhirose/cpp-httplib).

## How It Works

The server is started against a model file and then accepts JSON requests over HTTP. A minimal start and call:

```sh
./llama-server -m models/7B/ggml-model.gguf -c 2048
```

```sh
curl --request POST --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt":"Building a website can be done in 10 simple steps:","n_predict":128}'
```

Each request is assigned to a **slot** (an independent inference context). With `-np N` the server holds `N` parallel slots, and **continuous batching** (on by default) interleaves their tokens so multiple requests progress concurrently.

Health is reported at `GET /health`, which returns `200 {"status":"ok"}` once the model is ready and `503` while it is still loading. Errors follow the OpenAI shape: `{"error":{"code","message","type"}}`.
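
The readiness behaviour of `/health` described above lends itself to a small polling helper. The following is a sketch under assumptions rather than an official client: the function names are hypothetical, and a connection that is refused because the process has not opened its port yet is deliberately not handled and would surface as `urllib.error.URLError`.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> int:
    """Return the HTTP status of a GET request, including error statuses such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return err.code


def wait_until_ready(probe, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() until it returns 200; treat 503 as 'model still loading'."""
    for _ in range(attempts):
        status = probe()
        if status == 200:
            return True
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        time.sleep(delay)
    return False
```

Against a server on the default address this would be called as `wait_until_ready(lambda: http_probe("http://127.0.0.1:8080/health"))`. Passing the probe as a callable keeps the retry logic testable without a running server.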
### Endpoints

| Endpoint | Method(s) | Family | Purpose |
|---|---|---|---|
| `/health` (also `/v1/health`) | GET | Native | Readiness check (`200`/`503`) |
| `/completion` | POST | Native | Raw text completion |
| `/tokenize` | POST | Native | Text -> tokens |
| `/detokenize` | POST | Native | Tokens -> text |
| `/apply-template` | POST | Native | Render the chat template |
| `/embedding(s)` | POST | Native | Embeddings (supports `--pooling none`) |
| `/reranking` (also `/rerank`, `/v1/rerank`) | POST | Native | Rerank documents |
| `/infill` | POST | Native | Fill-in-the-middle |
| `/props` | GET, POST | Native | Server/model properties |
| `/slots` | GET | Native | List slots |
| `/slots/{id}?action=save\|restore\|erase` | POST | Native | Manage a slot's KV state |
| `/metrics` | GET | Native | Prometheus metrics (if `--metrics`) |
| `/lora-adapters` | GET, POST | Native | List/set LoRA adapters |
| `/v1/models` | GET | OpenAI | List models |
| `/v1/completions` | POST | OpenAI | Text completion |
| `/v1/chat/completions` | POST | OpenAI | Chat completion |
| `/v1/responses` | POST | OpenAI | Responses API |
| `/v1/embeddings` | POST | OpenAI | Embeddings (requires pooling != none) |
| `/v1/messages` | POST | Anthropic | Messages API |
| `/v1/messages/count_tokens` | POST | Anthropic | Token counting |

The OpenAI-compatible `/v1/chat/completions` endpoint adds several extras beyond the standard schema: `response_format` (`"json_object"` or `"json_schema"`), `chat_template_kwargs`, `reasoning_format`, `parse_tool_calls`, and `parallel_tool_calls`. Its responses include `timings`, `usage`, and `reasoning_content`.

## Key Parameters

These flags configure the server itself (environment variables mirror them as `LLAMA_ARG_*`; the CLI overrides the environment).

| Flag | Default | Meaning |
|---|---|---|
| `-m` / `--model` | — | GGUF model file to load |
| `-hf /[:quant]` | — | Pull a model straight from Hugging Face (default quant `Q4_K_M`) |
| `-c` / `--ctx-size N` | from model when `0` | Context window size |
| `-n` / `--predict N` | — | Max tokens to predict (`-1` = infinite) |
| `-ngl` / `--n-gpu-layers N` | `auto` (also `all`) | Layers offloaded to GPU |
| `-np` / `--parallel N` | `-1` = auto | Number of parallel slots |
| `-cb` / `--cont-batching` | ON (`-nocb` disables) | Continuous batching |
| `-fa` / `--flash-attn [on\|off\|auto]` | `auto` | Flash attention |
| `-b` / `--batch-size` | `2048` | Logical batch size |
| `-ub` / `--ubatch-size` | `512` | Physical (micro) batch size |
| `--host` / `--port` | `127.0.0.1` / `8080` | Bind address |
| `--api-key` / `--api-key-file` | — | API key auth (env `LLAMA_API_KEY`) |
| `--ssl-key-file` / `--ssl-cert-file` | — | Enable HTTPS |
| `--metrics` | OFF | Expose Prometheus metrics |
| `--jinja` | ON | Jinja chat templating (enables tool calling) |
| `--chat-template ` | — | Built-in template (chatml, llama2/3/4, deepseek, gemma, mistral-v*, gpt-oss, granite, qwen...) |
| `--reasoning-format {none,deepseek,deepseek-legacy}` | — | How reasoning traces are surfaced |
| `-rea` / `--reasoning [on\|off\|auto]`, `--reasoning-budget N` | — | Reasoning controls |
| `--ui` / `--no-ui` | — | Built-in web UI (old `--webui*` names deprecated) |
| `-a` / `--alias` | — | Public model name |

Request-level fields for `/completion` include `prompt`, `n_predict`, `temperature`, `top_k`, `top_p`, `min_p`, `typical_p`, `n_keep`, `stop`, `seed`, `logit_bias`, `n_probs`, `grammar`, `json_schema`, `cache_prompt`, `samplers`, and `ignore_eos`. See [[sampling-parameters]] and [[gbnf-grammars]] for the sampling and grammar fields.
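
As an illustration of the request-level fields, the sketch below assembles a `/completion` request using only the Python standard library. The helper name is hypothetical, the field values are arbitrary examples rather than recommendations, and the base URL assumes the default bind address; nothing is sent until the request object is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:8080", **fields):
    """Build (but do not send) a POST request for the native /completion endpoint."""
    body = {"prompt": prompt, "n_predict": 128}
    body.update(fields)  # any request-level field: temperature, top_k, stop, seed, ...
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "Building a website can be done in 10 simple steps:",
    temperature=0.2,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    seed=42,             # fixed seed for repeatable sampling
    stop=["\n\n"],       # stop strings
    cache_prompt=True,   # reuse the cached prompt prefix
)
```

With a server running, `urllib.request.urlopen(req)` would return the JSON response. Setting sampling fields explicitly in the request body avoids relying on server-side defaults.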
### Prompt caching and KV state

Prompt caching is on by default (`--cache-prompt`) and is requested per-call with `cache_prompt:true`; partial reuse is controlled by `--cache-reuse N` / `n_cache_reuse`. The KV cache data type is set with `-ctk` / `-ctv` (default `f16`; `q8_0`, `q4_0`, etc.). Slot state can be persisted with `--slot-save-path` and the `save`/`restore`/`erase` actions. See [[kv-cache-and-context]].

### Router / multi-model mode

Launch with **no** `-m` to enter router mode: requests select a model by the `"model"` JSON field (POST) or `?model=` query (GET). Configure with `--models-dir`, `--models-preset` (an `.ini` file), and `--models-max` (default `4`). `--sleep-idle-seconds N` (`-1` = off) unloads an idle model and its KV cache.

## When To Use

- You want to serve a model over HTTP to multiple clients or apps.
- You need OpenAI- or Anthropic-compatible endpoints so existing SDKs work unchanged.
- You need embeddings ([[embeddings]]), reranking, function calling ([[function-calling]]), or multimodal input ([[multimodal-mtmd]]) behind a stable API.
- You want to run several models from one process (router mode).

## Risks & Pitfalls

- The server binds to `127.0.0.1` by default; setting `--host 0.0.0.0` exposes it to the network — protect it with `--api-key` and/or TLS.
- `/v1/embeddings` requires a pooling mode other than `none`; the native `/embeddings` endpoint is the one that supports `--pooling none`.
- Metrics are OFF unless `--metrics` is passed.
- Aggressive KV cache quantization (e.g. `-ctk q4_0`) saves memory but can degrade quality, especially [[function-calling|tool calling]].
## Related Concepts

- [[binary-llama-server]] — the binary that implements this API
- [[sampling-parameters]] — request sampling fields
- [[gbnf-grammars]] — `grammar` / `json_schema` constrained output
- [[function-calling]] — tool calls over the chat endpoint
- [[embeddings]] — embedding and reranking endpoints
- [[multimodal-mtmd]] — multimodal input
- [[kv-cache-and-context]] — context size and KV cache types

## Sources

- server-readme

---
title: "Speculative Decoding"
type: concept
tags: [speculative-decoding, performance, well-established, advanced]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/cli-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# Speculative Decoding

## Definition

Speculative decoding is a technique that speeds up text generation. A small, fast **draft** model proposes several tokens ahead, and the large **target** model then verifies all of them in a single batch. Tokens the target model accepts are effectively "free" because they were validated together rather than generated one at a time. The throughput win is largest when the acceptance rate is high.

## How It Works

At each step the draft proposes a short run of candidate tokens. The target model runs once over that batch and confirms how many of the proposed tokens it would have produced itself; accepted tokens are kept and the rest are discarded. Some draft strategies do not even need a separate draft model — for example **n-gram** drafting predicts continuations from recent context, and **MTP** (multi-token-prediction) and **Eagle3** are model-internal draft strategies.

## Key Parameters

The speculative-decoding flags were heavily reworked on master:

- `--spec-draft-model` / `-md` / `--model-draft` — the draft model GGUF.
- `--spec-draft-n-max` — tokens drafted per step (default `3`).
- `--spec-draft-n-min` — minimum tokens drafted (default `0`).
- `--spec-type {none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache}` — selects the draft strategy.
- Many `--spec-ngram-*` flags tune the n-gram strategies.
- The draft model gets its own mirror flags for HF source, threads, CPU, and ngl (GPU layers).
- `--spec-default` — a convenience preset.

**Removed legacy flags** (note if migrating from older builds): `--draft`, `--draft-n`, `--draft-max`, `--draft-min`, and `--draft-n-min` have been removed. Use `--spec-draft-n-max` and `--spec-ngram-mod-n-max` instead.

## When To Use

- When you want faster generation from a large target model and have either a small compatible draft model or can use a draft-free strategy (n-gram, MTP, Eagle3).
- When the draft model's predictions are usually accepted by the target — high acceptance is what produces the speedup.

## Risks & Pitfalls

- This area is **fast-moving**; flag names and strategy options change between builds. The concept is well established but flag-level detail should be re-checked against your build.
- If acceptance is low (a poorly matched draft model), speculative decoding can add overhead instead of saving time.
- Old `--draft*` flags no longer exist and must be replaced with their `--spec-*` equivalents.

## Related Concepts

- [[binary-llama-cli]] — the tool that exposes these `--spec-*` flags.
- [[kv-cache-and-context]] — another generation-acceleration mechanism.
- [[quantization]] — draft models are often small/quantized for speed.

## Sources

- cli-readme

---
title: "CPU Backend"
type: entity
tags: [cpu, blas, build, foundational, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build, llamacpp-readme]
confidence: high
llama_build: "master (~2026-05)"
---

# CPU Backend

## Overview

The CPU backend is the **default** backend in llama.cpp. It needs no flag and is always built. It targets x86 (AVX/AVX2/AVX512/AMX), ARM (NEON), and RISC-V (RVV) instruction sets.
## Characteristics

Because it is always present, the CPU backend acts as the universal fallback and as the host side of CPU+GPU hybrid inference. Several optional acceleration add-ons can speed it up:

- **BLAS** (`-DGGML_BLAS=ON`): a numerical linear-algebra library. It helps prompt processing only for batch sizes greater than 32.
- **Apple Accelerate**: default on Mac.
- **Arm KleidiAI** (`-DGGML_CPU_KLEIDIAI=ON`): controlled at runtime via the `GGML_KLEIDIAI_SME` environment variable.
- **AMD ZenDNN** for EPYC (`-DGGML_ZENDNN=ON`): auto-downloads ZenDNN on the first build.

## How to Use

The CPU backend requires no special build flag. For CPU+GPU hybrid inference you offload some layers to a GPU with `-ngl` while the rest run on CPU. The `-cmoe` / `--cpu-moe` option keeps MoE (Mixture-of-Experts) expert tensors on the CPU.

See [[concepts/build-and-backends]] for the full build flow and the backend matrix.

## Related Entities

- [[concepts/build-and-backends]]
- [[entities/project-llama-cpp]]

---
title: "CUDA Backend"
type: entity
tags: [cuda, build, well-established, deployment]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build]
confidence: high
llama_build: "master (~2026-05)"
---

# CUDA Backend

## Overview

The CUDA backend runs inference on **NVIDIA GPUs** using custom CUDA kernels. It is the most-used backend for NVIDIA hardware. It requires the CUDA toolkit and is enabled with the flag `-DGGML_CUDA=ON`.

## Characteristics

Verbatim build command:

```
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
```

Architecture options:

- **Portable all-arch build:** `-DGGML_NATIVE=OFF`.
- **Pin specific architectures:** `-DCMAKE_CUDA_ARCHITECTURES="86;89"`.

Runtime environment variables:

- `CUDA_VISIBLE_DEVICES`
- `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` (use host RAM for overflow)
- `GGML_CUDA_P2P`
- `CUDA_SCALE_LAUNCH_QUEUES=4x`

Compile-time performance options: `GGML_CUDA_FORCE_MMQ`, `GGML_CUDA_FORCE_CUBLAS`, `GGML_CUDA_PEER_MAX_BATCH_SIZE` (128), and `GGML_CUDA_FA_ALL_QUANTS`.

A prebuilt Docker image is available as `:server-cuda`.

## How to Use

Install the CUDA toolkit, then configure and build with `-DGGML_CUDA=ON` as shown above. Use `CUDA_VISIBLE_DEVICES` to select GPUs at runtime. See [[concepts/build-and-backends]] for the general build flow and backend matrix.

## Related Entities

- [[concepts/build-and-backends]]
- [[entities/project-llama-cpp]]

---
title: "Metal Backend"
type: entity
tags: [metal, build, well-established, foundational]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build]
confidence: high
llama_build: "master (~2026-05)"
---

# Metal Backend

## Overview

The Metal backend runs inference on the **Apple Silicon (M-series) GPU** using Apple's Metal API. It is a first-class target: the project is heavily optimized for Apple Silicon, combining NEON, Accelerate, and Metal.

## Characteristics

Metal is **enabled by default on macOS**, so no flag is needed to turn it on. To disable it at build time, use `-DGGML_METAL=OFF`. At runtime, `-ngl 0` disables GPU inference (falling back to CPU).

## How to Use

On macOS the default build already includes Metal; just follow the standard build flow in [[concepts/build-and-backends]]. To build without GPU support, add `-DGGML_METAL=OFF`. To run on CPU only at runtime, pass `-ngl 0`.
## Related Entities

- [[concepts/build-and-backends]]
- [[entities/project-llama-cpp]]

---
title: "ROCm / HIP Backend"
type: entity
tags: [rocm, build, deployment, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build]
confidence: high
llama_build: "master (~2026-05)"
---

# ROCm / HIP Backend

## Overview

The ROCm backend runs inference on **AMD GPUs** via HIP/ROCm. **ROCm** is AMD's GPU compute SDK, and **HIP** is the programming interface within it that the build flag is named after: `-DGGML_HIP=ON`. (The older `-DGGML_HIPBLAS` flag is superseded by `-DGGML_HIP=ON`.)

## Characteristics

- Enable flag: `-DGGML_HIP=ON`, with optional GPU target `-DGPU_TARGETS=gfx1030`.
- rocWMMA flash-attention: `-DGGML_HIP_ROCWMMA_FATTN=ON`.
- Runtime environment variables: `HIP_VISIBLE_DEVICES` and `HSA_OVERRIDE_GFX_VERSION` (the latter is not available on Windows).

Verbatim build command:

```
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16
```

## How to Use

Install ROCm, then configure and build using the verbatim command above (adjusting `-DGPU_TARGETS` to your GPU architecture). Use `HIP_VISIBLE_DEVICES` to select GPUs at runtime. See [[concepts/build-and-backends]] for the general build flow and backend matrix.

## Related Entities

- [[concepts/build-and-backends]]
- [[entities/project-llama-cpp]]

---
title: "Vulkan Backend"
type: entity
tags: [vulkan, build, deployment, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: [docs-build]
confidence: high
llama_build: "master (~2026-05)"
---

# Vulkan Backend

## Overview

The Vulkan backend is a **cross-vendor GPU backend**. It works on Windows, Linux, macOS, and Docker, and runs on AMD, NVIDIA, and Intel GPUs, including integrated GPUs. It is a good vendor-neutral option when neither CUDA nor ROCm is available. It is enabled with `-DGGML_VULKAN=ON`.
## Characteristics - Requires the **Vulkan SDK** plus **SPIRV-Headers**. - On macOS, use MoltenVK / KosmicKrisp via the `VK_ICD_FILENAMES` environment variable, and combine with `-DGGML_METAL=OFF`. ## How to Use Install the Vulkan SDK and SPIRV-Headers, then configure with `-DGGML_VULKAN=ON` and follow the standard build flow in [[concepts/build-and-backends]]. On macOS, set `VK_ICD_FILENAMES` to a MoltenVK/KosmicKrisp ICD and add `-DGGML_METAL=OFF`. ## Related Entities - [[concepts/build-and-backends]] - [[entities/project-llama-cpp]] --- title: "llama-imatrix (binary)" type: entity tags: [imatrix, quantization, accuracy, developer] created: 2026-05-30 updated: 2026-05-30 sources: [raw/imatrix-readme.md] confidence: high llama_build: "master (~2026-05)" --- # llama-imatrix (binary) ## Overview `llama-imatrix` computes and manages importance matrices. It runs a full-precision model over a calibration corpus and records per-weight importance statistics, producing the file that [[entities/binary-llama-quantize]] later applies to improve low-bit [[concepts/quantization]] quality. Its source lives in `tools/imatrix`. See [[concepts/imatrix]] for the underlying concept. ## Characteristics - Requires a model and a calibration text file. - Default output format is GGUF; a legacy `dat` format is also available. - Can merge multiple matrices and report detailed statistics. - Offloading to GPU (`-ngl 99`) speeds up the calibration pass. **Mandatory inputs:** | Flag | Purpose | | --- | --- | | `-m` / `--model` | Model to calibrate | | `-f` / `--file` | Calibration text | **Other flags:** `-ngl 99` (GPU offload for speed), `-o` / `--output`, `--output-format {gguf,dat}`, `--output-frequency`, `--save-frequency`, `--no-ppl`, `--process-output`, `--parse-special`, `--chunk` / `--chunks`, `--in-file` (merge), `--show-statistics`, `-lv` / `--verbosity`. 
## How to Use

Compute a matrix from calibration data, offloading to GPU:

```
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99
```

Merge several previously computed matrices into one:

```
./llama-imatrix --in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf -o imatrix-combined.gguf
```

The resulting file is then passed to [[entities/binary-llama-quantize]] via its `--imatrix` flag.

## Related Entities

- [[entities/binary-llama-quantize]]: consumes the imatrix this tool produces.
- [[concepts/imatrix]]: the concept and statistics behind the tool.
- [[concepts/quantization]]: the process the imatrix improves.

---
title: "llama-bench"
type: entity
tags: [llama-bench, benchmarking, performance, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/llama-bench-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# llama-bench

## Overview

`llama-bench` (in `tools/llama-bench`) is llama.cpp's standard throughput benchmark. It reports average tokens per second with a standard deviation (`avg ± stddev`). It measures pure model throughput and **excludes** tokenization and sampling time.

## Characteristics

**Test types**

- `pp`: prompt processing (`-p` / `--n-prompt`, default `512`).
- `tg`: text generation (`-n` / `--n-gen`, default `128`).
- `pg`: combined prompt + generation (`-pg pp,tg`).
- `-r`: repetitions (default `5`, results averaged).
- `-d` / `--n-depth`: prefills the KV cache; shown as `@ d` (e.g. `pp512 @ d512`).

**Parameters and defaults**

| Flag | Default |
|------|---------|
| `-b` | `2048` |
| `-ub` | `512` |
| `-ctk` / `-ctv` | `f16` |
| `-ngl` | `-1` |
| `-ncmoe` | `0` |
| `-sm` | `layer` |
| `-mg` | `0` |
| `-nkvo` | `0` |
| `-fa` | `auto` |
| `-dev` | `auto` |
| `-ts` | `0` |

**Sweeps:** values can be given as comma lists or repeated flags; numeric ranges use `first-last[+step|*mult]`.

**Run control:** `--prio`, `--delay 0`, `--numa`, `--no-warmup`, `--progress`, `-rpc`.
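The range syntax listed under **Sweeps** is easiest to read when expanded. The sketch below is an illustrative POSIX shell re-implementation, not llama-bench's own parser: the function name `expand_range` is hypothetical, and it assumes an inclusive end point, a default step of `+1`, and non-negative integers. The real tool may differ in edge cases.

```shell
# Expand a llama-bench style range spec into a space-separated list.
# Accepted forms: first-last, first-last+step, first-last*mult.
# Illustrative only; assumes non-negative integers and an inclusive end.
expand_range() {
    spec=$1
    first=${spec%%-*}          # text before the first '-'
    rest=${spec#*-}            # text after the first '-'
    op='+'
    step=1
    case $rest in
        *+*)  last=${rest%%+*};  step=${rest#*+} ;;
        *\**) last=${rest%%\**}; op='*'; step=${rest#*\*} ;;
        *)    last=$rest ;;
    esac
    out=
    i=$first
    while [ "$i" -le "$last" ]; do
        out="$out${out:+ }$i"
        if [ "$op" = '*' ]; then next=$((i * step)); else next=$((i + step)); fi
        # Stop if the value no longer grows (e.g. 0*2 or a step of 0).
        [ "$next" -gt "$i" ] || break
        i=$next
    done
    printf '%s\n' "$out"
}

expand_range 1-4           # prints: 1 2 3 4
expand_range 0-512+128     # prints: 0 128 256 384 512
expand_range 128-1024*2    # prints: 128 256 512 1024
```

Under these assumptions, a sweep such as `-d 0-512+128` would cover the same depths as the comma list `-d 0,128,256,384,512`.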
**Hugging Face:** `-hf` / `-hff` / `-hft`. Default model: `models/7B/ggml-model-q4_0.gguf`.

**Output:** `-o {csv,json,jsonl,md,sql}` (default `md`). Sample markdown row:

```
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | -1 | tg 128 | 132.19 ± 0.55 |
```

## How to Use

```
./llama-bench -m models/7B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
./llama-bench -ngl 10,20,30,31,32,33,34,35
./llama-bench -d 0,512
```

The first command sweeps text-generation length, the second sweeps GPU layer offload, and the third sweeps KV-cache prefill depth.

## Related Entities

- [[build-and-backends]]: `-dev`, `-ngl`, and `-sm` select which backend/devices to benchmark.
- [[quantization]]: benchmarks commonly compare quantized variants (e.g. the `Q4_0` sample row above).

---
title: "llama-cli"
type: entity
tags: [llama-cli, developer, intermediate, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/cli-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# llama-cli

## Overview

`llama-cli` (in `tools/cli`) is llama.cpp's primary inference command-line tool. It handles both one-shot prompting and interactive chat against a GGUF model. Its `--help` text is auto-generated by `llama-gen-docs` and is grouped into Common, Sampling, and CLI-specific sections.

## Characteristics

- **Two modes:** one-shot completion and interactive conversation/chat.
- **Conversation flags:** `-cnv` / `--conversation` (auto-enabled when the model has a chat template, and turns on interactive mode), `-no-cnv` to disable, `-st` / `--single-turn`, `-r` / `--reverse-prompt`, `-sys` / `--system-prompt` (plus `-sysf` for a file), `-mli` / `--multiline-input`.
- **Generation / offload flags (shared with the server):** `-n` / `--predict` (`-1` = infinite), `-c` / `--ctx-size` (`0` = from model), `-b 2048`, `-ub 512`, `-ngl` / `--n-gpu-layers` (auto / number / all), `-sm` / `--split-mode {none,layer,row,tensor}` (default `layer`), `-mg` / `--main-gpu` (`0`), `-fa` / `--flash-attn` (auto), `-dev` / `--device`, `--list-devices`, `-ts` / `--tensor-split`. - **Grammar / structured output:** `--grammar`, `--grammar-file`, `-j` / `--json-schema`, `-jf`. - **Multimodal:** `-mm` / `--mmproj`, `--image`, `--audio`. - **LoRA / control vectors:** `--lora`, `--lora-scaled FNAME:SCALE`. - **Templating:** `--jinja` (on), `--chat-template` (large built-in list), `--reasoning-format`, `-rea` / `--reasoning`, `--reasoning-budget`. - **Quick presets:** `--gpt-oss-20b-default`, `--gpt-oss-120b-default`, `--vision-gemma-4b/12b-default`, `--spec-default`. ## How to Use One-shot prompt: ``` llama-cli -m model.gguf -p "prompt" llama-cli -m model.gguf -p "Once upon a time" ``` Conversation / chat: ``` llama-cli -m model.gguf -cnv ``` Run a model straight from Hugging Face: ``` llama-cli -hf ggml-org/gemma-3-1b-it-GGUF ``` ## Related Entities - [[sampling-parameters]] — the Sampling group of flags this tool exposes. - [[kv-cache-and-context]] — context size and KV cache flags above. - [[speculative-decoding]] — `--spec-*` / `--spec-default` drafting support. - [[gbnf-grammars]] — the `--grammar` / `--json-schema` options. - [[multimodal-mtmd]] — the `-mm` / `--image` / `--audio` options. --- title: "llama-quantize (binary)" type: entity tags: [quantization, llama-quantize, developer, gguf] created: 2026-05-30 updated: 2026-05-30 sources: [raw/quantize-readme.md] confidence: high llama_build: "master (~2026-05)" --- # llama-quantize (binary) ## Overview `llama-quantize` is the tool that converts a high-precision GGUF (f16/f32) into a quantized GGUF. 
It is the program that actually performs [[concepts/quantization]], reading a source [[concepts/gguf-format]] file and writing a smaller one in a chosen `ggml_type`. Its source lives in `tools/quantize`.

## Characteristics

- Input is a high-precision GGUF; output is a quantized GGUF.
- The target type is case-insensitive.
- It can optionally apply an importance matrix from [[entities/binary-imatrix]] to preserve accuracy at low bit widths.
- A special type `COPY` is supported (copies tensors without requantizing).

**Signature:**

```
./llama-quantize [options] input-f32.gguf [output.gguf] type [nthreads]
```

**Key flags:**

| Flag | Purpose |
| --- | --- |
| `--imatrix FILE` | Apply an importance matrix |
| `--allow-requantize` | Permit requantizing already-quantized weights |
| `--leave-output-tensor` | Leave the output tensor unquantized |
| `--pure` | Disable k-quant mixtures (uniform type) |
| `--output-tensor-type` | Force the output tensor type |
| `--token-embedding-type` | Force the token embedding type |
| `--keep-split` | Preserve sharded layout |
| `--include-weights` / `--exclude-weights` | Apply the importance matrix only to, or to all except, the named tensors |
| `--tensor-type` | Per-tensor regex override (repeatable) |
| `--prune-layers` | Drop layers |
| `--override-kv` | Override a metadata key-value |

## How to Use

Basic quantization to `Q4_K_M`:

```
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
```

Per-tensor regex override (here forcing the `attn_k` tensors of odd-numbered layers, whose index ends in 1, 3, 5, 7, or 9, to `q5_k`):

```
--tensor-type "\.(\d*[13579])\.attn_k=q5_k"
```

Override a metadata key-value:

```
--override-kv qwen3moe.expert_used_count=int:16
```

To improve low-bit results, first produce an importance matrix with [[entities/binary-imatrix]] and pass it via `--imatrix FILE`.

## Related Entities

- [[entities/binary-imatrix]]: produces the imatrix that `llama-quantize` consumes via `--imatrix`.
- [[concepts/quantization]]: the concept this tool implements.
- [[concepts/gguf-format]]: the input and output container format.
- [[concepts/imatrix]]: the importance data applied during quantization.

---
title: "llama-server (binary)"
type: entity
tags: [llama-server, server-api, deployment, binary, well-established, developer, foundational]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/server-readme.md]
confidence: high
llama_build: "master (~2026-05)"
---

# llama-server (binary)

## Overview

`llama-server` is a pure C/C++ HTTP server that exposes llama.cpp inference over REST. It is built on [cpp-httplib](https://github.com/yhirose/cpp-httplib) for HTTP and [nlohmann::json](https://github.com/nlohmann/json) for JSON, and serves **OpenAI-compatible** and **Anthropic-compatible** endpoints alongside llama.cpp's own native endpoints. It is the standard way to deploy a llama.cpp model as a network service. The full endpoint and request surface it provides is documented under [[server-api]].

## Characteristics

- **Single self-contained binary:** no Python runtime; HTTP and JSON are built in.
- **Default bind address** `127.0.0.1:8080`.
- **Three endpoint families** (native, OpenAI-compatible, and Anthropic-compatible) served from one process. See [[server-api]] for the list.
- **Concurrency** via parallel **slots** (`-np N`) plus **continuous batching** (on by default), which interleaves multiple requests' tokens.
- **Prompt caching** (on by default) and per-request KV-state save/restore.
- **Router / multi-model mode** when launched with no `-m`: requests are routed by the `"model"` JSON field or the `?model=` query parameter.
- **Built-in web UI** (`--ui` / `--no-ui`) and optional **Prometheus metrics** (`--metrics`, off by default).
- **GPU acceleration** via offloaded layers (`-ngl`); see [[backend-cuda]] and [[build-and-backends]].
- Configurable by flags or by mirrored `LLAMA_ARG_*` environment variables; when both are set for the same parameter, the command-line flag takes precedence over the environment variable.
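The precedence between flags and `LLAMA_ARG_*` variables can be pictured with a small shell sketch. This is a model of the rule only, not `llama-server`'s argument parser: the helper `resolve_setting` is hypothetical, and `LLAMA_ARG_CTX_SIZE` is assumed here to be the environment mirror of `-c` / `--ctx-size` (check `llama-server --help` on your build for the exact variable names).

```shell
# Model of the precedence rule: an explicit flag value wins, then the
# environment variable, then the built-in default.
# Hypothetical helper; not part of llama.cpp.
resolve_setting() {
    flag_value=$1
    env_value=$2
    default_value=$3
    if [ -n "$flag_value" ]; then
        printf '%s\n' "$flag_value"
    elif [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
    else
        printf '%s\n' "$default_value"
    fi
}

LLAMA_ARG_CTX_SIZE=4096    # assumed variable name, see lead-in

resolve_setting "2048" "$LLAMA_ARG_CTX_SIZE" "0"   # flag given: prints 2048
resolve_setting ""     "$LLAMA_ARG_CTX_SIZE" "0"   # env only:   prints 4096
resolve_setting ""     ""                    "0"   # neither:    prints 0
```

In practice this means an environment variable baked into a container image acts as a default that a flag on the command line can still override.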
## How to Use

Build from source:

```sh
cmake -B build && cmake --build build --config Release -t llama-server
```

(Add `-DLLAMA_OPENSSL=ON` to enable SSL/HTTPS support.)

Run a local model:

```sh
./llama-server -m models/7B/ggml-model.gguf -c 2048
```

Run a model straight from Hugging Face (default quant `Q4_K_M`):

```sh
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

Run in Docker with CUDA:

```sh
docker run -p 8080:8080 -v /path/to/models:/models --gpus all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```

(Docker images: `ghcr.io/ggml-org/llama.cpp:server` and `:server-cuda`.)

Send a request:

```sh
curl --request POST --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt":"Building a website can be done in 10 simple steps:","n_predict":128}'
```

Common flags: `-m`/`--model`, `-c`/`--ctx-size`, `-ngl`/`--n-gpu-layers` (default `auto`), `-np`/`--parallel`, `-fa`/`--flash-attn`, `--host`/`--port`, `--api-key`, `--jinja`, `--embedding`. The full flag table lives in [[server-api]].

## Related Entities

- [[project-llama-cpp]]: the parent project
- [[binary-llama-cli]]: the command-line inference tool
- [[backend-cuda]]: GPU backend for layer offload
- [[server-api]]: the REST API this binary implements
- [[build-and-backends]]: building and selecting compute backends

---
title: "llama-mtmd-cli"
type: entity
tags: [multimodal, vision, audio, llama-cli, experimental, developer, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: [raw/mtmd-readme.md]
confidence: medium
llama_build: "master (~2026-05)"
---

# llama-mtmd-cli

## Overview

`llama-mtmd-cli` (also called `mtmd-cli`) is the unified multimodal command-line tool in llama.cpp, powered by the `libmtmd` library. It replaces the legacy model-specific vision CLIs (`llava-cli`, `qwen2vl-cli`, `minicpmv-cli`, `gemma3-cli`) with a single tool.
It loads a text model plus an **mmproj** projector (the multimodal projector that turns images or audio into model-readable embeddings) and accepts image input (stable) and audio input (experimental). For the underlying concept, see [[multimodal-mtmd]]. ## Characteristics - Built on `libmtmd`; one tool for all supported multimodal model families. - Requires two GGUF files at runtime: the language model and a matching mmproj projector. - Image input is stable; audio input is experimental. - Part of llama.cpp's rapidly changing multimodal stack — binary names and flags are point-in-time for master (~2026-05). Key flags: - `-hf ` — pull a pre-quantized model, including its mmproj. - `-m model.gguf --mmproj file.gguf` — supply the model and projector explicitly. - `--no-mmproj` — disable multimodal on an `-hf` model. - `--mmproj local.gguf` — use a custom projector. - `--no-mmproj-offload` — keep the projector on the CPU instead of the GPU. - `-c 8192` — a large context, needed by some models. ## How to Use Verbatim examples from the docs: ``` llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF ``` ``` llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf ``` ``` llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload ``` (The `llama-server` examples show the same model/projector pairing served over HTTP rather than on the CLI.) To produce your own projector, convert with `convert_hf_to_gguf.py --mmproj`. Legacy conversion scripts live under `tools/mtmd/legacy-models`. 
## Related Entities

- [[multimodal-mtmd]]
- [[binary-llama-cli]]
- [[project-llama-cpp]]

---
title: "ggml (Tensor Library)"
type: entity
tags: [ggml, foundational, well-established, developer, cpu, cuda, metal, vulkan]
created: 2026-05-30
updated: 2026-05-30
sources: [llamacpp-readme, gguf-spec]
confidence: high
llama_build: "master (~2026-05)"
---

# ggml (Tensor Library)

## Overview

ggml is a **tensor library written in C** by Georgi Gerganov: a minimal, dependency-free machine-learning library built for efficient model *inference*. It is the low-level **compute engine** that sits underneath [[entities/project-llama-cpp|llama.cpp]]. Where llama.cpp handles model architectures, tokenizers, sampling, and the user-facing binaries, ggml does the actual math: defining tensors, building the compute graph, executing operations on each hardware backend, and providing the integer **quantization** ([[concepts/quantization]]) that shrinks models. The name combines the author's initials (GG) with "ML" (machine learning).

llama.cpp is described in its own README as "the main playground for developing features of the ggml library"; the two are co-developed. The same library also powers `whisper.cpp` (speech-to-text) and other inference projects, so ggml is a general-purpose foundation, not Llama-specific.

## Characteristics

- **Plain C, minimal dependencies:** designed to compile and run almost anywhere.
- **The backend system lives here.** ggml implements the compute backends that llama.cpp exposes: CPU (with SIMD such as AVX and NEON), CUDA, Metal, Vulkan, ROCm/HIP, SYCL, and more. See [[concepts/build-and-backends]].
- **Quantization:** ggml defines the quantized tensor types (the `ggml_type` enum: Q4_K, Q8_0, the IQ-series, etc.) used by [[concepts/quantization]]. File-level mixes such as Q4_K_M are built from these per-tensor types.
- **GGUF is ggml's file format.** [[concepts/gguf-format|GGUF]] (named after ggml, and described in its spec as the successor to the earlier GGML, GGMF, and GGJT formats) is the single-file container that stores a model's weights, metadata, and tokenizer for ggml/llama.cpp to load.
The spec actually lives in the `ggml-org/ggml` repo, not the llama.cpp repo. - **Compute graph + memory efficiency** — builds an explicit graph of operations and is optimized for low memory use (mmap loading, no heavy framework runtime). ## How to Use Most users never touch ggml directly — they use it **through** [[entities/project-llama-cpp|llama.cpp]] (or a wrapper like Ollama/LM Studio). You encounter ggml in three practical ways: - **Choosing a backend** at build time selects which ggml backend is compiled in (`-DGGML_CUDA=ON`, `-DGGML_METAL`, `-DGGML_VULKAN=ON`, …) — these are *ggml* flags. See [[concepts/build-and-backends]]. - **As a model format** — every GGUF file you download is a ggml-format container ([[concepts/gguf-format]]). - **As a C library** — developers can build directly against ggml's C API for custom inference; the `ggml-org/ggml` repo is the standalone home of the library. > Affiliation note (confidence: medium): in 2026, Georgi Gerganov's **ggml.ai** joined **Hugging Face** (per a llama.cpp Discussions announcement). ggml and llama.cpp remain open-source under ggml-org. ## Related Entities - [[entities/project-llama-cpp]] — the inference project built on ggml - [[concepts/gguf-format]] — ggml's model file format - [[concepts/quantization]] — the quantized tensor types ggml defines - [[concepts/build-and-backends]] — the hardware backends ggml implements --- title: "llama.cpp (Project)" type: entity tags: [foundational, well-established, developer, deployment] created: 2026-05-30 updated: 2026-05-30 sources: [llamacpp-readme, docs-install] confidence: high llama_build: "master (~2026-05)" --- # llama.cpp (Project) ## Overview llama.cpp is **"LLM inference in C/C++"** by ggml-org (Georgi Gerganov), released under the **MIT license**. Its goal is to provide minimal-setup, state-of-the-art LLM inference across a wide range of hardware, both locally and in the cloud. 
It is also the main playground for developing features of the [[entities/ggml|ggml]] tensor library. Models use the GGUF format ([[concepts/gguf-format]]). > **Name clarifier — three different things that all say "llama":** > - **Llama** (the *models*: Llama 2, Llama 3, …) → made by **Meta**. > - **llama.cpp** (this project, the *engine*) → by **Georgi Gerganov / ggml-org**, *not* Meta. Named for the models it was first written to run. > - **Ollama** (a popular *app*) → a **separate company** (not Meta), and itself a wrapper built on llama.cpp/[[entities/ggml|ggml]] (see [[syntheses/llamacpp-vs-ollama]]). > > So: Meta makes the *models*; llama.cpp is the independent *engine* that runs them; Ollama/LM Studio/KoboldCpp/etc. are *apps* built on that engine. The shared word "llama" is the only connection. ## Characteristics Headline features: - Plain C/C++ with no dependencies. - Apple Silicon as a first-class target (ARM NEON, Accelerate, Metal). - x86 AVX/AVX2/AVX512/AMX support. - RISC-V SIMD support. - 1.5-bit to 8-bit integer quantization ([[concepts/quantization]]). - Custom CUDA kernels, plus HIP for AMD and MUSA for Moore Threads. - Vulkan and SYCL backends. - CPU+GPU hybrid inference for models larger than available VRAM. Supported model families (examples): LLaMA 1/2/3, Mistral / Mixtral (MoE), Qwen, DeepSeek, Gemma, Phi / PhiMoE, Falcon, Command-R, Granite, GLM-4, Mamba, RWKV-6/7, Grok-1, Hunyuan, LFM2, and gpt-oss (native MXFP4). Multimodal families include LLaVA, MiniCPM, Moondream, Qwen2-VL, GLM-EDGE, and LFM2-VL (see [[concepts/multimodal-mtmd]]). Bundled binaries and tools: llama-cli ([[entities/binary-llama-cli]]), llama-server ([[entities/binary-llama-server]]), llama-quantize ([[entities/binary-llama-quantize]]), llama-bench ([[entities/binary-llama-bench]]), llama-perplexity, llama-imatrix ([[entities/binary-imatrix]]), llama-mtmd-cli ([[entities/binary-mtmd]]), and llama-simple. 
It also ships the Python conversion script `convert_hf_to_gguf.py` and an XCFramework precompiled library for iOS/visionOS/tvOS/macOS Swift. ## How to Use Install methods: - Windows: `winget install llama.cpp` - macOS + Linux: `brew install llama.cpp` - MacPorts: `sudo port install llama.cpp` - Nix: `nix profile install nixpkgs#llama-cpp` - Docker images: `ghcr.io/ggml-org/llama.cpp` Or build from source (see [[concepts/build-and-backends]]). > Ecosystem note (downstream/community, not yet in this KB): bindings exist for Python/Go/Node/Rust/C#/Java/Swift; UIs include Ollama, LM Studio, Jan, KoboldCpp, GPT4All, and llamafile; infrastructure includes Paddler, GPUStack, and llama-swap. ## Related Entities - [[entities/ggml]] — the tensor library llama.cpp is built on - [[entities/binary-llama-cli]] - [[entities/binary-llama-server]] - [[entities/binary-llama-quantize]] - [[entities/binary-llama-bench]] - [[entities/binary-mtmd]] - [[concepts/build-and-backends]] - [[concepts/gguf-format]] - [[concepts/quantization]] --- title: "Activity Log" type: log --- # Activity Log Append-only log of KB changes. --- ## 2026-05-30 — KB scaffolded **Triggered by:** user creating a llama.cpp knowledge base as the research backbone for a YouTube video on llama.cpp. Modeled on the `llm-wiki` template (Karpathy's "LLM Wiki" pattern), customized for the llama.cpp domain. **Added:** - `CLAUDE.md` — schema tailored to llama.cpp (directory layout, page format, llama.cpp tagging taxonomy, build-tag version awareness, ingest/query/lint workflows). - `README.md` - `wiki/index.md` — empty master catalog with planned starter pages listed. - `wiki/log.md` — this file. - `wiki/journal/template.md` — research-session note template. - Empty subdirs: `raw/`, `wiki/summaries/`, `wiki/concepts/`, `wiki/entities/`, `wiki/syntheses/`, `wiki/presentations/`. **Next:** drop llama.cpp source material into `raw/` (official docs mirrors, video transcripts, discussion/PR dumps) and run "ingest". 
--- ## 2026-05-30 — seeded raw/ with official docs (13 mirrors) **Triggered by:** user request to fetch the official llama.cpp docs and seed `raw/`. Verbatim mirrors pulled from `ggml-org/llama.cpp@master` (and the GGUF spec from `ggml-org/ggml@master`) on 2026-05-30. **Added raw sources (13):** - `llamacpp-readme.md` — main repo README (overview, features, supported models/backends) - `docs-build.md` — full build guide + backend matrix (CPU/CUDA/Metal/Vulkan/ROCm/SYCL/MUSA/CANN) - `docs-install.md` — package-manager install methods - `server-readme.md` — `llama-server` docs + OpenAI-compatible HTTP API (largest source, ~92KB) - `cli-readme.md` — `llama-cli` (tools/cli) docs and flags - `quantize-readme.md` — `llama-quantize` + quant type table - `imatrix-readme.md` — importance-matrix generation for quantization - `llama-bench-readme.md` — `llama-bench` benchmarking tool - `grammars-readme.md` — GBNF grammar syntax + structured output - `docs-function-calling.md` — tool/function calling support - `docs-multimodal.md` — multimodal overview - `mtmd-readme.md` — `mtmd` multimodal CLI/lib - `gguf-spec.md` — GGUF file-format specification (from ggml repo) **Note:** these are immutable mirrors — do not edit. Build tag at fetch time not pinned (master). When ingesting, record an approximate `llama_build` from the current release tag. **Next:** run "ingest" to generate summary + concept/entity pages from these sources. --- ## 2026-05-30 — ingested all 13 official-doc sources **Triggered by:** user request to ingest all 13 seeded raw sources. Done in two phases (parallel summarizers → parallel page authors) with a shared canonical slug list for consistent cross-links. **Added wiki pages (37 total):** - **Summaries (13)** — one per raw source: `llamacpp-readme`, `docs-build`, `docs-install`, `server-readme`, `cli-readme`, `quantize-readme`, `imatrix-readme`, `llama-bench-readme`, `grammars-readme`, `docs-function-calling`, `docs-multimodal`, `mtmd-readme`, `gguf-spec`. 
- **Concepts (12)** — `gguf-format`, `quantization`, `imatrix`, `sampling-parameters`, `kv-cache-and-context`, `speculative-decoding`, `gbnf-grammars`, `function-calling`, `embeddings`, `multimodal-mtmd`, `server-api`, `build-and-backends`. - **Entities (12)** — `project-llama-cpp`; binaries `binary-llama-cli`, `binary-llama-server`, `binary-llama-quantize`, `binary-imatrix`, `binary-llama-bench`, `binary-mtmd`; backends `backend-cpu`, `backend-cuda`, `backend-metal`, `backend-vulkan`, `backend-rocm`. **Updated:** `index.md` — fully populated master catalog (page count 0 → 37); this log. **Caveats recorded on pages (per CLAUDE.md version-awareness):** - All pages set `llama_build: "master (~2026-05)"` — sources were master, not a pinned `b####`. - Doc-vs-code lags flagged inline: GGUF spec `general.file_type` enum stale; speculative-decoding flags reworked (legacy `--draft*` removed); function-calling doc self-TODO; multimodal roster "under heavy development." - Real doc inconsistency captured: server `/completion` `repeat_penalty` default 1.1 vs CLI `--repeat-penalty` 1.00. **Not yet created (deferred):** backend pages for SYCL/MUSA/CANN/OpenCL/RPC/etc. (present in the build-and-backends matrix); all syntheses (need a community/benchmark source pass); `llama-perplexity` / `llama-simple` / conversion-script entities. **Next:** add community/benchmark sources, then build the planned syntheses (`quant-types-compared`, `backend-selection-guide`, `llamacpp-vs-ollama`). --- ## 2026-05-30 — community sweep: ingested 15 community sources + 5 syntheses **Triggered by:** user request to sweep popular community sources. Ran a 4-agent web discovery sweep (~28 verified-live sources found), user chose breadth "primary + best secondary" (~15) and synthesis focus = quant selection, vs Ollama/vLLM, deployment/OpenAI API, and a custom "customizable features / tuning" page. Fetched in 5 parallel agents (mirror + summary each), then 3 parallel agents authored the syntheses. 
**Added raw mirrors (15)** under `raw/community/` — markdown-converted mirrors with provenance headers (NOT pristine): - quant: `community-pr1684-kquants`, `community-artefact2-quant-table`, `community-arxiv-quant-eval`, `community-kaitchup-gguf-guide`, `community-bartowski-quant-guide`, `community-mradermacher-imatrix`, `community-unsloth-dynamic-ggufs` - benchmarks: `community-bench-apple-silicon` (GH#4167), `community-bench-nvidia-cuda` (GH#15013), `community-smcleod-kv-quant`, `community-dgxspark-kv-quant` - comparisons/guides: `community-redhat-vllm-vs-llamacpp`, `community-gh15180-vllm-vs-llamacpp`, `community-steelphoenix-guide`, `community-hf-gguf-usage` **Added wiki pages (20):** - **Summaries (15)** — one per community source above (all `confidence: medium`, dated). - **Syntheses (5)** — `quant-types-compared`, `llamacpp-vs-ollama`, `llamacpp-vs-vllm`, `server-deployment`, `customization-and-tuning`. **Updated:** `index.md` (page count 37 → 57; two source tiers; community caveats section); this log. **Honesty/provenance notes recorded on pages:** - Reddit unfetchable by tooling — no r/LocalLLaMA threads fabricated. - KV-quant conflict (smcleod vs DGX Spark) captured and reconciled → q8_0 safe default. - Vendor bias flagged: Unsloth (Dynamic 2.0), Red Hat (pro-vLLM). - No dedicated Ollama source mirrored — `llamacpp-vs-ollama` rests on the wrapper relationship + community consensus (noted on the page). - Partial fetches: SteelPh0enix (long-form prose refused → structured technical extraction); Kaitchup (partial paywall — taxonomy only, no number tables); arXiv tables from HTML extraction (verify vs PDF). - Absolute perplexity scales differ across quant sources (LLaMA-1-7B vs Llama-3.1-8B) — only relative ordering transfers. **Deferred:** dedicated benchmark/`backend-selection-guide` synthesis (data is ingested; user deprioritized); a mirrored Ollama-vs-llama.cpp benchmark source; Reddit threads (user can paste). 
--- ## 2026-05-30 — drafted standalone video outline **Triggered by:** user wants a standalone "what is llama.cpp" video — what it is, what makes it different from other local servers, customization/flags, ways to serve, and a head-to-head test vs Ollama (memory/speed). (NOT a quantization module.) **Added (1):** - `presentations/standalone-llamacpp-explainer-outline.md` — 14–20 min, 5 segments matching the brief + cold open/outro. Grounded in project-llama-cpp, llamacpp-vs-ollama, llamacpp-vs-vllm, customization-and-tuning, server-deployment, server-api. Includes on-screen demos with verbatim commands, a **fair head-to-head methodology** (controls: same GGUF/quant, ctx, -ngl, flash-attn, KV type; metrics table; what to measure with `llama-bench` vs `ollama --verbose`), honest expectation-setting (Ollama wraps llama.cpp → expect ~few-% speed gap, story is footprint/control), pull-quotes, and a pre-record flag-verification checklist. **Updated:** `index.md` (Presentations 0 → 1; page count 57 → 58); this log. **Honesty notes baked into the outline:** don't oversell a speed gap (same engine); no mirrored Ollama benchmark source exists so the test is "your hardware, this build"; flags are master (~2026-05) → pre-record checklist added. **Next:** user records/measures the live test; could then write back real numbers as a community benchmark source + a `llamacpp-vs-ollama` data update. --- ## 2026-05-30 — added ggml page + name clarifier + Ollama nuance **Triggered by:** user fact-checked the "wrappers built on llama.cpp" claim (confused Meta's Llama models with llama.cpp/Ollama) and asked what ggml is. Verified the wrapping claim via web search (Ollama/LM Studio/Jan/KoboldCpp/GPT4All/llamafile all build on llama.cpp/ggml; none are Meta) — claim holds, no correction needed. 
**Added (1 entity):** - `entities/ggml.md` — the C tensor library by Georgi Gerganov that llama.cpp is built on (compute graph, backends, quantization; GGUF is its format; also powers whisper.cpp; ggml.ai joined Hugging Face in 2026, confidence medium). Was a long-standing dangling reference across the KB — now a real page. **Updated:** - `entities/project-llama-cpp.md` — added a **"Name clarifier"** callout (Meta = the *models*; llama.cpp = independent *engine*; Ollama = *separate company* wrapper) + ggml backlink in Related Entities. - `syntheses/llamacpp-vs-ollama.md` — added the **2026 nuance**: Ollama now has its own model-loading engine for some architectures but still built on ggml; "only a thin shell over llama.cpp" is now slightly overstated. - `presentations/standalone-llamacpp-explainer-outline.md` — added an **on-screen beat** in Segment 1 clearing up the three-things-named-llama confusion + ggml name-drop. - `index.md` — entities 12 → 13, total 58 → 59. **Sources for verification:** llama.cpp Wikipedia; ggml-org/llama.cpp README; SitePoint & Starmorph local-LLM tool guides. --- ## 2026-05-30 — built slide deck for the explainer video **Triggered by:** user wants a slideshow for the informational beats of the standalone outline (install/terminal demos done live), themed to the llama.cpp README head image. **Palette:** sampled the actual README head image (downloaded, viewed) — charcoal card `#1e2228` / page `#16191d`, white wordmark, orange flame/`C++` accent `#f0883e` (lighter `#f7a85a`, deep `#e2702a`), gray text `#9aa4ae`, mono for code. Recreated the "LLaMA C++" wordmark in CSS (no external image → fully portable single file). 
**Added (1 asset):** - `presentations/standalone-llamacpp-explainer-slides.html` — self-contained, dependency-free, keyboard-navigable 15-slide deck (title, hook, what-it-is, name clarifier, the stack, what-is-ggml, why-use-directly, vs vLLM, the six knob families, ways-to-serve, test setup, scoreboard, what-you'll-find, decision, outro). Progress bar, slide counter, click/arrow/F-fullscreen nav, print-to-PDF friendly. Content mirrors the outline's informational segments; install/curl/docker commands intentionally omitted (live terminal). **Updated:** `index.md` (presentations entry); this log. (HTML asset, not counted in the .md page total of 59.) --- ## 2026-05-30 — expanded the customization segment in the slide deck **Triggered by:** user wants the customization segment (the video's highlight) given real depth — break the single six-card grid into a slide per family. **Changed `presentations/standalone-llamacpp-explainer-slides.html`:** the one customization slide became **8** — a quick "map" grid, then a dedicated deep-dive per family (Sampling / Structured output / Tool calling / Context & memory / Speed / Hardware), then a "change one knob, measure" closer. Each deep-dive uses a new two-column layout: left = the key flags (orange mono) with what each does + defaults; right = a "Why it matters" card + the real trade-off/pitfall. Flags/defaults grounded in the concept pages. Deck grew 15 → 22 slides. Added CSS (`.sub`, `.split`, `.flaglist`, `.fl`, `.explain`). Verified render of the Sampling and Tool-calling slides (no overflow). **Updated:** `index.md` (slide count + deep-dive note); the outline's Segment 3 (note pointing to the per-family slides); this log. --- ## 2026-05-30 — wrote two separate recording scripts (slides + demos) **Triggered by:** user wants distinct scripts for the slide voiceover and the non-slide live terminal demos (recorded in separate passes). 
**Added (2):** - `presentations/standalone-llamacpp-explainer-script-slides.md` — VO narration for all 22 deck slides, with on-screen cues, rough timings, and `→ DEMO n` cut markers. - `presentations/standalone-llamacpp-explainer-script-demos.md` — runbook for the 4 live terminal segments: (1) install + first run, (2) customization in action — structured output + KV-cache memory, (3) serve as OpenAI API — server/Web UI/curl/Python/Docker, (4) the fair Ollama head-to-head (same GGUF via Modelfile, llama-bench vs `ollama run --verbose`/`ollama ps`, scoreboard). Each demo block has exact commands, a `SAY:`/`POINT:` talk track, `⚠︎` gotchas/fallbacks, a prep/shopping list, and transitions. Both share one running order: `S1–S3 → DEMO1 → S4–S9 → S10–S16 → DEMO2 → S17 → DEMO3 → S18 → DEMO4 → S19–S22`. Commands grounded in the server/cli/bench docs + HF GGUF usage; flagged master (~2026-05). **Updated:** `index.md` (Presentations 1 → 4 .md pages; both scripts listed); this log. --- ## 2026-05-31 — pre-record accuracy audit of the deck + both scripts **Triggered by:** user asked for a final accuracy sanity check before recording. Ran 3 parallel verifiers: (A) llama.cpp flags/commands vs the cli/server/bench/grammars mirrors; (B) Ollama commands vs live Ollama docs; (C) conceptual claims vs KB pages + web. **Result:** no conceptual errors (license, ggml/whisper.cpp, wrapper relationship, vLLM framing, quant/KV/spec-decoding, Raspberry Pi — all confirmed). One factual bug + several stale-default / over-stated-as-fact tightenings. **Fixes applied (deck HTML + both scripts):** - ❌→✅ **`/bye`** removed from DEMO 1 — it's Ollama's exit, not llama-cli's (confirmed: zero `/bye` in official cli README). Now `Ctrl-C` / `Ctrl-D`, with a note that `/bye` is Ollama-only. - `-fa 1` → **`-fa on`** (flag now takes `on|off|auto`, default `auto`); softened the "KV quant *requires* -fa" gotcha to "generally requires (esp. V-cache), verify on your build." - `--jinja` reframed from "Required. 
Turns on…" to "**Enables** tool calling + chat templates (**on by default now**)" — default changed to enabled (deck slide 12 + script S12). - Speed parity **~2–8% → "usually within a few percent… we'll test it ourselves"** (slide 7 note + script S7) — it's community consensus, not a benchmarked source, and DEMO 4 tests it. - "you're running llama.cpp underneath" → **"llama.cpp / ggml underneath"** (slide 21 + S21) — Ollama now has its own ggml-based engine for some models. - GPT4All card "llama.cpp backend" → "GGUF via llama.cpp" (weakest of the six wrappers); sampling VO "penalties/DRY stop repetition" → "…when you turn them on" (both off by default). - Added DEMO 4 note: KV is f16 on both sides by default (so the bench matches); set `OLLAMA_KV_CACHE_TYPE=q8_0` only if quantizing KV. **Verified correct, left as-is:** all sampling defaults (temp 0.8 / top-k 40 / top-p 0.95 / min-p 0.05), `-c 0=from model`, `-ctk/-ctv q8_0`, `--rope-scaling/--yarn-*`, `-np/-cb`, `--spec-type` strategy names, `-sm/-ts/-cmoe`, server bind 127.0.0.1:8080, `/v1/chat/completions`, `--api-key`/`LLAMA_API_KEY`, Web UI at `/`, Docker `:server`/`:server-cuda`, llama-bench `-p`(pp)/`-n`(tg) excl. tokenization. Ollama side fully confirmed against docs.ollama.com (Modelfile `FROM ./*.gguf`, `num_ctx`, `ollama create -f`, `OLLAMA_FLASH_ATTENTION=1`, `run --verbose` rate fields, `ps` PROCESSOR/GPU%, `show` quant). Tool-calling specifics (native template list, `parallel_tool_calls` default-off, q4_0-KV-degrades-tools) trace to `raw/docs-function-calling.md` — sourced & correct. **For live verification on the recording machine (docs don't pin these):** exact `-fa` value form, whether `-ngl` even needs setting (default now `auto`), and current `--spec-*` flag spellings — all already in the outline's pre-record checklist. 
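The live flag check called for in this entry can be scripted. What follows is a minimal sketch, assuming the help text has been captured from `llama-cli --help` on the recording machine. The helper name `find_missing_flags`, the `SAMPLE_HELP` text, and the `CHECKLIST` contents are hypothetical illustrations, not real tool output and not part of llama.cpp.

```python
import re


def find_missing_flags(help_text: str, flags: list[str]) -> list[str]:
    """Return the flags that do not appear as whole tokens in help_text."""
    missing = []
    for flag in flags:
        # Require that the flag is not glued to other word or dash characters,
        # so "-fa" does not match inside a longer token such as "-fast".
        pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
        if re.search(pattern, help_text) is None:
            missing.append(flag)
    return missing


# Made-up stand-in for captured help output (illustrative only).
SAMPLE_HELP = """\
-ngl, --gpu-layers, --n-gpu-layers N
-fa, --flash-attn [on|off|auto]
--spec-type TYPE
--spec-draft-n-max N
"""

# Flags the scripts rely on, plus one spelling this log records as removed.
CHECKLIST = ["-fa", "-ngl", "--spec-type", "--draft-max"]

print(find_missing_flags(SAMPLE_HELP, CHECKLIST))
```

On the recording machine, the sample text would be replaced by the real help output, for example via `subprocess.run(["llama-cli", "--help"], capture_output=True, text=True)` with stdout and stderr concatenated. Any flag reported missing should be re-checked by hand before it goes on screen.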
--- ## 2026-06-10 — removed Obsidian scaffolding from the served wiki Deleted `analytics.md`, `dashboard.md`, `flashcards.md` (Obsidian plugin pages — Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the `journal/` scaffold (template only). The 4 video-production files in `presentations/` moved to repo root (not served); index count 59 → 58. `CLAUDE.md` directory layout updated: production/planning material lives at repo root, never under `wiki/` (everything under `wiki/` is served publicly). --- title: "llama-cli — Usage & Parameters Reference" type: summary tags: [llama-cli, sampling, kv-cache, context, well-established, developer, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/cli-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # llama-cli — Usage & Parameters Reference ## Key Points - Source is the `--help` reference for `llama-cli` (tools/cli), auto-generated by `llama-gen-docs` and grouped into Common params, Sampling params, and CLI-specific params. - Basic invocation: `llama-cli -m model.gguf -p "..."`. Most flags also have `LLAMA_ARG_*` environment-variable equivalents. - Model loading: `-m, --model FNAME`; `-mu, --model-url`; `-hf, -hfr, --hf-repo <user>/<model>[:quant]` (quant default Q4_K_M, mmproj auto-downloaded; example `ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M`); `-hff, --hf-file`; `-hft, --hf-token` (or `HF_TOKEN`); `-dr, --docker-repo`; `-cl, --cache-list`; `--offline` forces cache use. - Context / generation: `-c, --ctx-size N` (default 0 = loaded from model); `-n, --predict, --n-predict N` (default -1 = infinity); `-b, --batch-size N` (default 2048); `-ub, --ubatch-size N` (default 512); `--keep N` (default 0, -1 = all); `-np, --parallel N` (default 1). 
- GPU / offload: `-ngl, --gpu-layers, --n-gpu-layers N` (default `auto`; accepts exact number, `auto`, or `all`); `-sm, --split-mode {none,layer,row,tensor}` (default layer); `-ts, --tensor-split N0,N1,...`; `-mg, --main-gpu INDEX` (default 0); `-dev, --device`; `--list-devices`; `-cmoe, --cpu-moe` and `-ncmoe, --n-cpu-moe N` (keep MoE weights on CPU); `-fit, --fit [on|off]` (default on) auto-adjusts unset args to fit device memory. - Flash attention: `-fa, --flash-attn [on|off|auto]` (default auto). - KV cache: `-ctk, --cache-type-k TYPE` / `-ctv, --cache-type-v TYPE` (default f16; allowed f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1); `-kvo/--kv-offload` vs `-nkvo/--no-kv-offload` (default enabled); `-dt, --defrag-thold` (DEPRECATED); `-cram, --cache-ram N` MiB (default 8192); `-ctxcp, --ctx-checkpoints, --swa-checkpoints N` (default 32); `--swa-full`; `--context-shift / --no-context-shift` (default disabled). - RoPE/YaRN context extension: `--rope-scaling {none,linear,yarn}`, `--rope-scale`, `--rope-freq-base`, `--rope-freq-scale`, and YaRN flags `--yarn-orig-ctx`, `--yarn-ext-factor` (default -1.00), `--yarn-attn-factor`, `--yarn-beta-slow`, `--yarn-beta-fast`. - Threads / CPU: `-t, --threads N` (default -1); `-tb, --threads-batch` (default = --threads); CPU affinity `-C/--cpu-mask`, `-Cr/--cpu-range`, `--cpu-strict`, `--prio`, `--poll` (default 50); `--numa {distribute,isolate,numactl}`. 
- Sampling defaults (verbatim): `--temp` 0.80; `--top-k` 40 (0=disabled); `--top-p` 0.95 (1.0=disabled); `--min-p` 0.05 (0.0=disabled); `--typical/--typical-p` 1.00 (disabled); `--top-n-sigma` -1.00 (disabled); `--xtc-probability` 0.00, `--xtc-threshold` 0.10; `--repeat-last-n` 64, `--repeat-penalty` 1.00 (disabled), `--presence-penalty` 0.00, `--frequency-penalty` 0.00; DRY: `--dry-multiplier` 0.00, `--dry-base` 1.75, `--dry-allowed-length` 2, `--dry-penalty-last-n` -1; `--mirostat` 0 (1=v1, 2=v2.0), `--mirostat-lr` 0.10, `--mirostat-ent` 5.00; `--dynatemp-range` 0.00, `--dynatemp-exp` 1.00; `-s, --seed` -1 (random). - Default sampler chain (`--samplers`): `penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature`. Default `--sampling-seq` short form: `edskypmxt`. New adaptive-p sampler: `--adaptive-target` (default -1.00, disabled), `--adaptive-decay` (default 0.90). - Structured output: `--grammar GRAMMAR` (BNF-like, see grammars/ dir); `--grammar-file FNAME`; `-j, --json-schema SCHEMA`; `-jf, --json-schema-file`. Also `-l, --logit-bias`, `--ignore-eos`, `-bs, --backend-sampling` (experimental). - Conversation / interactive: `-cnv, --conversation` / `-no-cnv, --no-conversation` (conversation mode auto-enabled if a chat template is available; also enables interactive mode); `-st, --single-turn`; `-r, --reverse-prompt`; `-mli, --multiline-input`; `-sys, --system-prompt PROMPT` and `-sysf, --system-prompt-file`; `-i`/interactive is implied by `-cnv`. - Prompt input: `-p, --prompt`; `-f, --file`; `-bf, --binary-file`; `-e/--escape` (default true); `--verbose-prompt`; `--display-prompt`/`--no-display-prompt` (default true); `-sp, --special`. - Chat templating: `--jinja/--no-jinja` (default enabled); `--chat-template JINJA_TEMPLATE` (large list of built-in templates incl. 
chatml, llama2/3/4, gemma, mistral-v*, deepseek*, gpt-oss, granite, qwen-family etc.); `--chat-template-file`; `--chat-template-kwargs`; `--reasoning-format`, `-rea/--reasoning`, `--reasoning-budget`. - Multimodal: `-mm, --mmproj FILE` (mtmd projector); `-mmu, --mmproj-url`; `--mmproj-auto/--no-mmproj`; `--image, --audio FILE`; `--image-min-tokens`/`--image-max-tokens`. - Speculative decoding: `--spec-draft-model, -md, --model-draft`; `--spec-draft-hf, -hfd`; `--spec-draft-n-max` (default 3), `--spec-draft-n-min` (default 0); `--spec-type` (none, draft-simple, draft-eagle3, draft-mtp, ngram-*); plus many `--spec-ngram-*` tuning flags. NOTE: `--draft`/`--draft-max` and `--draft-min` have been REMOVED in favor of `--spec-draft-n-max`/`--spec-ngram-mod-n-max` etc. - LoRA & control vectors: `--lora FNAME`, `--lora-scaled FNAME:SCALE`, `--control-vector`, `--control-vector-scaled`, `--control-vector-layer-range`. - Quick-start presets: `--gpt-oss-20b-default`, `--gpt-oss-120b-default`, `--vision-gemma-4b-default`, `--vision-gemma-12b-default`, `--spec-default` (these download weights from the internet). ## Relevant Concepts - [[entities/binary-llama-cli]] — the binary this page documents; this file is its primary flag reference. - [[concepts/sampling-parameters]] — full sampler list, defaults, and default chain order documented here. - [[concepts/kv-cache-and-context]] — ctx-size, cache types, KV offload, context shift, RoPE/YaRN, SWA checkpoints. - [[concepts/gbnf-grammars]] — `--grammar`, `--grammar-file`, `--json-schema` constrained generation. - [[concepts/build-and-backends]] — GPU offload, split-mode, devices, flash-attn, NUMA. 
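The flags summarized under Key Points can be assembled into an invocation programmatically. The sketch below is illustrative only: `build_llama_cli_argv` is a hypothetical helper (not part of llama.cpp), its defaults simply mirror the values recorded on this page, and every flag spelling should be re-checked against `llama-cli --help` on the installed build.

```python
# KV cache types listed as allowed on this page.
ALLOWED_KV_TYPES = {"f32", "f16", "bf16", "q8_0", "q4_0", "q4_1",
                    "iq4_nl", "q5_0", "q5_1"}


def build_llama_cli_argv(model_path: str, prompt: str, *,
                         ctx_size: int = 0,        # 0 = take context size from the model
                         flash_attn: str = "auto",
                         kv_type: str = "f16",
                         temp: float = 0.8,
                         top_k: int = 40,
                         top_p: float = 0.95,
                         min_p: float = 0.05) -> list[str]:
    """Build an argv list for llama-cli using only flags named on this page."""
    if flash_attn not in {"on", "off", "auto"}:
        raise ValueError(f"unsupported flash-attn value: {flash_attn!r}")
    if kv_type not in ALLOWED_KV_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_type!r}")
    return [
        "llama-cli",
        "-m", model_path,
        "-p", prompt,
        "-c", str(ctx_size),
        "-fa", flash_attn,
        "-ctk", kv_type,   # K-cache type
        "-ctv", kv_type,   # V-cache type (kept equal to K here for simplicity)
        "--temp", str(temp),
        "--top-k", str(top_k),
        "--top-p", str(top_p),
        "--min-p", str(min_p),
    ]


argv = build_llama_cli_argv("model.gguf", "Hello", flash_attn="on", kv_type="q8_0")
print(" ".join(argv))
```

Returning a list rather than a shell string means the result can be passed straight to `subprocess.run` without quoting issues. Validating the enumerated values up front catches typos before the model is loaded.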
## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + tools/cli/README.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/README.md --- title: "Artefact2's canonical GGUF quant KL-divergence / PPL / bpw table (Mistral-7B)" type: summary tags: [quantization, imatrix, accuracy, community, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-artefact2-quant-table.md"] confidence: medium llama_build: "n/a (community source; KL table 2024-02-27, ROCm bench 2024-03-15)" source_url: "https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9" --- # Artefact2's canonical GGUF quant KL-divergence / PPL / bpw table (Mistral-7B) ## Key Points - THE reference table the rest of the community (bartowski, mradermacher) links to. Measures every GGUF quant against unquantized Mistral-7B. imatrix from wiki.train (200×512 tokens); KL-div measured on wiki.test. - **Top-level rule of thumb:** "use the largest that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters." (i.e. above ~Q4_K_S, spend bytes on more params, not more bits.) 
- **KL-divergence table:**

  | Quant | bpw | KL median | KL q99 | top-tokens-differ | ln(PPL(Q)/PPL(base)) |
  |---|---|---|---|---|---|
  | IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
  | IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
  | IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
  | IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
  | IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
  | Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
  | Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
  | IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
  | IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
  | Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
  | IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
  | IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
  | Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
  | Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
  | IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
  | IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
  | Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
  | Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
  | Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
  | Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
  | Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |

- **Key reads:** KL-div drops ~170× from IQ1_S (0.5495) to Q6_K (0.0032). IQ-quants beat similarly-sized K-quants: IQ3_S (3.52bpw, KL 0.0205) beats Q3_K_S (3.50bpw, KL 0.0304); IQ2_M (2.76bpw, KL 0.0702) beats Q2_K_S (2.79bpw, KL 0.0829). The "knee" is around Q4: Q4_K_M (4.83bpw) hits KL 0.0075 / ln-PPL 0.0060 — near-lossless. Q5/Q6 gains are marginal (ln-PPL ≈ 0).
- **ROCm throughput table (Mistral-7B, tok/s):** -ngl 99 (full GPU) vs -ngl 0 (CPU). e.g. Q4_0 3.83GiB: pp512 870.44 / tg128 63.42 (GPU); IQ1_S 1.50GiB: tg128 74.85 (GPU); Q8_0 7.17GiB: pp512 881.95 / tg128 39.74; f16 13.49GiB much slower. Q4_0 and Q8_0 have the fastest prompt-processing; tg128 falls as bpw rises.

## Relevant Concepts

- [[concepts/quantization]] — empirical bpw→quality curve underpinning every quant recommendation.
- [[concepts/imatrix]] — table is built with imatrix-weighted quants; demonstrates IQ-beats-K at equal size. - [[concepts/build-and-backends]] — second table is ROCm (AMD) throughput across GPU/CPU offload. - [[entities/binary-llama-quantize]] — produces the quant types benchmarked here. - [[entities/binary-imatrix]] — generates the wiki.train importance matrix used for these quants. ## Source Metadata - Type: community (GitHub Gist) - Author/platform: Artefact2 / GitHub - Date: KL table 2024-02-27; ROCm bench 2024-03-15. FLAG: STALE — Feb/Mar 2024, predates newer quant types (e.g. Q4_K_XL/UD variants) and only covers Mistral-7B; absolute numbers are model-specific but the ranking/shape generalizes. - URL: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9 --- title: "Which Quantization Should I Use? Unified llama.cpp Eval on Llama-3.1-8B-Instruct" type: summary tags: [quantization, imatrix, accuracy, community, advanced] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-arxiv-quant-eval.md"] confidence: medium llama_build: "n/a (community source, 2026-01-11)" source_url: "https://arxiv.org/abs/2601.14277" --- # Which Quantization Should I Use? (Kurt 2026, arXiv 2601.14277) Single-author arXiv preprint (Uygar Kurt, 2026-01-11, cs.LG) benchmarking llama.cpp 3–8 bit K-quant and legacy formats on Llama-3.1-8B-Instruct (FP16 GGUF). 17 pages, 6 tables, 1 figure. Does NOT cover Q2_K or IQ/I-quant formats. ## Key Points - **F16 baseline:** Avg 69.47, perplexity 7.32. (GSM8K 77.63 / HellaSwag 72.51 / IFEval 78.93 / MMLU 63.50 / TruthfulQA-MC2 54.79.) - **Perplexity by quant:** Q3_K_S 8.96, Q3_K_M 7.96, Q3_K_L 7.81, Q4_0 7.74, Q4_1 7.72, Q4_K_S 7.62, Q4_K_M 7.56, Q5_0/Q5_1/Q5_K_S 7.43, Q5_K_M 7.40, Q6_K 7.35, Q8_0 7.33. - **Avg benchmark score:** Q3_K_S 65.49 (sharp drop), Q3_K_M 68.07, Q3_K_L 68.78, Q4_K_M 69.15, Q4_K_S 69.17, Q5_0 69.92 (highest, edges out F16's 69.47 — within noise), Q5_K_M 69.36, Q6_K 69.23, Q8_0 69.41. 
- **Size reduction vs F16:** Q3_K_S −77.23%, Q4_K_M −69.41%, Q5_K_M −64.35%, Q6_K −58.98%, Q8_0 −46.87%. F16 = 15317 MiB; Q4_K_M = 4685 MiB; Q8_0 = 8138 MiB. - **CPU throughput (tok/s):** F16 prefill pp512 = 79.57, decode tg128 = 2.83. Quantized decode is much faster (tg128 4.36–9.91). Notable prefill: Q4_0 pp512 = 97.35 and Q4_K_S = 92.52 (fastest prefill); Q3_K_S has highest decode tg128 = 9.91. - **Quant time (s):** legacy/Q3/Q6/Q8 ~27–32 s; K-quant Q4/Q5 slower (Q4_K_S 42.84, Q4_K_M 42.19, Q5_K_S 37.47). - **Takeaway:** Below ~Q3 (Q3_K_S) accuracy degrades sharply (avg 65.49, ppl 8.96); Q4_K and up stay within ~0.5 avg points of F16; Q6_K/Q8_0 are near-lossless. Q4_K_M is the practical size/quality sweet spot. ## Relevant Concepts - [[concepts/quantization]] - [[concepts/gguf-format]] - [[entities/binary-llama-quantize]] ## Source Metadata - **Type:** arXiv preprint (single-author, NOT peer reviewed — treat results as indicative, one model only). - **Author:** Uygar Kurt. - **Date:** 2026-01-11 — recent; current Llama-3.1 era, not stale. - **URL:** https://arxiv.org/abs/2601.14277 (DOI 10.48550/arXiv.2601.14277). - **Confidence:** medium. Numbers transcribed from the arXiv HTML by an extraction model; verify against the PDF before load-bearing citation. No KL-divergence figures were surfaced in the captured tables. --- title: "Bartowski's 'Which file should I choose?' quant decision guide (Qwen3-8B GGUF)" type: summary tags: [quantization, imatrix, accuracy, community, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-bartowski-quant-guide.md"] confidence: medium llama_build: "n/a (community source; quantized with llama.cpp b5200)" source_url: "https://huggingface.co/bartowski/Qwen_Qwen3-8B-GGUF" --- # Bartowski's 'Which file should I choose?' quant decision guide (Qwen3-8B GGUF) ## Key Points - Canonical community decision tree for picking a GGUF quant. Boilerplate that appears on nearly every bartowski model card. 
- **Sizing rule:** figure out RAM+VRAM budget first. For max SPEED, fit the whole model in VRAM — pick a quant file 1-2GB smaller than total VRAM. For max QUALITY, add system RAM + VRAM, pick a quant 1-2GB smaller than that total. - **K-quant vs I-quant rule:** if you don't want to think, grab a K-quant (`QX_K_X` format, e.g. Q5_K_M). If aiming below Q4 AND running cuBLAS (Nvidia) or rocBLAS (AMD), use I-quants (`IQX_X`, e.g. IQ3_M) — newer, better performance per size. I-quants run on CPU too but slower than K-quant equivalents (speed/quality tradeoff). - **Recommended quants** (marked *recommended* on card): Q6_K_L, Q6_K, Q5_K_L, Q5_K_M, Q5_K_S, Q4_K_L, Q4_K_M (default for most use cases), Q4_K_S, IQ4_XS. - **`_L` suffix quants** (Q6_K_L, Q5_K_L, Q4_K_L, Q3_K_XL, Q2_K_L) = standard quant with embed + output weights bumped to Q8_0. Credited to ZeroWw inspiration. - Quant sizes for Qwen3-8B: bf16 16.39GB; Q8_0 8.71GB; Q6_K 6.73GB; Q4_K_M 5.03GB; Q4_K_S 4.80GB; IQ4_XS 4.56GB; Q3_K_M 4.12GB; IQ3_M 3.90GB; Q2_K 3.28GB; IQ2_M 3.05GB. - Quality labels: Q4_K_M "Good quality, default size"; IQ4_XS "Decent quality, smaller than Q4_K_S with similar performance"; IQ3_M "comparable to Q3_K_M"; Q2_K "Very low quality but surprisingly usable"; Q3_K_S "Low quality, not recommended." - **ARM/AVX:** Q4_0_X_X interleaved formats are deprecated as of llama.cpp **b4282** — use Q4_0 instead, which does "online repacking" on the fly (PR #9921). IQ4_NL also repacks for ARM (PR #10541, 4_4 only) for slightly better quality. - All quants made with the imatrix option using bartowski's calibration dataset (kalomaze + Dampf assisted). Defers detailed quality charts to Artefact2's gist. ## Relevant Concepts - [[concepts/quantization]] — full K-quant / I-quant / legacy-format taxonomy and the size/quality tradeoff this card operationalizes. - [[concepts/imatrix]] — all these quants are imatrix-weighted; embed/output-to-Q8_0 trick is layered on top. 
- [[concepts/gguf-format]] — file-per-quant distribution model; multi-part split for >50GB. - [[concepts/build-and-backends]] — cuBLAS/rocBLAS gating of I-quants; ARM/AVX online repacking tied to build b4282. - [[entities/binary-llama-quantize]] — the tool that produces these `QX_K_X` / `IQX_X` outputs. ## Source Metadata - Type: community (HF model card) - Author/platform: bartowski / Hugging Face - Date: unknown exact; Qwen3-8B + llama.cpp b5200 era (~2025). FLAG: undated; ARM advice tied to specific builds (b4282) and may drift. - URL: https://huggingface.co/bartowski/Qwen_Qwen3-8B-GGUF --- title: "Apple Silicon llama-bench Scoreboard (M1–M5, LLaMA 7B, F16/Q8_0/Q4_0)" type: summary tags: [benchmarking, performance, kv-cache, community, metal] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-bench-apple-silicon.md"] confidence: medium llama_build: "n/a (community source, base build commit 8e672ef dated 2023-11-21; later chip rows added on varying builds)" source_url: "https://github.com/ggml-org/llama.cpp/discussions/4167" --- # Apple Silicon llama-bench Scoreboard (M1–M5, LLaMA 7B, F16/Q8_0/Q4_0) Canonical community-maintained `llama-bench` table for Apple Silicon. Model: LLaMA 7B (v2). PP = prompt processing at `-p 512` (batch 512, compute-bound). TG = text generation at `-n 128` (batch 1, bandwidth-bound). All layers on GPU (`-ngl 99`). Base commit `8e672ef` (2023-11-21); later M3/M4/M5 rows contributed over time on differing builds. ## Key Points - **M1 (8-core GPU, 68 GB/s):** Q4_0 PP 117.96 t/s, Q4_0 TG 14.15 t/s. [LLaMA 7B, commit 8e672ef] - **M1 Max (32-core, 400 GB/s):** F16 PP 599.53 / TG 23.03; Q4_0 PP 530.06 / TG 61.19 t/s. - **M1 Ultra (64-core, 800 GB/s):** F16 PP 1168.89 / TG 37.01; Q4_0 PP 1030.04 / TG 83.73 t/s. - **M2 Ultra (76-core, 800 GB/s):** F16 PP 1401.85 / TG 41.02; Q8_0 PP 1248.59 / TG 66.64; Q4_0 PP 1238.48 / TG 94.27 t/s — highest TG in the table. 
- **M3 Max (40-core, 400 GB/s):** F16 PP 779.17 / TG 25.09; Q4_0 PP 759.70 / TG 66.31 t/s. - **M3 Ultra (80-core, 800 GB/s):** F16 PP 1538.34 / TG 39.78; Q8_0 PP 1487.51 / TG 63.93; Q4_0 PP 1471.24 / TG 92.14 t/s — highest PP in the table. - **M4 (10-core, 120 GB/s):** F16 PP 230.18 / TG 7.43; Q4_0 PP 221.29 / TG 24.11 t/s. - **M4 Max (40-core, 546 GB/s):** F16 PP 922.83 / TG 31.64; Q8_0 PP 891.94 / TG 54.05; Q4_0 PP 885.68 / TG 83.06 t/s. - **M4/M5 coverage incomplete:** M4 Ultra and all M5 rows unfilled (✗) as of fetch. - **Regime distinction:** PP (compute-bound) is *higher* for F16 than for quantized models because PP is FLOPS-limited; TG (bandwidth-bound) is *higher* for Q4_0 (less data to move per token). E.g. M1 Ultra 64c: F16 PP 1168.89 but TG only 37.01, vs Q4_0 TG 83.73. ## Relevant Concepts - [[concepts/kv-cache-and-context]] - [[entities/binary-llama-bench]] - [[entities/backend-metal]] - [[concepts/quantization]] - [[concepts/build-and-backends]] ## Source Metadata - **Type:** Community GitHub Discussion (#4167), living/maintained scoreboard. - **Author:** ggml-org / llama.cpp community contributors. - **Date:** Base commit 2023-11-21 (`8e672ef`); STALENESS FLAG — table is ~2.5 yrs old at base, later rows added incrementally; absolute t/s figures predate many Metal kernel improvements and should be treated as point-in-time floors, not current performance. - **Hardware:** Apple Silicon M1 through M5 variants (per-row GPU-core count and memory bandwidth listed). - **Build:** `8e672ef` for base; per-row build not individually annotated in source. 
- **URL:** https://github.com/ggml-org/llama.cpp/discussions/4167 --- title: "NVIDIA CUDA llama-bench Scoreboard (Llama 2 7B Q4_0, pp512 vs tg128, FA on/off)" type: summary tags: [benchmarking, performance, kv-cache, community, cuda] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-bench-nvidia-cuda.md"] confidence: medium llama_build: "n/a (community source, per-row commits vary: 8cf6b42, 5143fa8, 79c1160, c76b420, etc.)" source_url: "https://github.com/ggml-org/llama.cpp/discussions/15013" --- # NVIDIA CUDA llama-bench Scoreboard (Llama 2 7B Q4_0, pp512 vs tg128, FA on/off) Canonical community CUDA scoreboard. Model: Llama 2 7B Q4_0. Command: `llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1`. Two metrics, two FA states. ## Key Points (all numbers: Llama 2 7B Q4_0, per-row commit) **pp512 = PROMPT PROCESSING (compute-bound, thousands of t/s); tg128 = TEXT GENERATION (bandwidth-bound, tens-to-low-hundreds of t/s). Do not conflate.** - **RTX 5090 (32 GB GDDR7):** no-FA pp512 14073.41 / tg128 290.02; with-FA pp512 14970.15 / tg128 300.40. [commit 8cf6b42] - **RTX PRO 6000 Blackwell (96 GB):** no-FA pp512 14854.63 / tg128 274.20; with-FA pp512 16618.98 / tg128 281.11. [79c1160 / 5143fa8] — highest pp512 in table (with FA). - **H100 80 GB:** no-FA pp512 9918.34 / tg128 267.81; with-FA pp512 11263.29 / tg128 280.74. [5143fa8] - **A100 80 GB:** no-FA pp512 4849.53 / tg128 190.88; with-FA pp512 5285.96 / tg128 200.90. [5143fa8] - **RTX 4090 (24 GB):** no-FA pp512 11992.70 / tg128 186.21; with-FA pp512 14770.63 / tg128 188.96. [2241453] — FA boosts pp512 ~+23%. - **RTX 3090 (24 GB):** no-FA pp512 5174.69 / tg128 158.16; with-FA pp512 5560.06 / tg128 161.89. [c76b420] - **RTX 3090 Ti:** no-FA pp512 6567.49 / tg128 171.19; with-FA pp512 6924.01 / tg128 172.26. [9c35706] - **DGX Spark (128 GB LPDDR5x):** no-FA pp512 3062.31 / tg128 57.21; with-FA pp512 3661.37 / tg128 56.74. 
[5acd455] - **FA effect:** Flash Attention generally raises pp512 substantially on Blackwell/Ada/Hopper (e.g. 4090 pp512 11992→14770) and modestly lifts or holds tg128. On some older cards (e.g. Tesla V100) FA slightly lowers pp512. - **Caveat:** Quadro T1000 with-FA pp512 listed as 27.46 (= its tg128) — likely a source transcription anomaly; treat with caution. ## Relevant Concepts - [[concepts/kv-cache-and-context]] - [[entities/binary-llama-bench]] - [[entities/backend-cuda]] - [[concepts/quantization]] - [[concepts/build-and-backends]] ## Source Metadata - **Type:** Community GitHub Discussion (#15013), living/maintained CUDA scoreboard. - **Author:** ggml-org / llama.cpp community contributors (one row per contributor). - **Date:** STALENESS FLAG — undated as a whole; per-row commit hashes span multiple 2025-era builds; rows were measured on different builds, so cross-GPU comparison carries build-skew. Not a single-build snapshot. - **Hardware:** Broad NVIDIA range — Blackwell (5090, PRO 6000), Hopper (H100), Ampere (A100, 3090/Ti), Ada (4090, 6000 Ada), down to Pascal/Kepler (GTX 1060, Tesla K80), plus DGX Spark and Jetson AGX Orin. - **Build:** Per-row commit in last column (e.g. 8cf6b42, 5143fa8, 79c1160, c76b420). 
- **URL:** https://github.com/ggml-org/llama.cpp/discussions/15013 --- title: "Community Benchmarks & Quant Guides — Catalog" type: summary tags: [catalog, benchmarks, quantization, community] updated: 2026-06-09 confidence: medium sources: [raw/community/community-bench-apple-silicon.md, raw/community/community-bench-nvidia-cuda.md, raw/community/community-dgxspark-kv-quant.md, raw/community/community-bartowski-quant-guide.md, raw/community/community-kaitchup-gguf-guide.md, raw/community/community-steelphoenix-guide.md, raw/community/community-unsloth-dynamic-ggufs.md, raw/community/community-mradermacher-imatrix.md, raw/community/community-artefact2-quant-table.md, raw/community/community-arxiv-quant-eval.md, raw/community/community-pr1684-kquants.md, raw/community/community-smcleod-kv-quant.md, raw/community/community-gh15180-vllm-vs-llamacpp.md, raw/community/community-redhat-vllm-vs-llamacpp.md, raw/community/community-hf-gguf-usage.md] --- # Community Benchmarks & Quant Guides — Catalog Map of the 15 provenance-stamped community sources in `raw/community/` (each carries its source URL + author). Use this to answer "is there data on X?" — details live in the cited files and the synthesis pages. 
## Hardware benchmarks - **Apple Silicon** `llama-bench` scoreboard (llama.cpp GitHub discussion #4167) - **NVIDIA CUDA** `llama-bench` scoreboard (llama.cpp GitHub discussion #15013) - **DGX Spark KV-cache quantization** measurements ## Quantization guides & evaluations - **bartowski's quant guide** (HF model-card conventions — the de-facto community quant naming) - **Kaitchup GGUF guide** · **SteelPh0enix guide** · **HF GGUF usage docs** - **Unsloth dynamic GGUFs** (dynamic quantization approach) - **mradermacher imatrix** practices - **Artefact2 quant comparison table** · **arXiv quant evaluation** (academic eval) · **PR #1684** (the original k-quants design) ## KV-cache quantization - **smcleod KV-quant guide** + the DGX Spark measurements above ## Engine comparisons - **vLLM vs llama.cpp**: GitHub discussion #15180 + Red Hat's comparison ([[syntheses/llamacpp-vs-vllm]] synthesizes these) Related: [[syntheses/quant-types-compared]] · [[concepts/cli-and-tools-reference]] (llama-bench). --- title: "DGX Spark KV-Quant Benchmarks (Mar 2026, build 8399): q4_0 used MORE memory + ~92% slower — use q8_0" type: summary tags: [benchmarking, performance, kv-cache, community, cuda] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-dgxspark-kv-quant.md"] confidence: medium llama_build: "n/a (community source, llama.cpp build 8399, aarch64+CUDA, 2026-03-31)" source_url: "https://forums.developer.nvidia.com/t/kv-cache-quantization-benchmarks-on-dgx-spark-q4-0-vs-q8-0-vs-f16-llama-cpp-nemotron-30b-128k-context/365138" --- # DGX Spark KV-Quant Benchmarks (Mar 2026, build 8399): q4_0 used MORE memory + ~92% slower — use q8_0 NVIDIA DGX Spark forum benchmark of KV-cache quant (q4_0 vs f16, llama.cpp build 8399) on Nemotron-3-Nano-30B-A3B (Q4_K_XL MoE), 128K context window. CONFLICTS with the common "q4_0 KV saves ~66% VRAM" wisdom: here q4_0 used *more* memory than f16, and its prompt-processing throughput collapsed (about -92%) at ~64K context. 
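For context on why KV quantization is normally expected to save memory, here is a rough sketch of the raw storage arithmetic. It assumes ggml's block layouts (32 elements per block with one fp16 scale, so 34 bytes per block for q8_0 and 18 bytes for q4_0) and a plain attention KV cache. The model dimensions in the example are hypothetical and are not those of Nemotron-3-Nano-30B-A3B.

```python
# Bytes per stored element, assuming ggml block layouts:
# f16 = 2 bytes; q8_0 = 34 bytes per 32 elements; q4_0 = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> float:
    """Raw K+V storage for n_ctx tokens, ignoring working buffers and metadata."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


def savings_vs_f16(cache_type: str) -> float:
    """Fraction of raw KV storage saved relative to f16."""
    return 1.0 - BYTES_PER_ELEMENT[cache_type] / BYTES_PER_ELEMENT["f16"]


# Hypothetical dense model: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size_mib = kv_cache_bytes(32, 8, 128, 8192, cache_type) / 2**20
    print(f"{cache_type}: {size_mib:.0f} MiB, saves {savings_vs_f16(cache_type):.1%} vs f16")
```

This counts only the stored cache values. The measurements on this page are of process RSS on one platform and build, which also includes working buffers and other overhead, so measured totals need not track this estimate.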
## Key Points (all: DGX Spark GB10 / 128 GB unified LPDDR5x, build 8399, Nemotron-30B-A3B Q4_K_XL, 128K ctx) - **Prompt processing (t/s), f16 vs q4_0:** ~8K 371.3 vs 363.4 (-2.1%); ~16K 360.7 vs 346.2 (-4.0%); ~32K 328.3 vs 316.9 (-3.5%); **~64K 282.7 vs 21.3 (-92.5% — collapse)**. - **Generation (t/s), f16 vs q4_0:** ~8K 14.7 vs 14.2 (-3.4%); ~16K 13.9 vs 12.7 (-8.6%); ~32K 13.5 vs 11.0 (-18.5%); **~64K 13.3 vs 8.6 (-35.3%)**. - **Memory (RSS), f16 vs q4_0:** ~8K 1.25 vs 1.34 GB (+7%); ~32K 1.59 vs 1.69 GB (+6%); **~64K 1.94 vs 2.06 GB (+6%)** — q4_0 KV uses MORE memory, not less. - **Cause (per author):** q4_0 dequantization overhead during prompt processing tanks PP at 64K; metadata/working-buffer overhead exceeds compression gains on this unified-memory + MoE + build combo, so q4_0 ends up larger than f16. - **Recommendation:** <16K → f16 default; 16–64K → `--cache-type-k q8_0 --cache-type-v q8_0`; 64K+ → wait for TurboQuant or use TRT-LLM + NVFP4. Author: **"q8_0 is the only quantization worth running"** (2x compression, <5% speed hit across all ctx). ## Relevant Concepts - [[concepts/kv-cache-and-context]] - [[concepts/quantization]] - [[entities/backend-cuda]] - [[entities/binary-llama-bench]] - [[concepts/build-and-backends]] ## Source Metadata - **Type:** Community NVIDIA Developer Forum thread (DGX Spark / GB10 Projects). - **Author:** forum user "nmaine". - **Date:** 2026-03-31. STALENESS FLAG — recent at fetch but tied to a single evolving build (8399); MoE-specific behavior may change as KV-quant paths are fixed. - **Hardware:** DGX Spark (GB10, compute 12.1, 128 GB unified LPDDR5x). Results are platform/build/model-specific — do NOT generalize the q4_0 anomaly to discrete-GPU CUDA or Metal. - **Build:** llama.cpp build 8399 (aarch64 + CUDA). 
- **URL:** https://forums.developer.nvidia.com/t/kv-cache-quantization-benchmarks-on-dgx-spark-q4-0-vs-q8-0-vs-f16-llama-cpp-nemotron-30b-128k-context/365138 --- title: "GH #15180: llama.cpp vs vLLM Head-to-Head (RTX 4090, Qwen2.5-3B)" type: summary tags: [comparison, deployment, performance, community, vs-vllm, vs-ollama] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-gh15180-vllm-vs-llamacpp.md"] confidence: medium llama_build: "n/a (community source, 2025-08-08)" source_url: "https://github.com/ggml-org/llama.cpp/discussions/15180" --- # GH #15180: llama.cpp vs vLLM Head-to-Head (RTX 4090, Qwen2.5-3B) llama.cpp GitHub discussion (JohannesGaessler, Aug 8 2025): a careful, matched-conditions benchmark on one RTX 4090. Result: llama.cpp took 93.6-100.2% of vLLM's time per request single-stream (i.e. slightly faster or equal), and 99.2-125.6% at 16 parallel requests (i.e. up to ~25% slower under concurrency). This is the fair apples-to-apples counterpoint to vendor benchmarks. ## Key Points - **Setup:** Single RTX 4090 (clock-locked at 1350 MHz), Qwen2.5-3B-Instruct. vLLM in BF16, llama.cpp in FP16. `llama-server -ngl 999 -fa -c 440000 --parallel 1` vs `vllm serve Qwen/Qwen2.5-3B-Instruct`. 32x parallel-count requests/run, 6 runs averaged. Max context 31744 tok (1 req) / 25600 tok (16 parallel). - **Single request:** llama.cpp needed **93.6-100.2%** of vLLM's runtime — i.e. mostly a 3-6% speed advantage for llama.cpp, converging to roughly even at deepest contexts. - **16 parallel requests:** llama.cpp needed **99.2-125.6%** of vLLM's runtime — vLLM faster by 0.2% up to 25.6%, with the gap widest at short context / high gen tokens and shrinking (even crossing in llama.cpp's favor, -0.8%) at very deep contexts. - **Why the concurrency gap:** author attributes it to llama.cpp's higher constant runtime per generated token; suggests moving samplers (top-k/top-p/min-p) into the ggml graph, more op fusion, and FP16/BF16 graph compute as fixes. 
- **Caveats noted by author:** vLLM used BF16 (same speed as FP16 for vLLM) while llama.cpp used FP16 due to then-incomplete BF16 support; vLLM CUDA backend crashed at very deep contexts so its deep-context fit is unreliable. - **Later update (Apr 2026 comment, matiaslin):** paged-attention work for llama.cpp scales to 247 concurrent sequences (vs unified cache OOM at 26), reaching 2.5x aggregate throughput, and is within ~3% of the unified approach at equal concurrency — narrowing the high-concurrency gap with vLLM. - **Takeaway:** Single-user/low-concurrency, llama.cpp is competitive-to-faster than vLLM on the same GPU. vLLM's edge is specifically batched concurrency, and even that is modest (~25% max here) on a small dense model — far from the 35-44x in vendor large-scale-concurrency tests. ## Relevant Concepts - [[concepts/server-api]] - [[entities/binary-llama-server]] - [[entities/project-llama-cpp]] - [[concepts/build-and-backends]] - [[concepts/quantization]] ## Source Metadata - **Type:** Community source — GitHub discussion thread on ggml-org/llama.cpp. - **Author:** JohannesGaessler (llama.cpp contributor; no vendor incentive to favor vLLM, which strengthens credibility of the result being unflattering to llama.cpp at scale). - **Date:** 2025-08-08 ⚠ FLAG: ~10 months stale relative to this KB (2026-05-30); superseded in part by later paged-attention work referenced in-thread (Apr 2026). - **URL:** https://github.com/ggml-org/llama.cpp/discussions/15180 - No vendor bias flag; this is a self-critical first-party llama.cpp benchmark. 
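The "percent of vLLM's runtime" figures in this summary are time ratios, so the equivalent throughput ratio is their reciprocal. The following is a minimal sketch of that conversion, using only the four ratios quoted above; the function name is illustrative and does not come from the thread.

```python
def relative_throughput(runtime_ratio: float) -> float:
    """Throughput of llama.cpp relative to vLLM, given llama.cpp's
    runtime as a fraction of vLLM's runtime for the same work."""
    return 1.0 / runtime_ratio


# Runtime ratios quoted in the summary (llama.cpp runtime / vLLM runtime).
RATIOS = {
    "single request, best case": 0.936,
    "single request, worst case": 1.002,
    "16 parallel, best case": 0.992,
    "16 parallel, worst case": 1.256,
}

for label, ratio in RATIOS.items():
    delta_pct = (relative_throughput(ratio) - 1.0) * 100.0
    print(f"{label}: {delta_pct:+.1f}% throughput vs vLLM")
```

The two views are not symmetric: a runtime that is 25.6% longer corresponds to roughly 20% lower throughput, which is worth stating explicitly when quoting either number in a script.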
---
title: "HF Doc: GGUF usage with llama.cpp (-hf repo-pull, install, server /v1)"
type: summary
tags: [deployment, build, server-api, community, beginner, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-hf-gguf-usage.md"]
confidence: medium
llama_build: "n/a (community source, undated living doc)"
source_url: "https://huggingface.co/docs/hub/gguf-llamacpp"
---
# HF Doc: GGUF usage with llama.cpp

Hugging Face's first-party doc for running any compatible GGUF with llama.cpp via the `-hf` repo-pull workflow. Covers install (brew/winget/source), the auto-download + cache behavior (`LLAMA_CACHE` env var), and hitting the OpenAI-compatible server endpoint with curl. Commands captured verbatim.

## Key Points (verbatim commands)

**Install with brew (Mac/Linux) or winget (Windows):**
```bash
brew install llama.cpp
```
```bash
winget install llama.cpp
```

**Build from source** (add hardware flags, e.g. `-DGGML_CUDA=1`/`-DGGML_CUDA=ON`):
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build   # optionally, add -DGGML_CUDA=ON to activate CUDA
cmake --build build --config Release
```
For ROCm / SYCL etc., see llama.cpp's build guide.

**Run CLI via `-hf` repo:tag pull** (auto-downloads + caches the GGUF):
```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```
Add `-no-cnv` for raw completion mode (non-chat).

**Run the OpenAI-spec server via `-hf`:**
```bash
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

**Call the OpenAI-compatible chat completions endpoint** (default localhost:8080; auth header accepts a placeholder `no-key`):
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'
```
Replace the repo argument after `-hf` with any valid HF hub repo name. Cache location is set by the `LLAMA_CACHE` environment variable. GGUFs can also be deployed on HF Inference Endpoints via the llama.cpp container.

## Relevant Concepts
- [[concepts/server-api]]
- [[concepts/gguf-format]]
- [[concepts/build-and-backends]]
- [[entities/binary-llama-server]]
- [[entities/binary-llama-cli]]
- [[entities/project-llama-cpp]]

## Source Metadata
- Type: official first-party documentation (Hugging Face Hub docs)
- Author: Hugging Face
- Date: undated living doc. **STALENESS NOTE**: current and first-party, but unversioned; the `-hf repo:tag` shorthand, `-no-cnv`, the `/v1/chat/completions` endpoint, and the `Authorization: Bearer no-key` placeholder all match current llama-server convention. Build snippet uses minimal `cmake -B build` (omits `-DLLAMA_BUILD_SERVER=ON`, which defaults ON in current builds). The example uses the older `ggerganov/llama.cpp` repo URL (now `ggml-org/llama.cpp`).
- URL: https://huggingface.co/docs/hub/gguf-llamacpp

---
title: "Choosing a GGUF Model: K-Quants, I-Quants, Legacy Formats (Kaitchup)"
type: summary
tags: [quantization, imatrix, accuracy, community, advanced]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-kaitchup-gguf-guide.md"]
confidence: medium
llama_build: "n/a (community source, 2025-10-13)"
source_url: "https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i"
---
# Choosing a GGUF Model: K-Quants, I-Quants, and Legacy Formats

Benjamin Marie's taxonomy/guidance article. **Partially paywalled:** only the free portion was captured; the "Accuracy, Size, and Speed of GGUF Models" section (likely with the numeric tables) is behind the paywall and NOT available.
## Key Points - **Three families:** legacy (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0), K-quants (Q2_K–Q6_K), I-quants (IQ2/IQ3/IQ4). - **Legacy:** per-block linear quant, single scale (symmetric) or scale+zero-point (asymmetric). Q8_0 is "effectively near-lossless for most LLMs." Q4_0/Q4_1 "largely superseded by K- and I-quants for quality per bit." - **K-quants:** two-level scheme, small blocks grouped into super-blocks; S/M/L suffixes selectively raise precision on sensitive tensors. Q4_K_M = "widely useful default for 4-bit"; Q5_K_M = "close to imperceptible degradation"; Q6_K = "almost lossless." - **I-quants:** importance-matrix-based reconstruction with 256-weight super-blocks. IQ4_NL uses 32-weight blocks with non-linear mapping. - **IQ4_XS vs Q4_K_M (Llama-3.1-8B):** IQ4_XS ~4.46 bpw / 4.17 GiB; Q4_K_M ~4.89 bpw / 4.58 GiB. Q4_K_M = "reliable default"; IQ4_XS "more sensitive to how the quant was produced" (i.e., imatrix quality matters). - Author's anecdote: "Q4_K_M variant is always the most downloaded." ## Relevant Concepts - [[concepts/quantization]] - [[concepts/imatrix]] - [[concepts/gguf-format]] ## Source Metadata - **Type:** Community blog (Substack, opinionated practitioner guide). - **Author:** Benjamin Marie (The Kaitchup — AI on a Budget). - **Date:** 2025-10-13 — recent; not stale. - **URL:** https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i - **Confidence:** medium. Free portion is qualitative taxonomy only; no perplexity/KL tables captured (paywalled). Quoted recommendations are the author's, not benchmarked here. 
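The bits-per-weight and file-size pairs quoted for IQ4_XS and Q4_K_M can be sanity-checked with a one-line formula: size ≈ parameters × bpw / 8. The sketch below assumes Llama-3.1-8B has roughly 8.03 billion parameters; that count is not stated in the captured article and is an assumption here, as is the helper name.

```python
GIB = 1024 ** 3  # bytes per GiB


def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized tensor data in GiB.
    Ignores GGUF metadata and any tensors kept at another precision."""
    return n_params * bits_per_weight / 8 / GIB


# ASSUMPTION: approximate parameter count for Llama-3.1-8B (not from the article).
N_PARAMS = 8.03e9

for name, bpw in [("IQ4_XS", 4.46), ("Q4_K_M", 4.89)]:
    print(f"{name}: {bpw} bpw -> {gguf_size_gib(N_PARAMS, bpw):.2f} GiB")
```

Under that assumption the estimates land within about 0.01 GiB of the quoted 4.17 GiB and 4.58 GiB, which suggests the article's bpw figures are whole-file averages rather than nominal per-type values.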
--- title: "mradermacher i1/imatrix quant card (Phi-4-reasoning-plus): static vs weighted quants" type: summary tags: [quantization, imatrix, accuracy, community, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-mradermacher-imatrix.md"] confidence: medium llama_build: "n/a (community source, date unknown)" source_url: "https://huggingface.co/mradermacher/Phi-4-reasoning-plus-i1-GGUF" --- # mradermacher i1/imatrix quant card (Phi-4-reasoning-plus): static vs weighted quants ## Key Points - mradermacher publishes two parallel repos per model: **`-i1-GGUF`** = weighted/imatrix quants (this card); **`-GGUF`** (no i1) = static quants. The `i1-` filename prefix marks every imatrix quant. - Standing guidance line on every card: "(sorted by size, not necessarily quality. **IQ-quants are often preferable over similar sized non-IQ quants**)." - The full prose "static vs weighted/imatrix" FAQ is NOT inline in this card — it lives on the mradermacher profile / `model_requests` page. Inline, the static-vs-imatrix and IQ-vs-K tradeoffs are encoded in the per-row Notes column. 
- **Provided Quants table (Phi-4-reasoning-plus, i1/imatrix), size in GB + Notes:** - i1-IQ1_S 3.4 "for the desperate"; i1-IQ1_M 3.7 "mostly desperate" - i1-IQ2_XXS 4.2; i1-IQ2_XS 4.6; i1-IQ2_S 4.8; i1-IQ2_M 5.2 - i1-Q2_K_S 5.3 "very low quality"; i1-Q2_K 5.6 "IQ3_XXS probably better" - i1-IQ3_XXS 5.9 "lower quality"; i1-IQ3_XS 6.3; i1-IQ3_S 6.6 "beats Q3_K*"; i1-Q3_K_S 6.6 "IQ3_XS probably better"; i1-IQ3_M 7.0; i1-Q3_K_M 7.5 "IQ3_S probably better"; i1-Q3_K_L 8.0 "IQ3_M probably better" - i1-IQ4_XS 8.0; i1-IQ4_NL 8.5 "prefer IQ4_XS"; i1-Q4_0 8.5 "fast, low quality"; i1-Q4_K_S 8.5 "optimal size/speed/quality"; i1-Q4_K_M 9.2 "fast, recommended"; i1-Q4_1 9.4 - i1-Q5_K_S 10.3; i1-Q5_K_M 10.7; i1-Q6_K 12.1 "practically like static Q6_K" - Key encoded rules of thumb: IQ3_S "beats Q3_K*"; at the same/near size IQ-quant beats the K-quant (Q2_K → prefer IQ3_XXS; Q3_K_S → prefer IQ3_XS; Q3_K_M → prefer IQ3_S; Q3_K_L → prefer IQ3_M; IQ4_NL → prefer IQ4_XS). Sweet spots: **Q4_K_S = "optimal size/speed/quality"**, **Q4_K_M = "fast, recommended."** - At Q6_K the imatrix benefit vanishes: "practically like static Q6_K" (imatrix matters most at low bpw). - Card cites the same two external references the community standardizes on: ikawrakow's PPL-vs-quant graph (nethype.de/quantpplgraph.png) and Artefact2's gist. ## Relevant Concepts - [[concepts/imatrix]] — the weighted-vs-static distinction is the whole point of the `-i1-` repos; imatrix gains concentrate at low bpw. - [[concepts/quantization]] — IQ-vs-K-at-equal-size ranking and the per-quant quality ladder. - [[concepts/gguf-format]] — multi-part GGUF concatenation referenced via TheBloke README. - [[entities/binary-imatrix]] — produces the importance matrix backing these i1 quants. - [[entities/binary-llama-quantize]] — consumes the imatrix to emit the i1-* files. 
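The "prefer X" and "X probably better" notes can be read as a small lookup table. The sketch below encodes the five substitutions listed in this summary together with the sizes from the Provided Quants table, then reports how much disk space each swap costs or saves. The dictionary and function names are illustrative only.

```python
# Sizes in GB for the quants involved, copied from the table above
# (the "i1-" filename prefix is dropped for brevity).
SIZE_GB = {
    "Q2_K": 5.6, "IQ3_XXS": 5.9,
    "Q3_K_S": 6.6, "IQ3_XS": 6.3,
    "Q3_K_M": 7.5, "IQ3_S": 6.6,
    "Q3_K_L": 8.0, "IQ3_M": 7.0,
    "IQ4_NL": 8.5, "IQ4_XS": 8.0,
}

# Quant -> alternative that the card's Notes column points to.
PREFER = {
    "Q2_K": "IQ3_XXS",
    "Q3_K_S": "IQ3_XS",
    "Q3_K_M": "IQ3_S",
    "Q3_K_L": "IQ3_M",
    "IQ4_NL": "IQ4_XS",
}


def size_delta_gb(quant: str) -> float:
    """Size of the suggested alternative minus the size of `quant`, in GB."""
    return round(SIZE_GB[PREFER[quant]] - SIZE_GB[quant], 1)


for quant, alt in PREFER.items():
    print(f"{quant} -> {alt}: {size_delta_gb(quant):+.1f} GB")
```

In four of the five cases the suggested IQ-quant is also the smaller file; only the swap from Q2_K to IQ3_XXS costs extra space (0.3 GB).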
## Source Metadata - Type: community (HF model card) - Author/platform: mradermacher / Hugging Face (nethype GmbH servers; nicoboss supercomputer access) - Date: unknown; Phi-4-reasoning-plus era (~2025). FLAG: undated; note the standard FAQ prose is on a separate page, not this card. - URL: https://huggingface.co/mradermacher/Phi-4-reasoning-plus-i1-GGUF --- title: "k-quants PR #1684 — Origin of K-Quant Perplexity Tables (LLaMA-7B)" type: summary tags: [quantization, imatrix, accuracy, community, advanced] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-pr1684-kquants.md"] confidence: high llama_build: "n/a (foundational/primary source, 2023-06-05)" source_url: "https://github.com/ggml-org/llama.cpp/pull/1684" --- # k-quants PR #1684 — Origin of K-Quant Perplexity Tables This is the **foundational primary source**: ikawrakow's PR that introduced the K-quant family (Q2_K–Q6_K) into llama.cpp. It is the original definitional spec, not a "stale" secondary report. ## Key Points - Introduced 2–6 bit K-quant types with scalar, AVX2, ARM_NEON, and CUDA implementations. - **Bits/weight per type:** Q2_K = 2.5625, Q3_K = 3.4375, Q4_K = 4.5, Q5_K = 5.5, Q6_K = 6.5625. Q8_K (block size 256) used for intermediate results only. - **Block structure:** Q2_K/Q3_K/Q6_K use 16 blocks of 16 weights (super-block 256); Q4_K/Q5_K use 8 blocks of 32 weights. Q2_K = type-1 4-bit scales/mins; Q3_K = type-0 6-bit scales; Q4_K/Q5_K = type-1 6-bit scales/mins; Q6_K = type-0 8-bit scales. - **LLaMA-7B perplexity (F16 = 5.9066):** Q2_K = 6.7764, Q3_K_S = 6.4571, Q3_K_M = 6.1503, Q4_K_S = 6.0215, Q5_K_S = 5.9419, Q6_K = 5.9110. - **LLaMA-7B file sizes (F16 = 13.0G):** Q2_K = 2.67G, Q3_K_S = 2.75G, Q3_K_M = 3.06G, Q4_K_S = 3.56G, Q5_K_S = 4.33G, Q6_K = 5.15G. - **Defining claim:** "The 6-bit quantized perplexity is within 0.1% or better from the original fp16 model." (Q6_K 5.9110 vs F16 5.9066 = +0.074%.) 
- Mixed quantization: Q2_K uses Q4_K for `attention.vw` and `feed_forward.w2`; S/M/L suffixes raise precision on selective tensors. ## Relevant Concepts - [[concepts/quantization]] - [[concepts/gguf-format]] - [[entities/binary-llama-quantize]] ## Source Metadata - **Type:** Primary source / foundational spec (GitHub PR, merged into llama.cpp). - **Author:** Iwan Kawrakow (@ikawrakow). - **Date:** 2023-06-05 — FOUNDATIONAL, not stale; this is the origin spec for K-quants. Numbers are LLaMA-1 7B era and should not be read as current model accuracy. - **URL:** https://github.com/ggml-org/llama.cpp/pull/1684 - **Confidence:** high (definitional primary source). --- title: "Red Hat: vLLM or llama.cpp - Choosing the Right Inference Engine" type: summary tags: [comparison, deployment, performance, community, vs-vllm, vs-ollama] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-redhat-vllm-vs-llamacpp.md"] confidence: medium llama_build: "n/a (community source, 2025-09-30; benchmarked llama.cpp b6100)" source_url: "https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case" --- # Red Hat: vLLM or llama.cpp - Choosing the Right Inference Engine Red Hat Developer article (Harshith Umesh, Sep 30 2025) benchmarking vLLM v0.10.0 vs llama.cpp b6100 on a single NVIDIA H200 across rising concurrency. Headline: at peak load vLLM delivered >35x the request throughput and >44x the output tokens/sec of llama.cpp. This is a concurrency-at-scale result, not a per-request speed claim. ## Key Points - **Setup:** Single NVIDIA H200-PCIe-141GB, CUDA 12.8, OpenShift 4.18.9. vLLM v0.10.0 serving Llama-3.1-8B-Instruct (bfloat16); llama.cpp b6100 serving the same model as F16 GGUF (bartowski quant) with `-ngl 99` and 64 threads/threads-batch. Benchmarked with GuideLLM v0.2.1, 1 to 64 simultaneous users, 300s per concurrency level. 
- **Headline numbers:** "At peak load, vLLM delivered more than 35 times the request throughput (RPS) and more than 44 times the total output tokens per second (TPS) compared to llama.cpp."
- **TTFT:** vLLM P99 time-to-first-token stays nearly flat as concurrency rises; llama.cpp P99 TTFT rises exponentially under concurrent load.
- **ITL:** vLLM has lower inter-token latency at low concurrency (1-4 users); llama.cpp keeps an extremely low and stable ITL at higher loads.
- **Framing in article:** vLLM is built for high-throughput multi-user serving; llama.cpp is built for single-stream efficiency and portability.
- **CAUTION:** The 35x/44x figures come from saturating both engines with up to 64 concurrent users. This measures vLLM's continuous-batching/PagedAttention advantage at scale, NOT a general "vLLM is 40x faster" claim. At single-stream the gap collapses (see community-gh15180). This is NOT an apples-to-apples comparison: it contrasts a batched GPU server against an engine running near single-stream defaults.

## Relevant Concepts
- [[concepts/server-api]]
- [[entities/binary-llama-server]]
- [[entities/project-llama-cpp]]
- [[concepts/build-and-backends]]
- [[concepts/quantization]]

## Source Metadata
- **Type:** Community / vendor article (Red Hat Developer)
- **Author:** Harshith Umesh
- **Date:** 2025-09-30 ⚠ FLAG: ~8 months stale relative to this KB (2026-05-30); both engines move fast. llama.cpp paged-attention work post-dates this and narrows the high-concurrency gap (see gh15180 Apr 2026 comment).
- **URL:** https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case
- **⚠ VENDOR BIAS:** Red Hat is a major corporate backer/maintainer of vLLM. The framing favors vLLM. Benchmark is not apples-to-apples (batched multi-user serving vs largely single-stream llama.cpp).
--- title: "KV-Cache Quantisation Quality & VRAM (Sam McLeod, Dec 2024): q8_0 near-lossless" type: summary tags: [benchmarking, performance, kv-cache, community, cuda, metal, rocm] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-smcleod-kv-quant.md"] confidence: medium llama_build: "n/a (community source, 2024-12-04; specific llama.cpp build not stated)" source_url: "https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/" --- # KV-Cache Quantisation Quality & VRAM (Sam McLeod, Dec 2024): q8_0 near-lossless Sam McLeod's writeup on bringing llama.cpp KV-cache quantisation (q8_0 / q4_0 vs f16) to Ollama. Establishes the widely-cited "q8_0 KV is near-lossless, q4_0 saves ~66% VRAM with noticeable but usable quality loss" baseline. ## Key Points - **VRAM (generic 8B model @ 32K context):** F16 KV ~6 GB; **q8_0 KV ~3 GB (50% cut)**; **q4_0 KV ~2 GB (66% cut)**. - **Perplexity (Qwen 2.5 Coder 7B, Q6_K model weights):** F16/F16 KV baseline ppl **8.3891 ±0.02016**; **q8_0/q8_0 KV ppl 8.3934 ±0.02017 → +0.0043 (near-lossless)**. - **q4_0 KV quality:** adds ~**+0.206 to +0.25 perplexity** — "noticeable" but usable. - **K-cache more sensitive than V-cache:** per llama.cpp research, "the K cache seems to be much more sensitive to quantization than the V cache" → asymmetric quant (higher-precision K, lower-precision V) is a viable optimization. - **Requirements:** KV-quant requires Flash Attention enabled. Set in Ollama via `OLLAMA_KV_CACHE_TYPE="q8_0"`. - **Recommendation:** **q8_0 as the default** — minimal quality impact for normal text generation. - **Backend support:** Apple Silicon (Metal), NVIDIA (CUDA, Pascal+), AMD (ROCm); auto-fallback to F16 where unsupported. 
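The VRAM cuts above follow largely from how each cache type stores a block of 32 values. The sketch below estimates KV-cache size from that block layout alone. The model dimensions are hypothetical (a Llama-3-8B-like shape chosen for illustration, not taken from the article), and the sketch assumes K and V use the same cache type.

```python
# Bytes per 32-element block:
#   f16  -> 32 two-byte values
#   q8_0 -> 32 one-byte values plus one fp16 scale
#   q4_0 -> 32 four-bit values plus one fp16 scale
BLOCK_ELEMS = 32
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> int:
    """Idealized KV-cache size: K and V tensors only, no working buffers."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2 = K and V
    return elements // BLOCK_ELEMS * BLOCK_BYTES[cache_type]


# HYPOTHETICAL model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16_bytes = kv_cache_bytes(**shape, cache_type="f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({size / f16_bytes:.0%} of f16)")
```

The absolute sizes depend entirely on the assumed shape, but the ratios depend only on the block layout: about 53% of f16 for q8_0 and 28% for q4_0, in the same range as the article's rounded 50% and 66% cuts. Measured usage also includes compute buffers that this sketch ignores.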
## Relevant Concepts
- [[concepts/kv-cache-and-context]]
- [[concepts/quantization]]
- [[entities/backend-cuda]]
- [[entities/backend-metal]]
- [[entities/backend-rocm]]
- [[concepts/build-and-backends]]

## Source Metadata
- **Type:** Community blog post (personal site, smcleod.net).
- **Author:** Sam McLeod.
- **Date:** 2024-12-04. STALENESS FLAG: ~18 months old at fetch; predates later llama.cpp KV-quant kernel changes; ppl figure is model-specific (Qwen 2.5 Coder 7B Q6_K).
- **Hardware:** Not GPU-specific for the quality numbers; VRAM figure is a generic 8B @ 32K estimate. Backend support spans Metal/CUDA/ROCm.
- **Build:** Specific llama.cpp build not stated.
- **URL:** https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/

---
title: "Community Guide: SteelPh0enix — llama.cpp from scratch (build, quantize, run)"
type: summary
tags: [deployment, build, server-api, community, beginner, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-steelphoenix-guide.md"]
confidence: medium
llama_build: "n/a (community source, 2024-12-25)"
source_url: "https://blog.steelph0enix.dev/posts/llama-cpp-guide/"
---
# Community Guide: SteelPh0enix — llama.cpp from scratch

A deep end-to-end reference (~13.8k words, HN front page) covering toolchain setup, building from source, downloading a HF model, converting to GGUF, quantizing, and running the server/CLI/bench tools. The raw mirror is a structured extraction of the technical commands (not the author's prose, for copyright reasons). Late-2024 content; staleness of CLI flags and the WebUI is noted under Source Metadata.

## Key Points (verbatim commands)

**Clone + submodules:**
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git submodule update --init --recursive
```

**CMake configure (CPU) + build + install:**
```bash
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/your/install/dir \
  -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON
# X is a placeholder for the number of parallel build jobs
cmake --build build --config Release -j X
cmake --install build --config Release
```

**Vulkan backend:** add `-DGGML_VULKAN=ON` to the configure step; CUDA uses `-DGGML_CUDA=ON`. Verify devices with `llama-cli --list-devices`.

**Get a model (skip LFS smudge):**
```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct
```

**Convert HF -> GGUF:**
```bash
python -m pip install --upgrade -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt
python llama.cpp/convert_hf_to_gguf.py SmolLM2-1.7B-Instruct --outfile ./SmolLM2.gguf
```

**Quantize** (`llama-quantize [nthreads]`):
```bash
# N is a placeholder for the thread count
llama-quantize SmolLM2.gguf SmolLM2.q8.gguf Q8_0 N
```

**Run server** (defaults: host 127.0.0.1, port 8080):
```bash
llama-server -m SmolLM2.q8.gguf
```
Notable flags: `--host`, `--port`, `--ctx-size`, `--predict` (-1 unlimited, -2 context limit), `--threads`, `--gpu-layers`, `--flash-attn`, `--mlock`, `--no-mmap`.

**Bench / CLI:**
```bash
llama-bench --flash-attn 1 --model ./SmolLM2.q8.gguf -pg 1024,256
llama-cli --flash-attn --model ./SmolLM2.q8.gguf --prompt "You are helpful" --conversation
```

**Sampling defaults recommended:** Temp 0.2–2.0, Top-K 40, Top-P 0.95, Min-P 0.05; full GPU offload `--gpu-layers 999`; quant sweet spots Q8_0 / Q6_K / Q5.
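The recommended sampling values can also be sent per request instead of being set on the command line. The sketch below builds a chat-completions payload carrying them and posts it to a locally running `llama-server` on the default host and port given above. It assumes the server accepts the llama.cpp-specific fields `top_k` and `min_p` alongside the OpenAI-style ones on `/v1/chat/completions`; the function names and the temperature of 0.8 (one point inside the guide's 0.2 to 2.0 range) are illustrative choices, not taken from the guide.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, temperature: float = 0.8, top_k: int = 40,
                       top_p: float = 0.95, min_p: float = 0.05) -> dict:
    """Chat request body carrying the guide's suggested sampling values."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_k": top_k,    # ASSUMPTION: llama.cpp-specific field accepted here
        "top_p": top_p,
        "min_p": min_p,    # ASSUMPTION: llama.cpp-specific field accepted here
    }


def chat(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires `llama-server -m SmolLM2.q8.gguf` to be running locally.
    print(chat("Write a limerick about Python exceptions"))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log the exact sampling settings before any request is sent.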
## Relevant Concepts - [[concepts/build-and-backends]] - [[concepts/gguf-format]] - [[concepts/quantization]] - [[concepts/server-api]] - [[concepts/sampling-parameters]] - [[entities/binary-llama-server]] - [[entities/binary-llama-cli]] - [[entities/binary-llama-quantize]] - [[entities/binary-llama-bench]] - [[entities/backend-vulkan]] - [[entities/project-llama-cpp]] ## Source Metadata - Type: community blog guide (personal, long-form) - Author: SteelPh0enix (blog.steelph0enix.dev) - Date: published 2024-10-28, updated 2024-12-25 — **STALENESS FLAG**: server flags/defaults and the bundled WebUI described are late-2024 conventions and predate 2025/26 changes; the `-hf` repo-pull workflow is not the default path here. Cross-check flag names/defaults against the current `tools/server/README.md`. Core flags (`--host/--port/--ctx-size/--n-predict/ --threads/--gpu-layers/--flash-attn`) and the build/convert/quantize workflows remain valid. - URL: https://blog.steelph0enix.dev/posts/llama-cpp-guide/ --- title: "Unsloth Dynamic 2.0 GGUFs — layerwise quant + self-reported KL/MMLU benchmarks (VENDOR)" type: summary tags: [quantization, imatrix, accuracy, community, vendor, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/community/community-unsloth-dynamic-ggufs.md"] confidence: medium llama_build: "n/a (vendor docs; page updated 2026-04-20)" source_url: "https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" --- # Unsloth Dynamic 2.0 GGUFs — layerwise quant + self-reported KL/MMLU benchmarks (VENDOR) ## Key Points - **VENDOR SOURCE — all numbers self-reported by Unsloth; not independently verified.** Filenames carry a `UD-` prefix (e.g. `UD-IQ2_XXS`) to mark Unsloth Dynamic quants. - **Core method:** layer-selective quantization. Unlike fixed schemes, Dynamic 2.0 "dynamically adjust[s] the quantization type of every possible layer," with combinations differing per layer and per model (Gemma 3's quantized layers differ from Llama 4's). 
  Now works on all architectures (MoE + non-MoE), not just MoE as in the original DeepSeek-R1 1.58-bit work.
- **Calibration:** new dataset >1.5M tokens, hand-curated, chat-optimized. Argues text-only Wikipedia calibration overfits and is "not effective for instruct models" (unique chat templates). Uses Calibration_v3/v5. For fair KLD benchmarking they DON'T use their chat-optimized set; they test on standard Wikipedia datasets.
- **Metric stance:** KL Divergence is "one of the gold standards" (per "Accuracy is Not All You Need") because it correlates with "flips" (answers changing correct↔incorrect). Perplexity is rejected because "output token values can cancel out." Goal: minimize mean KLD per GB.
- **Efficiency metric:** Efficiency = (MMLU 5-shot − 25) / Disk Space GB. (Minus 25 because random guessing over A/B/C/D scores 25%.) The Efficiency values in the 27B table below are consistent with the +QAT score, e.g. IQ2_M: (64.47 − 25) / 8.96 ≈ 4.40.
- **Gemma 3 12B KLD, Baseline imatrix vs New Dynamic 2.0 (lower=better, KLD / GB):** IQ1_S 1.035688/5.83 → 0.972932/6.06; IQ1_M 0.832252 → 0.800049; IQ2_XXS 0.535764 → 0.521039; IQ2_M 0.26554 → 0.258192; Q2_K_XL 0.229671 → 0.220937; Q3_K_XL 0.087845 → 0.080617; Q4_K_XL 0.024916 → 0.023701. The Dynamic 2.0 KLD is lower in every row, at modestly larger file sizes (only the IQ1_S sizes are reproduced here).
- **Gemma 3 27B MMLU 5-shot (Unsloth | +QAT | GB | Efficiency):** IQ1_S 41.87/43.37/6.06/3.03; IQ1_M 48.10/47.23/6.51/3.42; IQ2_XXS 59.20/56.57/7.31/4.32; IQ2_M 66.47/64.47/8.96/4.40; Q2_K 68.50/67.60/9.78/4.35; Q2_K_XL 68.70/67.77/9.95/4.30; IQ3_XXS 68.27/67.07/10.07/4.18; Q3_K_M 70.70/69.77/12.51/3.58; Q3_K_XL 70.87/69.50/12.76/3.49; Q4_K_M 71.23/71.00/15.41/2.98; Q4_K_XL 71.47/71.07/15.64/2.94; Q5_K_M 71.77/71.23/17.95/2.58; Q6_K 71.87/71.60/20.64/2.26; Q8_0 71.60/71.53/26.74/1.74; Google QAT —/70.64/17.2/2.65.
- **Headline claim:** Dynamic Q4_K_XL (15.64GB, MMLU 71.47) is "2GB smaller whilst having +1% extra accuracy vs the QAT version" (Google QAT 17.2GB, 70.64). By the table values the gap is 1.56 GB and +0.83 points, so the quoted figures are rounded up. Best efficiency lands around IQ2_M (4.40) / Q2_K (4.35).
- **Gemma 3 12B QAT reference:** Q4_0 QAT MMLU 67.07% vs full bf16 67.15%, 7.52GB. - **Llama 4 fixes (collab w/ Meta):** RoPE scaling (llama.cpp PR #12889); QK Norm epsilon should be 1e-05 not 1e-06; QK Norm head-sharing bug — fixing it raised MMLU Pro 68.58% → 71.53%. Also fixed Llama 3.1 8B MMLU implementation (wrong impl gives 35%, correct ~68.2%; "A" vs "_A" tokenization +0.4%; append "The best answer is"). ## Relevant Concepts - [[concepts/quantization]] — layerwise/mixed-precision quant scheme vs uniform K/I quants. - [[concepts/imatrix]] — Dynamic builds on imatrix calibration; central critique is calibration-dataset choice. - [[concepts/gguf-format]] — ships GGUFs with `UD-` prefix + extra formats (Q4_NL, Q5.1/5.0, Q4.1/4.0). - [[concepts/build-and-backends]] — Llama 4 RoPE/QK-Norm fixes landed in llama.cpp builds; CUDA build commands given. - [[entities/binary-llama-quantize]] — underlying quantizer Unsloth drives with custom per-layer recipes. ## Source Metadata - Type: community (VENDOR docs) - Author/platform: Unsloth AI. VENDOR BIAS: every benchmark is Unsloth's own; framing favors Dynamic 2.0 over generic imatrix and over Google QAT. Cross-check against Artefact2 / independent runs. - Date: page updated 2026-04-20 (Qwen3.6/Gemma 4); benchmarks shown are Gemma 3 / Llama 4 era. Relatively fresh but moving target. - URL: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs --- title: "Building llama.cpp Locally (Backend Matrix)" type: summary tags: [build, cpu, cuda, metal, vulkan, rocm, sycl, musa, cann, blas, developer] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/docs-build.md"] confidence: high llama_build: "master (~2026-05)" --- # Building llama.cpp Locally (Backend Matrix) ## Key Points - Main product is the `llama` library (C-style interface in `include/llama.h`); repo also ships many example programs/tools (incl. an OpenAI-compatible HTTP server). - Get the code: `git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp`. 
- Canonical CMake build flow: `cmake -B build` then `cmake --build build --config Release`. Add `-j 8` (or use Ninja) for parallel compile; install `ccache` for faster repeated builds. - Debug builds: single-config generators use `-DCMAKE_BUILD_TYPE=Debug`; multi-config (`-G "Xcode"`, Visual Studio) use `--config Debug`. - Static build: `-DBUILD_SHARED_LIBS=OFF`. - Windows: install Visual Studio 2022 (Desktop dev with C++, CMake tools, clang, MS-Build for LLVM). WoA/ARM64 uses presets `arm64-windows-llvm-release` (with `-D GGML_OPENMP=OFF`) or `x64-windows-llvm-release`. - Optional HTTPS/TLS: install OpenSSL dev libs (`libssl-dev` / `openssl-devel` / `openssl`); without it the project still builds and runs but with no SSL support. - BLAS (`-DGGML_BLAS=ON`): helps prompt processing at batch sizes > 32; does not affect generation speed. Select implementation via `-DGGML_BLAS_VENDOR=...` (OpenBLAS, `Intel10_64lp` for oneMKL, `Generic`, BLIS, etc.). Apple Accelerate is enabled by default on Mac. - Metal: enabled by default on macOS (runs compute on GPU). Disable at compile with `-DGGML_METAL=OFF`; disable GPU inference at runtime with `--n-gpu-layers 0`. - SYCL: supports Intel GPUs (Data Center Max/Flex, Arc, built-in/iGPU). See `docs/backend/SYCL.md`. - CUDA: `-DGGML_CUDA=ON`. Non-native (all GPUs) build adds `-DGGML_NATIVE=OFF`; specify archs via `-DCMAKE_CUDA_ARCHITECTURES="86;89"`; pick a CUDA install via `-DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc`. - CUDA runtime env vars: `CUDA_VISIBLE_DEVICES`, `CUDA_SCALE_LAUNCH_QUEUES=4x` (helps multi-GPU pipeline parallelism), `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F`/`_16F`, `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` (RAM fallback on Linux), `GGML_CUDA_P2P` (peer access). Compile-time perf options: `GGML_CUDA_FORCE_MMQ`, `GGML_CUDA_FORCE_CUBLAS`, `GGML_CUDA_PEER_MAX_BATCH_SIZE` (default 128), `GGML_CUDA_FA_ALL_QUANTS`. 
- MUSA (Moore Threads GPU): `-DGGML_MUSA=ON`; archs via `-DMUSA_ARCHITECTURES="21"`; runtime `MUSA_VISIBLE_DEVICES`. Reuses many CUDA options. - HIP (AMD ROCm GPUs): `-DGGML_HIP=ON` with `-DGPU_TARGETS=gfx1030` (optional; omit to build for all detected GPUs). rocWMMA flash-attn boost via `-DGGML_HIP_ROCWMMA_FATTN=ON`. Runtime `HIP_VISIBLE_DEVICES`, `HSA_OVERRIDE_GFX_VERSION` (not on Windows). UMA via `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`. - Vulkan: `-DGGML_VULKAN=ON` (or `=1`). Needs Vulkan SDK + SPIRV-Headers (`spirv-headers` / `spirv-headers-devel`). On macOS uses MoltenVK or KosmicKrisp via `VK_ICD_FILENAMES`; combine with `-DGGML_METAL=OFF`. - CANN (Ascend NPU): `-DGGML_CANN=on -DCMAKE_BUILD_TYPE=release`. - ZenDNN (AMD EPYC CPUs): `-DGGML_ZENDNN=ON` (auto-downloads/builds ZenDNN on first build, 5-10 min). - Arm KleidiAI (CPU microkernels): `-DGGML_CPU_KLEIDIAI=ON`; SME control via env `GGML_KLEIDIAI_SME`. - OpenCL (Adreno GPU): `-DGGML_OPENCL=ON` (Android NDK / Windows ARM64 instructions provided). - WebGPU: `-DGGML_WEBGPU=ON` (relies on Dawn; browser builds via Emscripten + emdawnwebgpu). - OpenVINO (Intel CPU/GPU/NPU): see `docs/backend/OPENVINO.md` (in progress). - Multiple backends can be built together (e.g. `-DGGML_CUDA=ON -DGGML_VULKAN=ON`); select at runtime with `--device` (`--list-devices` to enumerate). Fully disable GPU with `--device none` (even `-ngl 0` may still use GPU). Dynamic backend loading via `GGML_BACKEND_DL`. ## Relevant Concepts - [[concepts/build-and-backends]] — this is the canonical build/backend reference: cmake flow plus the per-backend enable flags. - [[concepts/server-api]] — the OpenAI-compatible server is one of the built tools; SSL/OpenSSL note applies. - [[entities/binary-llama-cli]] — used in backend verification examples (`-ngl`, `--device none`). - [[entities/backend-cpu]] — default build target; BLAS/KleidiAI/ZenDNN augment it. - [[entities/backend-cuda]] — `-DGGML_CUDA=ON`, NVIDIA. 
- [[entities/backend-metal]] — default on macOS. - [[entities/backend-vulkan]] — `-DGGML_VULKAN=ON`, cross-vendor GPU. - [[entities/backend-rocm]] — HIP path, `-DGGML_HIP=ON`, AMD. - [[concepts/build-and-backends]] — `-DGGML_SYCL=ON` path, Intel GPU. ## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + docs/build.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md --- title: "Function Calling in llama.cpp" type: summary tags: [function-calling, server-api, llama-server, api, well-established, developer] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/docs-function-calling.md"] confidence: high llama_build: "master (~2026-05)" --- # Function Calling in llama.cpp ## Key Points - Function calling is implemented in `common/chat.h` (PR #9639) and is used by `llama-server` when started with the `--jinja` flag. It supports OpenAI-style function/tool calling. - Universal support: works for **all** models via two paths — **Native** format handlers (model-specific parsers) and a **Generic** fallback used when the template isn't recognized (you'll see `Chat format: Generic` in the logs). Generic may consume more tokens and be less efficient than a native format. - Native tool-call formats supported: Llama 3.1 / 3.2 / 3.3 (incl. builtin tools `wolfram_alpha`, `web_search`/`brave_search`, `code_interpreter`), Functionary v3.1 / v3.2, Hermes 2/3, Qwen 2.5, Qwen 2.5 Coder, Mistral Nemo, Firefunction v2, Command R7B, DeepSeek R1 (WIP — reluctant to call tools). - Multiple/parallel tool calling is supported on some models but **disabled by default**; enable by passing `"parallel_tool_calls": true` in the completion endpoint payload. - The server requires a tool-aware Jinja template. Verify support by inspecting `chat_template` or `chat_template_tool_use` in `http://localhost:8080/props`. 
- Start command (native): `llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M`. Other verified models: Mistral-Nemo-Instruct-2407, Llama-3.3-70B-Instruct, ibm-granite/granite-4.1-3b. - Some GGUFs need a template override via `--chat-template-file`, e.g. DeepSeek R1 distills use `models/templates/llama-cpp-deepseek-r1.jinja` (official template is buggy; llama.cpp works around it). Functionary, Hermes 2 Pro / Hermes 3, Firefunction v2, Command R7B each have their own template file under `models/templates/`. - Generic-format example launches: `llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0`, `gemma-2-2b-it`, `c4ai-command-r-v01`. - TIP: if no official `tool_use` template exists, try `--chat-template chatml` as a default (YMMV) or write your own. - CAUTION: extreme KV quantizations (e.g. `-ctk q4_0`) can substantially degrade tool-calling performance. - Official templates can be fetched with `scripts/get_chat_template.py`; the format-mapping table can be regenerated with `./build/bin/test-chat ../minja/build/tests/*.jinja` (note: doc has a TODO since the minja dependency was removed). - Tested via the OpenAI-compatible `/v1/chat/completions` endpoint by supplying a `tools` array (each tool: `{"type":"function","function":{"name","description","parameters":{...JSON schema...}}}`). - Response: tool calls returned with `finish_reason: "tool"` and `message.tool_calls` array (each with `name` and `arguments` as a JSON-encoded string), `message.content: null`. ## Relevant Concepts - [[concepts/function-calling]] — this doc is the canonical source for the feature, supported model formats, and template requirements. - [[concepts/server-api]] — function calling is exercised via `/v1/chat/completions` with `tools` / `parallel_tool_calls`. - [[concepts/server-api]] — the `--jinja` flag and `--chat-template-file` overrides are server flags. - [[entities/binary-llama-server]] — the binary that hosts function calling. 
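The request and response shapes summarized in the Key Points above can be sketched in Python. This is a minimal illustration, not the server's reference client: the `get_weather` tool, its schema, and the sample response are hypothetical. The parser accepts both the flat `name`/`arguments` shape described above and an OpenAI-style nested `function` object; which one a given build returns is an assumption to verify against your server.

```python
import json

# Hypothetical tool definition in the {"type": "function", "function": {...}} shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(user_message, tools, parallel=False):
    """Build a /v1/chat/completions payload that offers tools to the model."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
    }
    if parallel:
        # Parallel tool calls are disabled by default, so opt in explicitly.
        payload["parallel_tool_calls"] = True
    return payload


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        # Accept the nested OpenAI-style shape or the flat shape.
        fn = call.get("function", call)
        # `arguments` arrives as a JSON-encoded string, not an object.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hypothetical response, hand-written to match the shape described above.
SAMPLE_RESPONSE = {
    "choices": [
        {
            "finish_reason": "tool",
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
                ],
            },
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(build_tool_request("Weather in Paris?", [WEATHER_TOOL]), indent=2))
    print(extract_tool_calls(SAMPLE_RESPONSE))
```

Decoding `arguments` with `json.loads` is the step most often missed: the field is a string, so passing it straight to a function as keyword arguments fails.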
## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + docs/function-calling.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md --- title: "Installing Pre-built llama.cpp (Package Managers)" type: summary tags: [build, deployment, well-established] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/docs-install.md"] confidence: high llama_build: "master (~2026-05)" --- # Installing Pre-built llama.cpp (Package Managers) ## Key Points - Pre-built install matrix: Winget (Windows), Homebrew (Mac + Linux), MacPorts (Mac), Nix (Mac + Linux). - Winget (Windows): `winget install llama.cpp` — auto-updated with new releases. - Homebrew (Mac & Linux): `brew install llama.cpp` — formula auto-updated with new releases. - MacPorts (Mac): `sudo port install llama.cpp`. - Nix (flake): `nix profile install nixpkgs#llama-cpp`. - Nix (non-flake): `nix-env --file '<nixpkgs>' --install --attr llama-cpp` — expression auto-updated in the nixpkgs repo. - These are the no-compile install paths; building from source is covered separately in `docs/build.md`, and Docker/release-binary paths are noted in the README. ## Relevant Concepts - [[concepts/build-and-backends]] — install is the pre-built alternative to building per-backend from source. - [[entities/binary-llama-cli]] — installed binaries include `llama-cli`. - [[concepts/server-api]] — installed binaries include `llama-server`.
## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + docs/install.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md --- title: "Multimodal Input in llama.cpp (libmtmd) — Models & Usage" type: summary tags: [multimodal, vision, llama-server, llama-cli, well-established, developer, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/docs-multimodal.md"] confidence: high llama_build: "master (~2026-05)" --- # Multimodal Input in llama.cpp (libmtmd) — Models & Usage ## Key Points - llama.cpp supports multimodal input via `libmtmd`. Two tools support it: `llama-mtmd-cli` and `llama-server` (via OpenAI-compatible `/chat/completions` API). - Supported modalities: **image** and **audio**. Audio is highly experimental and may have reduced quality. - Two ways to enable: (1) `-hf ` to pull a pre-quantized model; (2) `-m model.gguf --mmproj file.gguf` to specify the text model and the multimodal projector separately. - With `-hf`: `--no-mmproj` disables multimodal; `--mmproj local_file.gguf` uses a custom projector. - By default the multimodal projector is offloaded to GPU; disable with `--no-mmproj-offload`. - Example invocations: - `llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF` - `llama-server -hf ggml-org/gemma-3-4b-it-GGUF` - `llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf` - `llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload` - Pre-quantized models (mostly `Q4_K_M` by default) live in the ggml-org Hugging Face "multimodal-ggufs" collection. Some models need a large context window, e.g. `-c 8192`. 
- **Vision models listed**: Gemma 3 (4b/12b/27b-it), SmolVLM (Instruct, 256M, 500M) & SmolVLM2 (2.2B, 256M-Video, 500M-Video), Pixtral 12B, Qwen2-VL (2B/7B), Qwen2.5-VL (3B/7B/32B/72B), Mistral Small 3.1 24B (IQ2_M), InternVL 2.5 (1B/4B) & InternVL3 (1B/2B/8B/14B), Llama 4 Scout (17B-16E), Moondream2 (20250414), Gemma 4 (E2B/E4B/26B-A4B/31B-it). - **Audio models listed**: Ultravox 0.5 (llama-3.2-1b, llama-3.1-8b), Mistral Voxtral-Mini-3B-2507, Qwen3-ASR (0.6B/1.7B). (Qwen2-Audio/SeaLLM-Audio have no pre-quantized GGUF — poor results.) - **Mixed modalities (audio + vision)**: Qwen2.5-Omni (3B/7B), Qwen3-Omni-30B-A3B (Instruct/Thinking), Gemma 4 (E2B/E4B-it). - OCR models (PaddleOCR-VL, GLM-OCR, Deepseek-OCR, Dots.OCR, HunyuanOCR) are trained with specific prompt/input structure — see linked PRs for correct usage. - Find more vision GGUFs on HF via `pipeline_tag=image-text-to-text` + search `gguf`. ## Relevant Concepts - [[concepts/multimodal-mtmd]] — this doc is the user-facing model list + run guide for mtmd. - [[entities/binary-mtmd]] — `llama-mtmd-cli` is one of the two tools documented here. - [[concepts/server-api]] — the other multimodal tool, via `/chat/completions`. ## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + docs/multimodal.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md --- title: "GGUF File Format Specification" type: summary tags: [gguf, ggml, foundational, well-established, developer] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/gguf-spec.md"] confidence: high llama_build: "master (~2026-05)" --- # GGUF File Format Specification ## Key Points - GGUF is a binary, single-file, `mmap`-compatible model format for GGML-based inference; successor to GGML, GGMF, and GGJT. 
Its core difference from GGJT is a typed key-value metadata structure (instead of a list of untyped hyperparameters), making it extensible without breaking compatibility. - Models are little-endian by default; big-endian variants exist (introduced in v3). There is currently no in-file flag to detect endianness — assume little-endian if unspecified. - File layout: `gguf_header_t` (magic `GGUF` = bytes `0x47 0x47 0x55 0x46`; `version` uint32; `tensor_count` uint64; `metadata_kv_count` uint64; then the KV pairs) → `gguf_tensor_info_t[]` → padding → `tensor_data[]`. - Global alignment is set by `general.alignment` (uint32, must be a multiple of 8; default `32` if absent). Tensor data offsets must be multiples of `ALIGNMENT`; `align_offset(offset) = offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT`. - Metadata keys must be valid ASCII, hierarchical `lower_snake_case` segments separated by `.`, at most 65535 (2^16-1) bytes. Tensor names must be standard GGUF strings at most 64 bytes long. Tensors currently have at most 4 dimensions. - `gguf_metadata_value_type` enum (uint32): UINT8=0, INT8=1, UINT16=2, INT16=3, UINT32=4, INT32=5, FLOAT32=6, BOOL=7 (1-byte, 0=false/1=true), STRING=8, ARRAY=9, UINT64=10, INT64=11, FLOAT64=12. Strings are UTF-8, non-null-terminated, length-prepended (`uint64 len`). Counts/lengths are `uint64` by convention; readers should also support `uint32`. - `ggml_type` enum (uint32) values include: F32=0, F16=1, Q4_0=2, Q4_1=3, Q5_0=6, Q5_1=7, Q8_0=8, Q8_1=9, Q2_K=10, Q3_K=11, Q4_K=12, Q5_K=13, Q6_K=14, Q8_K=15, IQ2_XXS=16, IQ2_XS=17, IQ3_XXS=18, IQ1_S=19, IQ4_NL=20, IQ3_S=21, IQ2_S=22, IQ4_XS=23, I8=24, I16=25, I32=26, I64=27, F64=28, IQ1_M=29, BF16=30, TQ1_0=34, TQ2_0=35, MXFP4=39, GGML_TYPE_COUNT=40. Removed: Q4_2=4, Q4_3=5; the Q4_0_4_4/4_8/8_8 (31–33) and IQ4_NL_4_4/4_8/8_8 (36–38) repacked types are no longer stored in GGUF files. 
- Required general metadata: **`general.architecture`** (string, `[a-z0-9]+`; known values `llama`, `mpt`, `gptneox`, `gptj`, `gpt2`, `bloom`, `falcon`, `mamba`, `rwkv`), **`general.quantization_version`** (uint32; required only if any tensor is quantized), **`general.alignment`** (uint32). - `general.file_type` (uint32) enumerates the majority tensor type: ALL_F32=0, MOSTLY_F16=1, MOSTLY_Q4_0=2, MOSTLY_Q4_1=3, MOSTLY_Q4_1_SOME_F16=4, MOSTLY_Q8_0=7, MOSTLY_Q5_0=8, MOSTLY_Q5_1=9, MOSTLY_Q2_K=10, MOSTLY_Q3_K_S=11, MOSTLY_Q3_K_M=12, MOSTLY_Q3_K_L=13, MOSTLY_Q4_K_S=14, MOSTLY_Q4_K_M=15, MOSTLY_Q5_K_S=16, MOSTLY_Q5_K_M=17, MOSTLY_Q6_K=18 (Q4_2=5/Q4_3=6 removed). This list visibly lags the `ggml_type` enum (no IQ-series, Q8_K, etc.). - Per-architecture LLM keys use an `[llm].` prefix: `context_length` (n_ctx), `embedding_length` (n_embd), `block_count`, `feed_forward_length` (n_ff), `attention.head_count` (n_head), `attention.head_count_kv` (GQA), `rope.dimension_count`, `rope.freq_base`, `rope.scaling.{type,factor,original_context_length,finetuned}` (type ∈ none/linear/yarn), `expert_count`/`expert_used_count` (MoE), and SSM keys for Mamba (`ssm.conv_kernel`, `ssm.inner_size`, `ssm.state_size`, `ssm.time_step_rank`). - Tokenizer metadata under `tokenizer.ggml.*`: `model` (llama/replit/gpt2/rwkv), `tokens`, `scores`, `token_type` (1=normal, 2=unknown, 3=control, 4=user defined, 5=unused, 6=byte), `merges`, `added_tokens`, plus special-token IDs (`bos_token_id`, `eos_token_id`, `unknown_token_id`, `separator_token_id`, `padding_token_id`). Chat template via `tokenizer.chat_template` (Jinja); full HF tokenizer via `tokenizer.huggingface.json`. - GGUF filename naming convention: `[].gguf`, `-`-delimited. Sidecars: `mmproj` (multimodal projector), `mtp` (multi-token-prediction draft module). Type values: `LoRA`, `vocab` (default = tensor model). Shards formatted `-of-`, zero-padded 5 digits, starting at `00001`. Minimum valid set = BaseName + SizeLabel + Version. 
SizeLabel scale prefixes: Q(uadrillion), T(rillion), B(illion), M(illion), K(thousand). - Standardized tensor names: base layers `token_embd`, `pos_embd`, `output_norm`, `output`; per-block `blk.N.BB` with `attn_norm`, `attn_q/k/v`, `attn_qkv`, `attn_output`, `ffn_norm`, `ffn_up/gate/down`, plus MoE variants `ffn_gate_inp`, `ffn_{gate,down,up}_exp`, and SSM tensors `ssm_in`, `ssm_conv1d`, `ssm_x`, `ssm_a`, `ssm_d`, `ssm_dt`, `ssm_out`. - Format version history: v1 initial; v2 widened countable values from uint32 to uint64; v3 added big-endian support. Spec version field must be `3`. ## Relevant Concepts - [[concepts/gguf-format]] — this document is the canonical GGUF binary layout and metadata spec - [[concepts/quantization]] — the `ggml_type` and `general.file_type` enums enumerate all quant types GGUF can carry - [[concepts/kv-cache-and-context]] — RoPE/YaRN scaling and context-length metadata keys live here (`rope.scaling.*`, `context_length`) ## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/ggml + `docs/gguf.md` - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md --- title: "GBNF Guide — Constraining Output with Grammars" type: summary tags: [grammars, sampling, structured-output, llama-server, llama-cli, well-established, developer, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/grammars-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # GBNF Guide — Constraining Output with Grammars ## Key Points - GBNF (GGML BNF) is a format for defining formal grammars to constrain model outputs in `llama.cpp` (e.g. force valid JSON, or emoji-only output). Supported in `tools/cli`, `tools/completion`, and `tools/server`. - It extends Backus-Naur Form (BNF) with regex-like features. You define *production rules*: `nonterminal ::= sequence...` — non-terminals (rule names) expand into sequences of terminals (Unicode code points) and other non-terminals. 
- Non-terminal names must be dashed lowercase words (`move`, `castle`, `check-mate`). - Terminals: literal sequences like `"1"`, `"O-O"`; or character ranges like `[1-9]`, `[NBKQR]`. Full Unicode supported, directly (`[ぁ-ゟ]`) or via escapes: 8-bit `\xXX`, 16-bit `\uXXXX`, 32-bit `\UXXXXXXXX`. Ranges negated with `^`: `single-line ::= [^\n]+ "\n"`. - Alternatives use `|`; grouping uses parentheses `()`; sequence order matters. - Repetition/optional operators: `*` = zero or more (`{0,}`), `+` = one or more (`{1,}`), `?` = optional (`{0,1}`), `{m}` exactly m, `{m,}` at least m, `{m,n}` between m and n, `{0,n}` at most n. - Tokens: match tokenizer tokens rather than characters. `<[token-id]>` (e.g. `<[1000]>`) matches by token ID; a token's text wrapped in angle brackets matches by token text (only if it tokenizes to exactly one token, else the grammar fails to parse). Negate either form with a `!` prefix, e.g. `!<[1000]>`, to match any token except that one. - Comments use `#`. Newlines allowed between rules and inside parentheses; a newline after `|` continues the current rule even outside parentheses. - The `root` rule always defines the grammar's starting point / what the entire output must match. - Sample chess grammar root: `root ::= ( "1. " move " " move "\n" ([1-9] [0-9]? ". " move " " move "\n")+ )` and `move ::= (pawn | nonpawn | castle) [+#]?`. - Usage: server completion endpoints accept a `grammar` body field; `llama-cli` / `llama-completion` accept `--grammar` & `--grammar-file` flags; `test-gbnf-validator` tests grammars against strings. - Try a grammar file: `./llama-cli -m <model> --grammar-file grammars/some-grammar.gbnf -p 'Some prompt'`. - JSON Schema → GBNF: `llama.cpp` converts a subset of json-schema.org to GBNF. Server: `json_schema` body field on completion endpoints, or inside `response_format` on `/chat/completions` (`{type:"json_object", schema:...}` or `{type:"json_schema", json_schema:{schema:...}}`). CLI: `--json` / `-j` flag. Ahead-of-time: `examples/json_schema_to_grammar.py name-age-schema.json`.
- NOTE: the JSON schema only constrains output and is NOT injected into the prompt (model has no visibility into the schema) — except for tool calling, where schemas ARE injected. - `additionalProperties` defaults to `false` in the converter (faster grammars, fewer hallucinations) even though the JSON Schema spec defaults it to `true`. Set `"additionalProperties": true` to allow extras (may produce keys with unescaped newlines). Pydantic: `extra='allow'`; Zod: `passthrough()`/`nonstrict()` (though zod-to-json-schema always emits `false`). - Known JSON-schema limitations: can't mix `properties` with `anyOf`/`oneOf`; `prefixItems` broken (`items` works); `minimum`/`maximum`/exclusive bounds only for `integer` not `number`; nested `$ref`s broken; `pattern`s must start `^` end `$`; remote `$ref`s unsupported in C++ (Python/JS fetch them); no `uri`/`email` string formats; no `patternProperties`. Unlikely-ever: `uniqueItems`, `contains`/`minContains`, `$anchor`, `not`, conditionals (`if`/`then`/`else`/`dependentSchemas`). - Performance gotcha: `x? x? x?...` (N repetitions) can make sampling extremely slow; use `x{0,N}` instead (or N-deep nesting `(x (x ...)?)?` in older versions). See issue #4218. ## Relevant Concepts - [[concepts/gbnf-grammars]] — this doc IS the canonical GBNF syntax + usage reference. - [[concepts/sampling-parameters]] — grammars act as a constraint applied during sampling. - [[concepts/server-api]] — exposes grammars via `grammar` / `json_schema` / `response_format` body fields. - [[entities/binary-llama-cli]] — exposes grammars via `--grammar`, `--grammar-file`, `--json`/`-j`. 
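To make the two constraint paths from the Key Points above concrete, here is a small Python sketch that builds request bodies for each: a raw GBNF grammar in the `grammar` field, and a JSON Schema wrapped in `response_format`. The grammar, the schema, and the helper names are illustrative assumptions; only the body field names come from the summary above, and nothing is sent over the network.

```python
import json

# Illustrative grammar: the whole output must be "yes" or "no", optionally
# followed by a 1-3 digit score. Bounded repetition uses {m,n} rather than a
# chain of `x? x? x?`, which the doc warns can make sampling very slow.
YES_NO_GRAMMAR = r'''
root   ::= answer (" " score)?
answer ::= "yes" | "no"
score  ::= [0-9]{1,3}
'''.strip()

# Illustrative schema, loosely modeled on a name/age record.
NAME_AGE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}


def grammar_completion_body(prompt, grammar):
    """Body for a completion endpoint constrained by a raw GBNF grammar."""
    return {"prompt": prompt, "grammar": grammar}


def schema_chat_body(prompt, schema, allow_extra_keys=False):
    """Body for /chat/completions constrained by a JSON Schema."""
    schema = dict(schema)  # shallow copy so the caller's dict is untouched
    if allow_extra_keys:
        # The converter treats additionalProperties as false unless told otherwise.
        schema["additionalProperties"] = True
    return {
        # The schema is not injected into the prompt, so describe the
        # expected fields in the message text as well.
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_schema", "json_schema": {"schema": schema}},
    }


if __name__ == "__main__":
    print(json.dumps(grammar_completion_body("Is water wet?", YES_NO_GRAMMAR), indent=2))
    print(json.dumps(schema_chat_body("Return name and age as JSON.", NAME_AGE_SCHEMA), indent=2))
```

Because the model never sees the schema outside tool calling, the prompt in `schema_chat_body` should still say what fields are wanted; the grammar only rules out invalid continuations.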
## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + grammars/README.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md --- title: "llama-imatrix Tool README" type: summary tags: [imatrix, quantization, accuracy, advanced, developer, well-established] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/imatrix-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # llama-imatrix Tool README ## Key Points - `llama-imatrix` computes an importance matrix (imatrix) for a model over a given text/calibration dataset; the imatrix is consumed during quantization (via `llama-quantize --imatrix`) to improve quantized-model quality. - Usage: `./llama-imatrix -m model.gguf -f some-text.txt [-o imatrix.gguf] [--output-format {gguf,dat}] [--no-ppl] [--process-output] [--chunk 123] [--save-frequency 0] [--output-frequency 10] [--in-file imatrix-prev-0.gguf ...] [--parse-special] [--show-statistics]`. `-m | --model` and `-f | --file` (calibration data, e.g. `wiki.train.raw`) are mandatory. - Key flags: `-o | --output-file` (default `imatrix.gguf`), `-ofreq | --output-frequency` (save computed result every N chunks; default 10), `--output-format` (`gguf` default or legacy `dat`), `--save-frequency` (save a separate snapshot copy every N chunks; default 0 = never), `--process-output` (also collect data for `output.weight`; default false — typically better not to use the imatrix on `output.weight`), `--in-file` (load and combine one or more existing imatrix files; repeatable, for merging runs/datasets), `--parse-special` (parse special tokens like `<|im_start|>`), `--chunk | --from-chunk` (skip the first N chunks), `--chunks` (max chunks to process; default -1 = all), `--no-ppl` (skip perplexity calc for speed), `-lv | --verbosity` (0/1/>=2; default 1), `--show-statistics` (display imatrix file stats). 
- Output format: recent `llama-imatrix` versions store data in GGUF format by default; to get the legacy binary `.dat` format, use a non-`.gguf` extension or `--output-format dat`. Format conversion is bidirectional via `--in-file` + `--output-format` (gguf↔dat). - GPU offloading via `-ngl | --n-gpu-layers` speeds up computation (examples use `-ngl 99`). - Example end-to-end: `./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99` then `./llama-quantize --imatrix imatrix.gguf ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m`. Combine imatrices: `./llama-imatrix --in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf -o imatrix-combined.gguf`. - `--show-statistics` reports per-tensor metrics: Σ(Act²) (sum of squared activations = importance scores), Min & Max, μ & σ (mean/std of squared activations), % Active (fraction of elements whose mean squared activation exceeds threshold 1e-5 — tensor liveness), N (count of squared activations), Entropy (Shannon entropy of squared-activation distribution in bits, S = −Σ pᵢ log₂ pᵢ), E (norm) (normalized entropy = entropy / log₂ N), ZD Score (z-score distribution per "Layer-Wise Quantization", arXiv 2406.17415 §3.1), and CosSim (cosine similarity vs previous layer's tensor). Per-layer: weighted averages of Σ(Act²), ZD Score, and CosSim. - Important caveat from the doc: all statistics are computed on **squared** activations, not raw activations — they are useful but less reliable, and CosSim can be misleading when a tensor contains opposite vectors. 
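The entropy statistics listed above follow directly from the stated formulas, S = −Σ pᵢ log₂ pᵢ and E (norm) = S / log₂ N. The sketch below works through them on a plain list of squared activations. It is an illustration under assumptions: the function names are hypothetical, and details such as how `llama-imatrix` handles zero entries or which values it compares against the 1e-5 threshold are not confirmed by the summary.

```python
import math


def activation_entropy(squared_activations):
    """Return (entropy_bits, normalized_entropy) for a list of squared activations.

    Each value is turned into a probability p_i = value / sum(values);
    entropy is S = -sum(p_i * log2(p_i)), and the normalized form divides
    by log2(N) so that a perfectly uniform tensor scores 1.0.
    """
    n = len(squared_activations)
    total = sum(squared_activations)
    if n == 0 or total <= 0:
        return 0.0, 0.0
    entropy = 0.0
    for value in squared_activations:
        p = value / total
        if p > 0:  # treat 0 * log2(0) as 0 (assumed convention)
            entropy -= p * math.log2(p)
    normalized = entropy / math.log2(n) if n > 1 else 0.0
    return entropy, normalized


def active_fraction(mean_squared_activations, threshold=1e-5):
    """Fraction of elements whose mean squared activation exceeds the threshold."""
    if not mean_squared_activations:
        return 0.0
    active = sum(1 for v in mean_squared_activations if v > threshold)
    return active / len(mean_squared_activations)


if __name__ == "__main__":
    print(activation_entropy([1.0, 1.0, 1.0, 1.0]))  # uniform importance
    print(activation_entropy([5.0, 0.0, 0.0, 0.0]))  # one dominant element
    print(active_fraction([1e-6, 1e-3, 0.5, 0.0]))
```

A uniform tensor gives the maximum entropy (log₂ N bits), while a tensor dominated by one element gives zero, which is why normalized entropy is a convenient way to compare tensors of different sizes.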
## Relevant Concepts - [[concepts/imatrix]] — core subject: importance-matrix computation and statistics - [[concepts/quantization]] — imatrix feeds quantization to reduce accuracy loss - [[entities/binary-imatrix]] — this README documents that binary's CLI directly - [[entities/binary-llama-quantize]] — downstream consumer via `--imatrix` - [[concepts/gguf-format]] — default imatrix output is GGUF; legacy `.dat` alternative ## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + `tools/imatrix/README.md` - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/imatrix/README.md --- title: "llama-bench — Performance Benchmarking Tool" type: summary tags: [llama-bench, benchmarking, performance, kv-cache, well-established, developer, intermediate] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/llama-bench-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # llama-bench — Performance Benchmarking Tool ## Key Points - `llama-bench` (tools/llama-bench) is the performance testing tool for llama.cpp. It reports throughput in average tokens per second (t/s) with standard deviation. - Three test types: prompt processing (`pp`, batches a prompt, set via `-p`/`--n-prompt`), text generation (`tg`, generates tokens, via `-n`/`--n-gen`), and combined prompt-processing + text-generation (`pg`, via `-pg <pp,tg>`). - Measurements do NOT include tokenization or sampling time (explicit NOTE in docs). - Each test is repeated `-r` times (default 5) and averaged. JSON output includes per-repetition `samples_ns`/`samples_ts`. - Multi-value sweeps: with the exception of `-r`, `-o`, `-v`, every option can be given multiple values (comma-separated `-n 16,32` or repeated `-n 16 -n 32`). Each pp/tg test runs across all combinations. Ranges: `first-last`, `first-last+step`, `first-last*mult`. - Context-depth testing: `-d, --n-depth <n>` (default 0) prefills the KV cache with `<n>` tokens; results print as e.g.
`pp512 @ d512`. - Key test-parameter defaults: `-m/--model` (default `models/7B/ggml-model-q4_0.gguf`); `-p/--n-prompt` 512; `-n/--n-gen` 128; `-b/--batch-size` 2048; `-ub/--ubatch-size` 512; `-ctk/-ctv` cache type f16; `-t/--threads` system dependent; `-ngl/--n-gpu-layers` -1; `-ncmoe/--n-cpu-moe` 0; `-sm/--split-mode` layer; `-mg/--main-gpu` 0; `-nkvo/--no-kv-offload` 0; `-fa/--flash-attn` auto; `-dev/--device` auto; `-mmp/--mmap` 1; `-dio/--direct-io` 0; `-embd/--embeddings` 0; `-ts/--tensor-split` 0; `-ot/--override-tensor`; `-nopo/--no-op-offload` 0; `--no-host` 0. - Run-control options: `-r/--repetitions` (default 5); `--prio` (-1..3); `--delay` seconds (default 0); `--numa <mode>`; `--no-warmup`; `--progress`; `--list-devices`; `-v/--verbose`; `-rpc/--rpc` (register RPC devices); `-fitt/--fit-target` MiB and `-fitc/--fit-ctx` (default 4096). - Model can be pulled from Hugging Face: `-hf/--hf-repo <user>/<model>[:quant]` (default quant Q4_K_M), `-hff/--hf-file`, `-hft/--hf-token` (or HF_TOKEN env). - Output formats via `-o, --output <format>` (default `md`); `-oe, --output-err` mirrors to stderr (default none). SQL output is importable into SQLite via the `sqlite3` CLI. - Markdown table columns: model, size, params, backend, ngl, test, t/s (plus n_batch/threads columns when those are swept). Structured formats add build_commit, build_number, cpu_info, gpu_info, backends, model_size, model_n_params, avg_ns, stddev_ns, avg_ts, stddev_ts, etc. ## Relevant Concepts - [[entities/binary-llama-bench]] — the binary this reference documents; it is the canonical benchmarking tool, and the pp/tg/pg metrics are defined here. - [[concepts/build-and-backends]] — sweeps over ngl, split-mode, devices, flash-attn, threads to compare backend performance. - [[concepts/kv-cache-and-context]] — `-d` context-depth prefill and cache-type sweeps.
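The multi-value and range syntax described in the Key Points above (`first-last`, `first-last+step`, `first-last*mult`) can be modeled in a few lines of Python to predict how many test combinations a sweep will run. This is a sketch of the documented syntax, not a port of the tool's parser: the function names are hypothetical, and treating `last` as inclusive with a default step of `+1` is an assumption.

```python
import itertools
import re

# first[-last[(+|*)step]]
_RANGE = re.compile(r"^(\d+)(?:-(\d+)(?:([+*])(\d+))?)?$")


def expand_values(spec):
    """Expand a spec such as '16,32' or '128-512*2' into a list of ints."""
    values = []
    for part in spec.split(","):
        match = _RANGE.match(part.strip())
        if not match:
            raise ValueError(f"invalid value or range: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        op = match.group(3) or "+"
        step = int(match.group(4)) if match.group(4) else 1
        # Reject steps that would never advance the loop below.
        if step < 1 or (op == "*" and (step < 2 or first == 0)):
            raise ValueError(f"range would not terminate: {part!r}")
        current = first
        while current <= last:  # `last` assumed inclusive
            values.append(current)
            current = current + step if op == "+" else current * step
    return values


def sweep(**options):
    """Return every combination of the expanded option values as dicts."""
    names = list(options)
    grids = [expand_values(options[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*grids)]


if __name__ == "__main__":
    print(expand_values("128-512*2"))
    for combo in sweep(n_prompt="512", n_gen="128-256+128", threads="4,8"):
        print(combo)
```

Counting combinations up front is useful because every extra swept option multiplies the run time by the number of values it takes, on top of the `-r` repetitions.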
## Source Metadata - Type: official documentation (mirror) - Repo/path: ggml-org/llama.cpp + tools/llama-bench/README.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/llama-bench/README.md --- title: "llama.cpp Project README (Overview)" type: summary tags: [foundational, well-established, build, deployment, developer] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/llamacpp-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # llama.cpp Project README (Overview) ## Key Points - Tagline: "LLM inference in C/C++". Main goal: enable LLM inference with minimal setup and state-of-the-art performance across a wide range of hardware, locally and in the cloud. License: MIT. Org: ggml-org. It is the main playground for developing features for the ggml library. - Core characteristics: plain C/C++ with no dependencies; Apple silicon as first-class citizen (ARM NEON, Accelerate, Metal); AVX/AVX2/AVX512/AMX on x86; RVV/ZVFH/ZFH/ZICBOP/ZIHINTPAUSE on RISC-V; 1.5/2/3/4/5/6/8-bit integer quantization; custom CUDA kernels (AMD via HIP, Moore Threads via MUSA); Vulkan and SYCL backends; CPU+GPU hybrid inference for models larger than total VRAM. - Quick start install paths: brew/nix/winget (`docs/install.md`), Docker (`docs/docker.md`), pre-built release binaries, or build from source (`docs/build.md`). - Example commands: `llama-cli -m my_model.gguf`; download+run from HF `llama-cli -hf ggml-org/gemma-3-1b-it-GGUF`; serve `llama-server -hf ggml-org/gemma-3-1b-it-GGUF`. - Models require GGUF format; convert other formats with the repo's `convert_*.py` scripts. Download by HF arg `-hf <user>/<model>[:quant]`; default source is Hugging Face, switchable via env `MODEL_ENDPOINT`. HF tooling: GGUF-my-repo, GGUF-my-LoRA, GGUF-editor, Inference Endpoints.
- Supported backends table (target devices): Metal (Apple Silicon), BLAS (All), BLIS (All), SYCL (Intel GPU), OpenVINO [in progress] (Intel CPU/GPU/NPU), MUSA (Moore Threads GPU), CUDA (Nvidia GPU), HIP (AMD GPU), ZenDNN (AMD CPU), Vulkan (GPU), CANN (Ascend NPU), OpenCL (Adreno GPU), IBM zDNN (IBM Z & LinuxONE), WebGPU (All), RPC (All), Hexagon [in progress] (Snapdragon), VirtGPU (VirtGPU APIR). - Bundled tools/binaries documented: `llama-cli` (general CLI; conversation mode auto-activates for models with chat template, `-cnv`, `--chat-template`, GBNF grammars via `--grammar-file`), `llama-server` (OpenAI-compatible HTTP server, default port 8080, `/v1/chat/completions`, web UI, `-np` parallel decoding, `-md` speculative decoding, `--embedding --pooling cls`, `--reranking`, grammar constraints), `llama-perplexity` (perplexity / KL divergence quality metrics), `llama-bench` (performance benchmarking), `llama-simple` (minimal example for developers). - Headline supported model families (text): LLaMA 1/2/3, Mistral, Mixtral MoE, DBRX, Jamba, Falcon, Qwen, Deepseek, Gemma, Phi/PhiMoE, Mamba, Grok-1, Command-R, Granite, GLM-4, SmolLM, RWKV-6/7, Hunyuan, LFM2, and many more. Multimodal: LLaVA 1.5/1.6, BakLLaVA, MiniCPM, Moondream, Qwen2-VL, GLM-EDGE, LFM2-VL, etc. - Recent/hot topics: `gpt-oss` model with native MXFP4 (NVIDIA collaboration); multimodal in `llama-server`; new WebUI; WebGPU in the browser; HF cache migration so `-hf` downloads land in the standard HF cache. - Rich ecosystem: many language bindings (Python, Go, Node, Rust, C#/.NET, Java, Swift, etc.), UIs (Ollama, LM Studio, Jan, KoboldCpp, GPT4All, llamafile, etc.), infra (Paddler, GPUStack, llama-swap), and an XCFramework precompiled lib for iOS/macOS Swift projects. - Governance: contributors open PRs; collaborators invited by contribution; maintainers merge to `master`. Bundled third-party single-header deps: cpp-httplib, stb-image, nlohmann/json, miniaudio, subprocess.h. 
## Relevant Concepts - [[concepts/build-and-backends]] — README enumerates the full backend matrix and target devices. - [[concepts/gguf-format]] — required model format; conversion via `convert_*.py`. - [[concepts/quantization]] — 1.5–8 bit integer quantization is a headline feature. - [[entities/binary-llama-cli]] — primary CLI tool, usage examples. - [[concepts/server-api]] — OpenAI-compatible HTTP server. - [[entities/backend-cpu]] — plain C/C++ core, x86/ARM/RISC-V SIMD. - [[entities/backend-cuda]] — custom CUDA kernels for NVIDIA. - [[entities/backend-metal]] — Apple silicon first-class support. - [[entities/backend-vulkan]] — cross-vendor GPU backend. - [[entities/backend-rocm]] — AMD GPU via HIP. - [[concepts/build-and-backends]] — Intel GPU backend. ## Source Metadata - Type: official documentation (mirror) — project README - Repo/path: ggml-org/llama.cpp + README.md - Fetched: 2026-05-30 from master - URL: https://github.com/ggml-org/llama.cpp/blob/master/README.md --- title: "Multimodal Support Directory (libmtmd) — Architecture & History" type: summary tags: [multimodal, vision, llama-cli, developer, intermediate, well-established] created: 2026-05-30 updated: 2026-05-30 sources: ["raw/mtmd-readme.md"] confidence: high llama_build: "master (~2026-05)" --- # Multimodal Support Directory (libmtmd) — Architecture & History ## Key Points - The `tools/mtmd` directory provides multimodal capabilities for llama.cpp. It began as a showcase for running LLaVA models, but scope expanded to many vision-capable models — LLaVA is no longer the only architecture supported. - Multimodal support is a sub-project under **very heavy development**; **breaking changes are expected**. - History/timeline: - #3436: initial LLaVA 1.5 support, introducing `llava.cpp` and `clip.cpp`; `llava-cli` binary created. - #4954: MobileVLM added (2nd vision model) atop `llava.cpp`/`clip.cpp`/`llava-cli`. 
  - Expansion & fragmentation: many models added; `llava-cli` couldn't handle complex chat templates, spawning model-specific binaries `qwen2vl-cli`, `minicpmv-cli`, `gemma3-cli` — confusing proliferation.
  - #12849: `libmtmd` introduced to replace `llava.cpp` — single unified CLI, better UX/DX, audio + image input.
  - #13012: `mtmd-cli` added, consolidating the model-specific CLIs into one tool powered by `libmtmd`.
- How it works: images are encoded into embeddings by a separate model component (the projector), then fed into the language model. Multimodal components are kept distinct from core `libllama` to allow faster independent development.
- Running a multimodal model typically needs **two GGUF files**: (1) the standard language model file, (2) a **multimodal projector (`mmproj`)** file handling image encoding + projection.
- `libmtmd` is built on `clip.cpp` (like `llava.cpp` was). Advantages: unified interface, improved API inspired by Hugging Face `transformers` `Processor` class, flexibility for multiple input types (text/audio/image) while respecting varied chat templates.
- Obtaining `mmproj`: for these models use `convert_hf_to_gguf.py --mmproj` — Gemma 3 (1B variant has NO vision), SmolVLM, SmolVLM2, Pixtral 12B (only `transformers`-compatible checkpoint), Qwen 2 VL & Qwen 2.5 VL, Mistral Small 3.1 24B, InternVL 2.5 & InternVL 3 (only non-HF version; `InternVL3-*-hf` and `InternLM2Model` text model unsupported), MiniCPM-V 4.6 (needs `transformers` v5.7.0+ checkpoint).
- Older models: use per-model guides; conversion scripts live under `tools/mtmd/legacy-models`. Listed: LLaVA, MobileVLM, GLM-Edge, MiniCPM-V 2.5/2.6, MiniCPM-o 2.6, MiniCPM-V 4.0, MiniCPM-o 4.0, MiniCPM-V 4.5, IBM Granite Vision.
- Pre-quantized model list lives in `docs/multimodal.md`.

## Relevant Concepts
- [[concepts/multimodal-mtmd]] — this doc explains what mtmd/libmtmd is and its architecture.
- [[entities/binary-mtmd]] — `llama-mtmd-cli` / `mtmd-cli`, the unified binary described here.

## Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/llama.cpp + tools/mtmd/README.md
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md

---
title: "llama-quantize Tool README"
type: summary
tags: [quantization, llama-quantize, accuracy, memory, well-established, developer]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/quantize-readme.md"]
confidence: high
llama_build: "master (~2026-05)"
---

# llama-quantize Tool README

## Key Points
- `llama-quantize` takes a high-precision GGUF input (typically F32 or BF16) and converts it to a quantized GGUF. Quantization shrinks model size and can speed inference at the cost of accuracy loss, measured in Perplexity (ppl) and/or Kullback–Leibler Divergence (kld). Accuracy loss can be minimized with a suitable imatrix file.
- Invocation: `./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]`. The output filename and thread count are optional positional args; `type` (e.g. `Q4_K_M`, case-insensitive — `q4_k_m` works) is required.
- Typical workflow: `python3 convert_hf_to_gguf.py ./models/mymodel/` to produce `ggml-model-f16.gguf`, then `./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M`. Use the `COPY` type to update an old gguf filetype to the current version without re-quantizing.
- Options: `--allow-requantize` (requantize already-quantized tensors; warns of severe quality loss vs quantizing from 16/32-bit), `--leave-output-tensor` (leave `output.weight` un(re)quantized; larger but possibly higher quality), `--pure` (disable k-quant mixtures; quantize all tensors to the same type), `--imatrix ` (use a `llama-imatrix`-generated importance matrix; highly recommended), `--include-weights` / `--exclude-weights` (apply imatrix to a tensor list; mutually exclusive), `--output-tensor-type` (quant type for `output.weight`), `--token-embedding-type` (quant type for token embeddings), `--keep-split` (preserve input shards instead of merging to one file).
- Advanced options: `--tensor-type` (quantize specific tensor(s) to specific types; supports regex; repeatable), `--prune-layers ` (remove listed layers), `--override-kv ` (override model metadata in the output; repeatable, e.g. `qwen3moe.expert_used_count=int:16`).
- Regex tensor-type example (per-layer parity): `--tensor-type "\.(\d*[13579])\.attn_k=q5_k" --tensor-type "\.(\d*[02468])\.attn_q=q3_k"`. A `copy` + `--prune-layers` + `--override-kv` combination can prune and rewrite metadata without quantizing.
- Memory/disk: models are fully loaded into memory; memory and disk requirements are equal. Llama 3.1 examples — 8B: 32.1 GB original → 4.9 GB Q4_K_M; 70B: 280.9 GB → 43.1 GB; 405B: 1,625.1 GB → 249.1 GB.
- Quant-type table (Llama-3.1-8B, bits/weight & size GiB): IQ1_S 2.0042/1.87, IQ1_M 2.1460/2.01, IQ2_XXS 2.3824/2.23, IQ2_XS 2.5882/2.42, IQ2_S 2.7403/2.56, IQ2_M 2.9294/2.74, IQ3_XXS 3.2548/3.04, IQ3_XS 3.4977/3.27, IQ3_S 3.6606/3.42, IQ3_M 3.7628/3.52, IQ4_XS 4.4597/4.17, IQ4_NL 4.6818/4.38, Q2_K_S 2.9697/2.78, Q2_K 3.1593/2.95, Q3_K_S 3.6429/3.41, Q3_K_M 3.9960/3.74, Q3_K_L 4.2979/4.02, Q4_K_S 4.6672/4.36, Q4_K_M 4.8944/4.58, Q5_K_S 5.5704/5.21, Q5_K_M 5.7036/5.33, Q6_K 6.5633/6.14, Q8_0 8.5008/7.95, F16 16.0005/14.96.
- Benchmark trend (8B): smaller quants are smaller but text-generation t/s does not scale monotonically; F16 has the highest prompt-processing t/s (923.49) but lowest text-generation t/s (29.17 @128), while low-bit quants reach ~70–90 t/s text generation.
- GGUF-my-repo Hugging Face Space lets users build quants without local setup; it is synced from llama.cpp `main` every 6 hours.
- Run a quantized model: `./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"` (`-cnv` = conversation mode).

## Relevant Concepts
- [[concepts/quantization]] — core subject: quant families, bits/weight, accuracy tradeoffs
- [[concepts/imatrix]] — `--imatrix` flag and importance-matrix-guided quantization
- [[entities/binary-llama-quantize]] — this README documents that binary's CLI directly
- [[entities/binary-imatrix]] — imatrix files consumed via `--imatrix`
- [[concepts/gguf-format]] — input/output are GGUF; `--override-kv` edits GGUF metadata

## Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/llama.cpp + `tools/quantize/README.md`
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md

---
title: "llama-server: HTTP Server README"
type: summary
tags: [llama-server, server-api, api, sampling, embeddings, deployment, function-calling, well-established]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/server-readme.md"]
confidence: high
llama_build: "master (~2026-05)"
---

# llama-server: HTTP Server README

## Key Points
- `llama-server` is a fast, lightweight pure C/C++ HTTP server (built on cpp-httplib + nlohmann::json + llama.cpp) exposing LLM REST APIs and a Web UI. Features: OpenAI-compatible chat/completions/responses/embeddings, Anthropic Messages API compatibility, reranking, parallel decoding/multi-user, continuous batching, multimodal, monitoring, schema-constrained JSON, assistant-message prefilling, function calling, speculative decoding.
- Quick start (Unix): `./llama-server -m models/7B/ggml-model.gguf -c 2048`; Windows: `llama-server.exe -m models\7B\ggml-model.gguf -c 2048`. Defaults to listening on `127.0.0.1:8080`.
- Build with CMake: `cmake -B build` then `cmake --build build --config Release -t llama-server` (binary at `./build/bin/llama-server`). SSL build adds `-DLLAMA_OPENSSL=ON`.
- Docker: `docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080`; CUDA variant uses `:server-cuda` image, `--gpus all`, `--n-gpu-layers 99`.
- Key load/runtime flags: `-m/--model FNAME`, `-c/--ctx-size N` (default 0 = from model), `-n/--predict N` (default -1 = infinity), `-ngl/--n-gpu-layers N` (default auto), `-b/--batch-size` (2048), `-ub/--ubatch-size` (512), `--host` (127.0.0.1), `--port` (8080), `-np/--parallel N` (default -1 = auto slots), `-cb/--cont-batching` (continuous batching, enabled by default), `-fa/--flash-attn [on|off|auto]` (default auto), `-hf/--hf-repo /[:quant]` (HF download, default quant Q4_K_M), `-cmoe/--cpu-moe`, `-ncmoe/--n-cpu-moe N`.
- Most CLI args have matching env vars (`LLAMA_ARG_*`); CLI takes precedence over env. Boolean env vars accept `1/on/enabled` vs `0/off/disabled`.
- Default sampler order (CLI `--samplers`): `penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature`. Default sampling values: `--temp 0.80`, `--top-k 40`, `--top-p 0.95`, `--min-p 0.05`, `--repeat-penalty 1.00` (CLI) but `/completion` request default `repeat_penalty 1.1`, `--repeat-last-n 64`.
- Structured output flags: `--grammar GRAMMAR`, `--grammar-file FNAME`, `-j/--json-schema SCHEMA`, `-jf/--json-schema-file FILE`.
- Native (non-OAI) endpoints: `/health`, `/completion`, `/tokenize`, `/detokenize`, `/apply-template`, `/embedding`, `/embeddings`, `/reranking` (aliases `/rerank`, `/v1/rerank`, `/v1/reranking`), `/infill`, `/props` (GET/POST), `/slots`, `/metrics`, `/slots/{id}?action=save|restore|erase`, `/lora-adapters` (GET/POST).
- OpenAI-compatible endpoints: `/v1/models`, `/v1/completions`, `/v1/chat/completions`, `/v1/responses`, `/v1/embeddings`. Anthropic-compatible: `/v1/messages`, `/v1/messages/count_tokens`.
- `/completion` (NOT OAI-compatible — use `/v1/completions` for OAI) accepts `prompt` (string, token array, mixed, or `{prompt_string, multimodal_data:[base64]}`), and a rich set of sampling fields: `temperature`, `top_k`, `top_p`, `min_p`, `n_predict`, `n_keep`, `stop`, `grammar`, `json_schema`, `seed`, `logit_bias`, `n_probs`, `samplers`, `cache_prompt` (default true), `stream`, `dry_*`, `xtc_*`, `mirostat*`, etc.
- `/v1/chat/completions` supports `response_format` (`{"type":"json_object"}` or `{"type":"json_schema","schema":{...}}`), `chat_template_kwargs` (e.g. `{"enable_thinking": false}`), `reasoning_format`, `parse_tool_calls`, `parallel_tool_calls`. Tool/function calling requires `--jinja`. Reasoning returned via `reasoning_content` field (Deepseek-style).
- Reasoning control: `--reasoning-format {none,deepseek,deepseek-legacy}` (env `LLAMA_ARG_THINK`, default auto), `-rea/--reasoning [on|off|auto]`, `--reasoning-budget N` (-1 unrestricted). Jinja chat templating enabled by default (`--jinja/--no-jinja`).
- Embeddings: `--embedding/--embeddings` restricts server to embedding mode; `--pooling {none,mean,cls,last,rank}`; `--embd-normalize N` (default 2 = Euclidean/L2). `/v1/embeddings` requires pooling != none; native `/embeddings` supports `--pooling none` (returns per-token unnormalized).
- Reranking requires `--embedding --pooling rank` (or `--rerank`) plus a reranker model (e.g. bge-reranker-v2-m3).
- Monitoring: `--metrics` enables Prometheus `/metrics` (counters/gauges like `llamacpp:prompt_tokens_total`, `llamacpp:predicted_tokens_seconds`). `/slots` enabled by default (`--no-slots` to disable). `--props` enables POST `/props`.
- Router (multi-model) mode: launch `llama-server` with NO model; route by `"model"` JSON field (POST) or `?model=` query (GET). Sources: cache (`LLAMA_CACHE`), `--models-dir`, or `--models-preset` INI file. Flags: `--models-max N` (default 4), `--models-autoload/--no-models-autoload`. Endpoints `/models`, `/models/load`, `/models/unload`.
- Sleep-on-idle: `--sleep-idle-seconds SECONDS` (default -1 = disabled) unloads model + KV cache from RAM after inactivity; reloads on new task. `/health`, `/props`, `/models` do not reset the idle timer.
- Built-in agent tools (Web UI) via `--tools all` or comma list: `read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff, get_datetime` — "do not enable in untrusted environments".
- Auth: `--api-key KEY` (comma-separated, env `LLAMA_API_KEY`) or `--api-key-file`. `/health` is public (no key). Errors returned in OpenAI format `{"error":{"code","message","type"}}`.

## Relevant Concepts
- [[concepts/server-api]] — this README is the primary reference for the binary, its CLI flags, and operation.
- [[concepts/server-api]] — documents the full native + OpenAI-compatible + Anthropic endpoint surface and request/response JSON.
- [[concepts/sampling-parameters]] — defines all sampler CLI flags / request fields, defaults, and sampler ordering.
- [[concepts/gbnf-grammars]] — `--grammar`/`--json-schema` flags and `grammar`/`json_schema`/`response_format` request fields for constrained output.
- [[concepts/function-calling]] — tool/function calling via `--jinja`, `tools`, `parallel_tool_calls`.
- [[concepts/kv-cache-and-context]] — `-c/--ctx-size`, prompt caching (`cache_prompt`, `--cache-reuse`), slot save/restore, context shift, KV cache type flags (`-ctk/-ctv`).
- [[concepts/embeddings]] — `--embedding`, `--pooling`, `--embd-normalize`, `/embedding` and `/v1/embeddings`.
- [[concepts/multimodal-mtmd]] — `--mmproj`, multimodal data in prompts / `image_url`, `modalities` in `/props`.
- [[entities/binary-llama-server]] — the binary documented end-to-end here.

## Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/llama.cpp + tools/server/README.md
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

---
title: "Customization & Tuning — The llama.cpp Knobs"
type: synthesis
tags: [sampling, kv-cache, grammars, speculative-decoding, performance, synthesis, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/cli-readme.md","raw/server-readme.md","raw/grammars-readme.md"]
confidence: high
llama_build: "master (~2026-05)"
---

# Customization & Tuning — The llama.cpp Knobs

This page is a cross-cutting map of (almost) everything you can customize or tune in llama.cpp, organized so that a user — or a video walkthrough — can see all the levers in one place. Rather than re-deriving each subsystem, it indexes the dedicated concept pages and groups their knobs into six families: output quality/style, structured output, context & memory, speed/throughput, hardware/backend, and behavior/serving. Where a knob lives in more than one binary, the table notes where it is set: on [[entities/binary-llama-cli|llama-cli]] (`cli`), on [[entities/binary-llama-server|llama-server]] (`server`), or at build time (`build`).

## Comparison

### The knobs, by family

| Lever | What it controls | Key flags / fields | Where set |
|---|---|---|---|
| **Sampling chain** ([[concepts/sampling-parameters]]) | Randomness, diversity, repetition of the chosen tokens | `--temp`, `--top-k`, `--top-p`, `--min-p`, `--typical`, `--top-n-sigma`, repeat/`--presence-penalty`/`--frequency-penalty`, `--dry-multiplier` (DRY), `--xtc-probability` (XTC), `--mirostat`, `--samplers` / `--sampling-seq` order, `-s`/`--seed` | cli + server |
| **Grammars / schema** ([[concepts/gbnf-grammars]]) | Guarantees output *form* (valid JSON, enums, notations) | `--grammar`, `--grammar-file`, `-j`/`--json-schema`; request `grammar`, `json_schema`, `response_format` | cli + server |
| **Function calling** ([[concepts/function-calling]]) | Structured `tool_calls` from a `tools` array | `--jinja` (required), `tools`, `parallel_tool_calls`, `--chat-template-file` | server |
| **Context & KV cache** ([[concepts/kv-cache-and-context]]) | How much the model attends to, and the memory cost | `-c`/`--ctx-size`, `-ctk`/`-ctv` (`q8_0`/`q4_0`...), `--rope-scaling`/`--rope-scale` + `--yarn-*`, `--cache-prompt`+`--cache-reuse`, `--context-shift` | cli + server |
| **Speculative decoding** ([[concepts/speculative-decoding]]) | Faster generation via a draft model or n-gram | `-md`/`--spec-draft-model`, `--spec-type {...}`, `--spec-draft-n-max`, `--spec-ngram-*`, `--spec-default` | cli + server |
| **Offload & batching** | Throughput / latency on a given device | `-ngl`/`--n-gpu-layers`, `-fa`/`--flash-attn`, `-np`/`--parallel` + `-cb` continuous batching, `-b`/`-ub` batch sizes | cli + server |
| **Quantization** ([[concepts/quantization]]) | Model size, speed, and accuracy floor | quant tag (`:Q4_K_M`, `:Q8_0`...) at model-pick time | model choice |
| **Hardware / backend** ([[concepts/build-and-backends]]) | Which processor runs inference, how the model is split | build flags (`-DGGML_CUDA=ON`...), `-dev`/`--device`, `-sm`/`--split-mode`, `-ts`/`--tensor-split`, `-cmoe`/`--cpu-moe` | build + runtime |
| **Behavior / serving** ([[concepts/server-api]]) | Prompt format, reasoning, exposure | `--chat-template`/`--jinja`, `-rea`/`--reasoning` + `--reasoning-budget`, system prompt, `--host`/`--port`/`--api-key` | cli + server |

### 1. Output quality/style — the sampling chain

The default sampler chain (`--samplers`) is `penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature` (short form `edskypmxt` via `--sampling-seq`). Order matters — moving `temperature` changes the result. Defaults of note: `--temp 0.80`, `--top-k 40`, `--top-p 0.95`, `--min-p 0.05`. DRY, XTC, and mirostat are off by default. See [[concepts/sampling-parameters]].

### 2. Structured / constrained output

[[concepts/gbnf-grammars]] constrains *which tokens are allowed* (`--grammar`, `--json-schema`, or the server `response_format`), and [[concepts/function-calling]] builds on that to emit tool calls — but only with `--jinja` enabled. The grammar/schema is not injected into the prompt (the tool schema is the exception).

### 3. Context & memory

`-c`/`--ctx-size` sets the window; `-ctk`/`-ctv` set the KV cache data type (`f16` default; `q8_0`/`q4_0` shrink memory at a quality cost). RoPE/YaRN flags (`--rope-scaling`, `--yarn-*`) push context past the trained length. `--cache-prompt`/`--cache-reuse` reuse shared prefixes; `--context-shift` (off by default) slides the window when full. See [[concepts/kv-cache-and-context]].

### 4. Speed / throughput

[[concepts/speculative-decoding]] (draft models via `-md`, or draft-free n-gram/MTP/Eagle3 via `--spec-type`) speeds generation when acceptance is high.
Orthogonal speed levers: `-ngl` GPU offload, `-fa` flash attention, `-np` parallel slots with `-cb` continuous batching, and `-b`/`-ub` batch sizes. Quant choice ([[concepts/quantization]]) trades accuracy for size and speed.

### 5. Hardware / backend

Backends are chosen at build time (`-DGGML_CUDA=ON`, `-DGGML_VULKAN=ON`, etc. — see [[concepts/build-and-backends]]) and selected/split at runtime with `-dev`/`--device`, `-sm`/`--split-mode`, `-ts`/`--tensor-split`, and `-cmoe`/`--cpu-moe` for keeping MoE expert tensors on the CPU.

### 6. Behavior / serving

Prompt formatting via `--chat-template`/`--jinja`; reasoning via `-rea`/`--reasoning` and `--reasoning-budget`; plus the system prompt and the server-exposure flags ([[concepts/server-api]]).

## Analysis

Most of these knobs are not independent — tuning one often pushes against another:

- **KV-cache quant vs. quality.** `-ctk`/`-ctv q8_0` (and especially `q4_0`) buys context length and lets you raise `-np`, but it costs accuracy — and the cost is concentrated in precision-sensitive tasks like [[concepts/function-calling|tool calling]]. Shrink the cache before cutting slots, but stop at `q8_0` for quality-sensitive work.
- **Sampling diversity vs. determinism.** Raising `--temp`, enabling XTC, or loosening `--top-p`/`--min-p` increases variety but undermines reproducibility. For repeatable output, lower `--temp`, tighten the cutoffs, and pin `-s`/`--seed`. Remember the **CLI vs. server default mismatch**: the server's `/completion` defaults `repeat_penalty` to `1.1` while the CLI default is `1.00` (off) — set it explicitly if you need parity.
- **Speculative decoding needs a good draft.** The speedup only materializes when the target accepts most drafted tokens; a poorly matched draft model *adds* overhead. The `--spec-*` flag surface is also fast-moving (the legacy `--draft*` flags were removed), so verify against your build.
- **Grammars constrain form, not meaning.** A schema guarantees valid JSON but cannot make the content correct, and pathological patterns (`x? x? x?...`) are slow — prefer bounded `x{0,N}`.
- **Throughput levers compete for the same VRAM.** `-c`, `-np`, batch sizes, and `-ngl` all draw on the same memory budget; raising one may force another down. Flash attention (`-fa`) and continuous batching (`-cb`) are mostly free wins that ease this pressure.
- **Backend/offload is upstream of everything.** Build-time backend choice and `-ngl` determine whether the GPU-side knobs (KV offload, flash attention) even apply.

## Recommendations

**Sane starting points.** Begin from the defaults and change deliberately:

- *Sampling:* keep defaults (`--temp 0.80`, `--top-k 40`, `--top-p 0.95`, `--min-p 0.05`). For factual/structured work, drop `--temp` to ~0.2–0.4 and set a fixed `--seed`. Reach for DRY or the repetition penalties only if you observe looping.
- *Context:* size `-c` to your real prompts, not the maximum. Try KV cache at `q8_0` only when memory-bound. Only use RoPE/YaRN when you genuinely exceed the trained context.
- *Speed:* `-ngl 99`, `-fa auto`, `-cb` on; add `-np` to match real concurrency. Add speculative decoding only after measuring acceptance with a candidate draft model.
- *Structured output:* prefer the JSON-Schema path (`-j` / `json_schema` / `response_format`) over hand-written GBNF; add `--jinja` for tool calling.

**Method.** Change **one thing at a time and measure.** Use `llama-bench` for throughput/latency deltas and `llama-perplexity` for quality regressions (e.g. before/after a KV-cache or quant change). A knob that "feels" better without a measurement is how tuning sessions go in circles.
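
The starting points in the Recommendations can be collected into a single launch command. The sketch below assembles a `llama-server` argument list from the flags named on this page (`-m`, `-c`, `-ngl`, `-fa`, `-np`, `-ctk`, `-ctv`). The function name `server_args` and the example model path are hypothetical, and flag spellings should be checked against your build, since the flag surface changes between builds.

```python
import shlex


def server_args(model_path, ctx_size, parallel=1, kv_quant=None):
    """Assemble a llama-server argument list (hypothetical helper)."""
    args = [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_size),    # sized to real prompts, not the model maximum
        "-ngl", "99",           # offload as many layers as fit
        "-fa", "auto",          # flash attention: on, off, or auto
        "-np", str(parallel),   # slots matched to real concurrency
    ]
    if kv_quant is not None:
        # Only when memory-bound; the page suggests stopping at q8_0
        # for quality-sensitive work.
        args += ["-ctk", kv_quant, "-ctv", kv_quant]
    return args


# Example: a memory-bound setup with two slots and a q8_0 KV cache.
command = shlex.join(server_args("model.gguf", 8192, parallel=2, kv_quant="q8_0"))
print(command)
```

Keeping the arguments as a list makes it easy to change one flag per run and pass the same list to `subprocess.run` or to a benchmark wrapper, in line with the one-change-at-a-time method above.
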

## Pages Compared
- [[concepts/sampling-parameters]]
- [[concepts/gbnf-grammars]]
- [[concepts/function-calling]]
- [[concepts/kv-cache-and-context]]
- [[concepts/speculative-decoding]]
- [[concepts/build-and-backends]]
- [[concepts/server-api]]
- [[concepts/quantization]]
- [[entities/binary-llama-cli]]
- [[entities/binary-llama-server]]

---
title: "llama.cpp vs Ollama"
type: synthesis
tags: [comparison, synthesis, deployment, vs-ollama, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/llamacpp-readme.md"]
confidence: medium
llama_build: "n/a (community data 2025; see sources)"
---

# llama.cpp vs Ollama

The single most important thing to get right here is the **relationship**: Ollama is **not** a competing inference engine. It is a user-friendly wrapper **built on [[entities/project-llama-cpp|llama.cpp]]/ggml** — a Go process that embeds and calls llama.cpp (on Apple Silicon, recent Ollama versions can also use MLX). Per the llama.cpp README, llama.cpp is the underlying C/C++ inference engine, and Ollama, LM Studio, Jan, KoboldCpp, GPT4All, and llamafile are **downstream apps** that build on it. So this is a comparison of a **wrapper vs the engine it wraps**, not of two rival engines.

## Comparison

| Dimension | Ollama | llama.cpp (raw) |
|---|---|---|
| What it is | UX wrapper built on llama.cpp/ggml | The underlying C/C++ inference engine |
| Inference code | Shared (llama.cpp/ggml; MLX on some Apple builds) | Native |
| Single-user throughput | Essentially the same as llama.cpp | Essentially the same as Ollama |
| Honest steady-state gap | ~2–8%, from config/offload defaults | ~2–8%, from config/offload defaults |
| Onboarding | Easiest: `ollama run`, auto model mgmt, registry, background server | More setup; full manual control |
| Control granularity | Less granular | Full: quant, `-ngl` offload, KV-cache type, sampler chain, grammars, speculative decoding |
| New-model support | Lags llama.cpp | First to run newest models/features |
| Footprint | Larger (bundled server + mgmt) | Smaller |
| Server API | Its own API (over llama.cpp) | OpenAI-compatible [[concepts/server-api\|server API]] via [[entities/binary-llama-server\|llama-server]] |

## Analysis

Because Ollama and llama.cpp **share the same inference code**, at steady state **single-user throughput is essentially the same**. Community consensus puts honest gaps at roughly **2–8%**, and those gaps come from **configuration and offload defaults** (how many layers go to GPU, KV-cache settings, etc.) — **not** from one having a fundamentally faster engine. Be skeptical of blog claims of large multipliers: some cite figures like **1.8x**, but that is an **outlier config artifact**, not a real engine-level difference, and should not be taken as a generalizable result.

**Nuance (as of 2026):** Ollama has begun building its *own* model-loading/inference engine for some newer architectures, rather than routing every model through llama.cpp. But that engine is still built on **[[entities/ggml|ggml]]** — the same C tensor library that underpins llama.cpp — and the large majority of models still run through the llama.cpp path.
So "Ollama is built on llama.cpp/ggml" remains accurate; what is now slightly overstated is "Ollama is *only* a thin shell over llama.cpp with nothing of its own." The shared **ggml** foundation is why steady-state speed stays close either way.

The **real** differences are **UX, packaging, and control** — not raw speed:

- **Ollama** gives one-command model pulls (`ollama run`), automatic model management, a model registry, a background server, and sensible defaults — the **easiest onboarding**. The tradeoff is **less granular control**.
- **Raw [[entities/project-llama-cpp|llama.cpp]]** gives **full control over every flag**: [[concepts/quantization|quantization]] choice, `-ngl` GPU offload, KV-cache type, sampler chain, grammars, and speculative decoding. It has a **smaller footprint**, runs the **newest models and features first** (Ollama tends to lag llama.cpp on new model support), and exposes its own OpenAI-compatible [[concepts/server-api|server API]] via [[entities/binary-llama-server|llama-server]].

**Honesty / confidence note (confidence: medium):** No dedicated Ollama-vs-llama.cpp source was mirrored into this KB. The **architecture relationship** (Ollama wraps llama.cpp/ggml) is well-established fact drawn from the llama.cpp README, but the specific **tok/s gaps (~2–8%)** rest on **community consensus**, not on a benchmarked source held here. Re-verify any performance numbers on your **target hardware**, and consider **adding a dedicated Ollama-vs-llama.cpp benchmark source** to this KB later.

## Recommendations
- **Choose Ollama** for the **easiest setup** and **casual local use** — one-command pulls, automatic model management, and a background server with sensible defaults.
- **Choose raw [[entities/project-llama-cpp|llama.cpp]]** when you need **control** (flags, [[concepts/quantization|quant]], KV-cache, sampler chain, grammars, speculative decoding), **newest-model support**, the **smallest footprint**, or to **embed the server** ([[entities/binary-llama-server|llama-server]] / [[concepts/server-api|server API]]).
- Remember that **either way you are running llama.cpp under the hood** — so this is a choice about convenience vs control, not about a faster engine. Expect only a **~2–8%** steady-state difference, dominated by config defaults.
- For the genuinely different-architecture comparison (batched GPU serving), see [[syntheses/llamacpp-vs-vllm]].

## Pages Compared
- [[entities/project-llama-cpp]]
- [[entities/binary-llama-server]]
- [[concepts/quantization]]
- See also: [[syntheses/llamacpp-vs-vllm]]

---
title: "llama.cpp vs vLLM"
type: synthesis
tags: [comparison, synthesis, performance, deployment, vs-vllm, advanced]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-redhat-vllm-vs-llamacpp.md","raw/community/community-gh15180-vllm-vs-llamacpp.md"]
confidence: medium
llama_build: "n/a (community data 2025; see sources)"
---

# llama.cpp vs vLLM

These two projects are often benchmarked head-to-head, but the comparison is **not apples-to-apples**. [[entities/project-llama-cpp|llama.cpp]] is a single-user-oriented, quantized ([[concepts/quantization|GGUF]]) inference engine that is portable across CPU, Apple Silicon, and unusual hardware. vLLM is a batched, multi-user GPU serving engine (PagedAttention + continuous batching) that shines at high concurrency but expects full-precision/AWQ/GPTQ weights on capable GPUs. Lead with that framing: they optimize for different deployment shapes.

## Comparison

| Dimension | llama.cpp | vLLM |
|---|---|---|
| Primary target | Single / few users, local & edge | Many concurrent users, GPU serving |
| Core technique | Quantized GGUF inference, CPU/GPU offload | PagedAttention + continuous batching |
| Weights | Quantized GGUF (also full precision) | Full precision / AWQ / GPTQ |
| Hardware | CPU, Apple Silicon, mixed/odd hardware, GPU | Capable GPUs |
| Portability | Very high (see [[concepts/build-and-backends]]) | GPU-centric |
| Single-request latency | Within a few % of vLLM (GH#15180) | Within a few % of llama.cpp (GH#15180) |
| High-concurrency throughput | Historically weaker; closing gap with paged attention | Very strong; >35x req throughput at peak (Red Hat) |
| P99 TTFT under load | Rises exponentially under concurrency (Red Hat) | Stays flat under concurrency (Red Hat) |
| Inter-token latency | Low / stable (Red Hat) | — |

## Analysis

The two most relevant data points in this KB tell a consistent story once you separate **single-stream latency** from **concurrency-at-scale aggregate throughput**.

The Red Hat benchmark (2025-09-30; single H200, Llama-3.1-8B, GuideLLM, 1–64 concurrent users) reports that at **peak load** vLLM delivered **>35x request throughput** and **>44x output tokens/s** versus llama.cpp, with vLLM's P99 TTFT staying flat while llama.cpp's rose exponentially as concurrency climbed. Crucially, those 35–44x figures are **concurrency-at-scale aggregate throughput** across many simultaneous users — they do **not** mean "vLLM is 40x faster per request." The Red Hat report also notes llama.cpp kept low, stable inter-token latency. Treat the magnitudes with caution: this is **vendor data** (Red Hat backs vLLM), so the framing favors vLLM's strengths.

The GH#15180 thread (2025-08-08; single RTX 4090, Qwen2.5-3B) is a **fair, apples-to-apples** test by llama.cpp contributor JohannesGaessler. For a **single request**, llama.cpp took **93.6–100.2% of vLLM's time** — i.e.
mostly **3–6% faster**. At **16 parallel** requests, llama.cpp took **99.2–125.6% of vLLM's time** (vLLM up to ~25% faster, with the gap shrinking at deep context). Later in the same thread (Apr 2026), llama.cpp's **paged-attention** work scaled to **247 concurrent sequences** (vLLM-like), roughly **2.5x aggregate throughput** versus before, landing **within ~3%** of unified at equal concurrency. Reconciling the two: **single-stream, the engines are within a few percent of each other.** vLLM's large wins are specifically in **many-concurrent-user serving**, and llama.cpp is actively **closing that concurrency gap** with paged attention. Staleness caveat: both data points are roughly 8–10 months old and both projects move fast, so re-verify on current builds and your own hardware. ## Recommendations - **Choose vLLM** when you are serving **many simultaneous users at high concurrency** on **capable GPUs** with full-precision/AWQ/GPTQ weights, and aggregate throughput plus flat TTFT under load matter most. - **Choose [[entities/project-llama-cpp|llama.cpp]]** for **single or few users**, **quantized models**, **CPU / Apple Silicon / mixed hardware**, maximum **portability**, or when you need **GGUF / [[concepts/quantization|quantization]] flexibility**. Its [[entities/binary-llama-server|server]] exposes an OpenAI-compatible [[concepts/server-api|API]]. - Note that llama.cpp's **paged attention is narrowing the concurrency gap**, so the historical "vLLM for concurrency" rule is weakening — benchmark current builds before committing. - For the easier-onboarding wrapper question, see [[syntheses/llamacpp-vs-ollama]]. ## Pages Compared - [[summaries/community-redhat-vllm-vs-llamacpp]] - [[summaries/community-gh15180-vllm-vs-llamacpp]] - [[entities/project-llama-cpp]] - See also: [[syntheses/llamacpp-vs-ollama]] --- title: "Quant Types Compared — Which GGUF Quant Should You Pick?" 
type: synthesis
tags: [quantization, imatrix, accuracy, comparison, synthesis, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-pr1684-kquants.md","raw/community/community-artefact2-quant-table.md","raw/community/community-arxiv-quant-eval.md","raw/community/community-kaitchup-gguf-guide.md","raw/community/community-bartowski-quant-guide.md","raw/community/community-mradermacher-imatrix.md","raw/community/community-unsloth-dynamic-ggufs.md"]
confidence: medium
llama_build: "n/a (community data 2023-2026; see sources)"
---

# Quant Types Compared — Which GGUF Quant Should You Pick?

Picking a [[concepts/quantization]] level for a [[concepts/gguf-format]] model is a trade between file size, speed, and accuracy. This page consolidates community measurements into one decision-oriented view: what each quant costs in bits, how much quality it gives up versus F16, and when to reach for it.

## Comparison

The table below merges bits-per-weight figures from PR#1684 with the canonical Artefact2 KL-divergence table (Mistral-7B, Feb 2024). Lower KL-median means closer to the full-precision model. "~bpw" is approximate and varies slightly by architecture.
| Quant | ~bpw | KL-median (Artefact2) | Quality vs F16 | Use when |
|-------|------|-----------------------|----------------|----------|
| F16 | 16 | (reference) | Reference (lossless) | Baseline / you have the VRAM to spare |
| Q8_0 | 8.5 | not listed | Effectively lossless | You want maximum fidelity below F16 |
| Q6_K | 6.56 (PR#1684) / 6.57 (Artefact2) | 0.0032 | ≈ lossless, within ~0.1% PPL of F16 | High-quality target; imatrix barely helps here |
| Q5_K_M | 5.5 (PR#1684) / 5.67 (Artefact2) | 0.0043 | Very close to F16 | Quality-leaning pick when Q6 won't fit |
| Q4_K_M | 4.5 (PR#1684) / 4.83 (Artefact2) | 0.0075 | Slight, usually acceptable loss; avg bench 69.15 vs F16 69.47 on Llama-3.1-8B (arXiv) | **Default sweet spot for most people most of the time** |
| IQ4_XS | 4.32 (Artefact2) | 0.0088 | Near Q4_K_M at fewer bits | Sub-Q4_K_M but still wanting good accuracy; needs imatrix |
| Q3_K_M | 3.89 (Artefact2) | 0.0171 | Noticeable loss | Tight on memory; expect quality to start slipping |
| IQ3_S | 3.52 (Artefact2) | 0.0205 | Beats Q3_K_S at similar size | ~3-bit range; prefer over K-quant here (with imatrix) |
| Q2_K | 3.00 (Artefact2) | 0.0588 | Quality has collapsed (arXiv already measures Q3_K_S at avg 65.49 / ppl 8.96) | Last resort; large models only |
| IQ2_M | 2.76 (Artefact2) | 0.0702 | Beats Q2_K_S (2.79 bpw, KL 0.0829) | Squeezing a big model into small VRAM; needs imatrix |
| IQ1_S | 1.78 (Artefact2) | 0.5495 | Severe degradation | Extreme size constraints only |

Bits/weight from PR#1684: Q2_K 2.56, Q3_K 3.44, Q4_K 4.5, Q5_K 5.5, Q6_K 6.56, Q8_0 8.5, F16 16. KL-median drops roughly 170x from IQ1_S to Q6_K (Artefact2).

**Size on disk** (Llama-3.1-8B-Instruct vs F16, arXiv): Q3_K_S -77%, Q4_K_M -69%, Q5_K_M -64%, Q6_K -59%, Q8_0 -47%. Absolute sizes (MiB): F16 15317, Q4_K_M 4685, Q6_K 6283, Q8_0 8138.
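
The size and KL figures quoted here can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it recomputes the size reduction versus F16 from the absolute MiB values and the IQ1_S-to-Q6_K KL-median ratio. The function name `reduction_pct` and the dictionary layout are assumptions made for this example, not something taken from any of the sources.

```python
# Absolute sizes in MiB for Llama-3.1-8B-Instruct, as quoted on this page (arXiv).
SIZES_MIB = {"F16": 15317, "Q4_K_M": 4685, "Q6_K": 6283, "Q8_0": 8138}

# KL-median values quoted on this page from the Artefact2 table (Mistral-7B).
KL_MEDIAN = {"IQ1_S": 0.5495, "Q6_K": 0.0032}


def reduction_pct(quant: str, sizes: dict = SIZES_MIB) -> float:
    """Size change of `quant` relative to F16 in percent (negative = smaller)."""
    return (sizes[quant] / sizes["F16"] - 1.0) * 100.0


# Recompute the quoted size reductions from the absolute sizes.
for name in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(f"{name}: {reduction_pct(name):.0f}%")

# Ratio behind the "roughly 170x" KL-median statement.
kl_ratio = KL_MEDIAN["IQ1_S"] / KL_MEDIAN["Q6_K"]
print(f"IQ1_S / Q6_K KL-median ratio: {kl_ratio:.0f}x")
```

The recomputed reductions round to the same -69%, -59%, and -47% quoted for Q4_K_M, Q6_K, and Q8_0, which suggests the percentage and MiB figures come from the same measurements.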
## Analysis

**Headline rules of thumb (cross-source consensus):**

- **Q6_K is effectively lossless.** PR#1684 reports LLaMA-7B Q6_K perplexity 5.9110 vs F16 5.9066 (+0.074%); Artefact2 shows Q6_K ln-PPL ≈ -0.0008; arXiv shows Llama-3.1-8B Q6_K ppl 7.35 vs F16 7.32. Three independent sources, same conclusion.
- **Q4_K_M is the default.** bartowski calls it the "recommended default," Kaitchup the "most downloaded 4-bit default," and arXiv puts its average benchmark at 69.15 vs F16 69.47 — within ~0.3 points.
- **Below ~Q3, quality collapses.** arXiv measures Q3_K_S at avg 65.49 / ppl 8.96, a sharp drop from the 4-bit tier.
- **Sub-Q4 means I-quants WITH a good [[concepts/imatrix]].** bartowski, mradermacher, and Kaitchup all agree. An IQ-quant generally beats a similarly-sized K-quant: Artefact2 shows IQ2_M (2.76 bpw, KL 0.0702) beating Q2_K_S (2.79 bpw, KL 0.0829), and IQ3_S beating Q3_K_S.
- **Fit-to-hardware:** pick the largest quant that fits VRAM for max speed, or VRAM+RAM for max quality (bartowski, Artefact2). Artefact2 adds a useful tie-breaker: if Q4_K_S fits comfortably, prefer a **bigger model** over more bits.
- **imatrix benefit shrinks toward Q6.** mradermacher notes "i1-Q6_K practically like static Q6_K"; the biggest imatrix gains are below Q5_K_M.

**Quant families** (Kaitchup, PR#1684): *Legacy* (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0) exist for compatibility only now. *K-quants* (Q2_K..Q6_K) use super-blocks with selective per-tensor precision and come in `_S`/`_M`/`_L` mixes. *I-quants* (IQ1..IQ4) are codebook-based over 256-weight super-blocks and **require an imatrix**; IQ4_NL is a 32-weight non-linear variant. A `_L`/`_XL` suffix keeps embedding and output tensors at Q8_0 (bartowski).

**Caveats and conflicts — read these before trusting any single number:**

- **Perplexity is not comparable across sources.** PR#1684 (LLaMA-1-7B) sits on a ~5.9 scale; arXiv (Llama-3.1-8B) on a ~7.3 scale.
Only the *relative ordering* transfers between sources, never the absolute value.
- **Unsloth is a vendor source.** Its "Dynamic 2.0" KLD/MMLU wins (e.g. Gemma-3-27B Q4_K_XL MMLU 71.47 vs Google QAT 70.64) are self-reported, and it disputes perplexity in favor of its own KLD framing. Treat these as vendor claims, not neutral measurements.
- **Coverage gaps everywhere.** Artefact2 is 2024/Mistral-7B and lacks the newest IQ types; arXiv is a single-author preprint on one model with no IQ or KL data. No single source has everything — triangulate.
- **Measure your own model.** Perplexity (ppl) and KL-divergence (kld) are the metrics to track, and a good [[concepts/imatrix]] lowers both.

## Recommendations

1. **Most users, most of the time:** Q4_K_M. Best balance of size, speed, and accuracy; the consensus default.
2. **Quality-first (and it fits):** Q5_K_M or Q6_K. Q6_K is the practical ceiling — beyond it you're paying a lot of bits for nearly nothing.
3. **Tight on memory, below Q4:** use I-quants (IQ4_XS, IQ3_S, IQ2_M) built WITH an imatrix via [[entities/binary-imatrix]], and run them on CUDA/ROCm. I-quants work on CPU but are slower there (bartowski).
4. **Sizing rule (bartowski):** for max speed, pick a quant 1–2 GB smaller than your VRAM; for max quality, 1–2 GB smaller than VRAM+RAM. K-quant by default; reach for I-quant only when going below Q4 and on a CUDA/ROCm GPU.
5. **Bigger model vs more bits:** if Q4_K_S of a larger model fits comfortably, prefer it over a higher-bit quant of a smaller model (Artefact2).
6. **Don't go below ~Q3 unless forced** — quality collapses, and only large models tolerate it.
7. **Build quants with** [[entities/binary-llama-quantize]], supply an imatrix for any sub-Q5 target, and validate with ppl + kld on your own data rather than trusting cross-source tables blindly.

## Pages Compared

- [[summaries/community-pr1684-kquants]] — bits/weight definitions and the original LLaMA-7B Q6_K ≈ F16 result.
- [[summaries/community-artefact2-quant-table]] — the canonical KL-divergence table (Mistral-7B, Feb 2024).
- [[summaries/community-arxiv-quant-eval]] — Llama-3.1-8B benchmark/perplexity/size-reduction preprint.
- [[summaries/community-kaitchup-gguf-guide]] — quant family taxonomy and 4-bit default guidance.
- [[summaries/community-bartowski-quant-guide]] — sizing rules and K-quant vs I-quant selection.
- [[summaries/community-mradermacher-imatrix]] — where imatrix helps and where it stops mattering.
- [[summaries/community-unsloth-dynamic-ggufs]] — vendor "Dynamic 2.0" claims (read critically).
- [[concepts/quantization]] — underlying concept reference.

---
title: "Deploying llama-server as an OpenAI-Compatible API"
type: synthesis
tags: [deployment, server-api, api, comparison, synthesis, intermediate]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/community/community-hf-gguf-usage.md","raw/community/community-steelphoenix-guide.md","raw/server-readme.md"]
confidence: medium
llama_build: "master (~2026-05) + community guides 2024-2026"
---

# Deploying llama-server as an OpenAI-Compatible API

[[entities/binary-llama-server|llama-server]] is a single self-contained C/C++ binary that exposes llama.cpp inference over HTTP, serving OpenAI- and Anthropic-compatible endpoints alongside llama.cpp's own native ones (see [[concepts/server-api]]). Because the OpenAI surface is a drop-in replacement, existing OpenAI SDKs can talk to a local model by changing only the `base_url`. This page compares the three main ways to get it running — a one-command pull from Hugging Face, a build-from-source deployment, and a containerized/cloud deployment — and then covers how to secure and size it for real use.
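
As a minimal sketch of the drop-in idea, assuming a server is already listening on `http://127.0.0.1:8080`, the following uses only the Python standard library to build the kind of request an OpenAI SDK would send. The helper names `build_chat_request` and `chat` are hypothetical, the `no-key` token is a placeholder, and the response shape read in `chat` is assumed to be the standard OpenAI chat-completions one.

```python
import json
import urllib.request

# Assumed local address; adjust if the server was started with --host/--port.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       api_key: str = "no-key") -> urllib.request.Request:
    """Build an OpenAI-style chat-completions POST request (hypothetical helper)."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI chat-completions response layout.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` against a running server would return the model's reply. Pointing the same code at a different host only requires passing another `base_url`, which is the property the OpenAI-compatible surface is meant to provide.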
## Comparison

### Deployment paths at a glance

| | (A) `-hf` one-command | (B) Build-from-source + manual model | (C) Docker / cloud |
|---|---|---|---|
| Get the binary | Prebuilt install (or any build) | `cmake` build yourself | `ghcr.io/ggml-org/llama.cpp` image |
| Get the model | Auto-download from HF, cached | You supply the `.gguf` | Mount a models volume (or `-hf`) |
| Best for | Fast local try-out, dev | Custom backends, latest master, tuned hardware | Servers, reproducible/portable deploys |
| GPU | Whatever the install supports | You pick backend flags at build time | `:server-cuda` image + `--gpus all` |
| Effort | Lowest | Highest | Medium |

### (A) One-command run from Hugging Face

```sh
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

The `:Q8_0` tag selects the quant; with no tag the default is `Q4_K_M`. The model is auto-downloaded and cached (cache location set by the `LLAMA_CACHE` environment variable). The `-no-cnv` flag (raw completion rather than chat/conversation mode) applies to `llama-cli`; with `llama-server` the mode follows the endpoint you call, for example `/v1/chat/completions` for chat. See [[concepts/gguf-format]] and [[concepts/quantization]] for what the quant tag means.

The server defaults to `http://127.0.0.1:8080`. Test the OpenAI-compatible endpoint as a drop-in:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```

Then point any OpenAI SDK at `base_url = http://127.0.0.1:8080/v1`.

### (B) Build-from-source + manual model

From the SteelPh0enix guide (verify flags against the current README — see Staleness):

```sh
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
# add a backend, e.g. -DGGML_CUDA=ON or -DGGML_VULKAN=ON
cmake --build build --config Release -j
```

`llama-server` is built by default now. See [[concepts/build-and-backends]] for the full backend matrix.
Then launch against a local file:

```sh
llama-server -m /models/model.gguf -c 4096 -ngl 99
```

### (C) Docker / cloud

Images: `ghcr.io/ggml-org/llama.cpp:server` (CPU) and `:server-cuda` (CUDA).

```sh
docker run -p 8080:8080 -v /models:/models --gpus all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/model.gguf -c 4096 --host 0.0.0.0 --port 8080 -ngl 99
```

Note that inside a container you generally want `--host 0.0.0.0` so the port maps out — which is exactly the case that demands authentication (see Recommendations).

### Key production flags

Drawn from [[concepts/server-api]] and [[entities/binary-llama-server]]:

| Flag | Purpose |
|---|---|
| `-c` / `--ctx-size` | Context window size ([[concepts/kv-cache-and-context]]) |
| `-ngl` / `--n-gpu-layers` | GPU layer offload (`auto` / `all`) |
| `-np` / `--parallel` | Number of concurrent slots |
| `-cb` / `--cont-batching` | Continuous batching (on by default) |
| `-fa` / `--flash-attn` | Flash attention |
| `--host 0.0.0.0` / `--port` | Bind address (exposes to network) |
| `--api-key` / `--api-key-file` | Auth (env `LLAMA_API_KEY`) |
| `--ssl-key-file` / `--ssl-cert-file` | Enable HTTPS |
| `--metrics` | Prometheus metrics (off by default) |
| `--jinja` | Chat templates / tool calling ([[concepts/function-calling]]) |
| `--slot-save-path` + `--cache-prompt` | Prompt caching / KV state |

### Router / multi-model mode

Launch with **no** `-m` to enter router mode and route requests by the `"model"` JSON field; configure with `--models-dir` and `--models-max`. See [[concepts/server-api]] for the full router options.

## Analysis

The three paths are not mutually exclusive — they share the same binary and the same flag surface; they differ only in how you obtain the binary and the model. The decision is mostly about control vs. convenience:

- **(A) `-hf`** is the lowest-friction route and is ideal for local development and quick evaluation.
Its weakness is that you inherit whatever backend the prebuilt install was compiled with, and the auto-download means your model choice is implicit in a flag rather than a managed artifact.
- **(B) Build-from-source** is the only path that lets you pick exactly which compute backend(s) to compile in and to track the fast-moving master branch. The cost is build complexity and that you manage model files yourself.
- **(C) Docker** trades some of that control for reproducibility and portability — the right call for actual servers and cloud hosts. The CUDA image plus `--gpus all` is the standard GPU container recipe.

Two cross-cutting realities tie all three together. First, **exposure and auth are coupled**: the server binds to `127.0.0.1` by default (safe), but every "real deployment" reason to set `--host 0.0.0.0` (Docker port mapping, remote clients) is simultaneously a reason you now need `--api-key` and TLS. Second, **concurrency is a sizing decision, not a default**: `-np` sets how many slots exist, and continuous batching (on by default) is what makes those slots progress together — under-provisioning slots serializes clients regardless of GPU headroom.

### Staleness notes

- The **SteelPh0enix guide is late-2024**. It predates `-hf` becoming the default-style one-command flow, so its flag set should be re-verified against the current server README before relying on it.
- The **Hugging Face usage doc references the old `ggerganov/llama.cpp` repo URL**; the project now lives at `ggml-org/llama.cpp` (and the Docker images are under `ghcr.io/ggml-org/...`). Treat repo/URL references in that doc as renamed.

## Recommendations

**Securing it.** Never set `--host 0.0.0.0` without protection. At minimum add `--api-key` (or `--api-key-file`, env `LLAMA_API_KEY`); for anything reachable beyond a trusted LAN, terminate TLS with `--ssl-key-file`/`--ssl-cert-file` or, preferably, put it behind a reverse proxy that handles auth, TLS, and rate-limiting.
Inside Docker, prefer publishing the port only where needed and keeping the API key out of the image (mount a key file or pass the env var).

**Sizing for concurrency.** Set `-np` to the number of simultaneous requests you expect to serve, keep continuous batching on (the default), and size `-c` per slot with your VRAM budget in mind — every slot consumes its own KV cache. If memory is tight, quantize the KV cache (`-ctk`/`-ctv q8_0`) before sacrificing slot count, but note the quality cost (especially for tool calling — see [[concepts/function-calling]]).

**Prompt caching.** Leave `--cache-prompt` on and request `cache_prompt:true` per call so shared prefixes (long system prompts, RAG context) are not recomputed. For workloads that resume sessions, persist slot KV state with `--slot-save-path` plus the `save`/`restore` slot actions.

**Observability.** Pass `--metrics` to expose Prometheus metrics — it is off by default, so production deployments that skip it are flying blind. Use `GET /health` (`200`/`503`) for readiness probes.

**Sane defaults to start.** `-ngl 99` (offload everything you can), `-fa auto`, `-cb` on, `-c` sized to your real prompts, `-np` matched to expected concurrency, and `--jinja` if you need chat templates or tool calling.

## Pages Compared

- [[summaries/community-hf-gguf-usage]]
- [[summaries/community-steelphoenix-guide]]
- [[summaries/server-readme]]
- [[concepts/server-api]]
- [[entities/binary-llama-server]]
- [[concepts/build-and-backends]]
- [[concepts/gguf-format]]
- [[concepts/quantization]]
- [[concepts/function-calling]]
- [[concepts/kv-cache-and-context]]