
Customization & Tuning — The llama.cpp Knobs

type: synthesis | confidence: high | updated: 2026-05-30 | llama_build: master (~2026-05) | sources: 3

This page is a cross-cutting map of (almost) everything you can customize or tune in llama.cpp, organized so that a user — or a video walkthrough — can see all the levers in one place. Rather than re-deriving each subsystem, it indexes the dedicated concept pages and groups their knobs into six families: output quality/style, structured output, context & memory, speed/throughput, hardware/backend, and behavior/serving. Where a knob lives in more than one binary, the table notes where it is set: on llama-cli (cli), on llama-server (server), or at build time (build).

Comparison

The knobs, by family

| Lever | What it controls | Key flags / fields | Where set |
| --- | --- | --- | --- |
| Sampling chain (sampling parameters) | Randomness, diversity, repetition of the chosen tokens | `--temp`, `--top-k`, `--top-p`, `--min-p`, `--typical`, `--top-n-sigma`, repeat/`--presence-penalty`/`--frequency-penalty`, `--dry-multiplier` (DRY), `--xtc-probability` (XTC), `--mirostat`, `--samplers` / `--sampling-seq` order, `-s`/`--seed` | cli + server |
| Grammars / schema (gbnf grammars) | Guarantees output form (valid JSON, enums, notations) | `--grammar`, `--grammar-file`, `-j`/`--json-schema`; request `grammar`, `json_schema`, `response_format` | cli + server |
| Function calling (function calling) | Structured `tool_calls` from a `tools` array | `--jinja` (required), `tools`, `parallel_tool_calls`, `--chat-template-file` | server |
| Context & KV cache (kv cache and context) | How much the model attends to, and the memory cost | `-c`/`--ctx-size`, `-ctk`/`-ctv` (`q8_0`/`q4_0`...), `--rope-scaling`/`--rope-scale` + `--yarn-*`, `--cache-prompt`+`--cache-reuse`, `--context-shift` | cli + server |
| Speculative decoding (speculative decoding) | Faster generation via a draft model or n-gram | `-md`/`--spec-draft-model`, `--spec-type {...}`, `--spec-draft-n-max`, `--spec-ngram-*`, `--spec-default` | cli + server |
| Offload & batching | Throughput / latency on a given device | `-ngl`/`--n-gpu-layers`, `-fa`/`--flash-attn`, `-np`/`--parallel` + `-cb` continuous batching, `-b`/`-ub` batch sizes | cli + server |
| Quantization (quantization) | Model size, speed, and accuracy floor | quant tag (`:Q4_K_M`, `:Q8_0`...) at model-pick time | model choice |
| Hardware / backend (build and backends) | Which processor runs inference, how the model is split | build flags (`-DGGML_CUDA=ON`...), `-dev`/`--device`, `-sm`/`--split-mode`, `-ts`/`--tensor-split`, `-cmoe`/`--cpu-moe` | build + runtime |
| Behavior / serving (server api) | Prompt format, reasoning, exposure | `--chat-template`/`--jinja`, `-rea`/`--reasoning` + `--reasoning-budget`, system prompt, `--host`/`--port`/`--api-key` | cli + server |

1. Output quality/style — the sampling chain

The default sampler chain (--samplers) is penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature (short form edskypmxt via --sampling-seq). Order matters — moving temperature changes the result. Defaults of note: --temp 0.80, --top-k 40, --top-p 0.95, --min-p 0.05. DRY, XTC, and mirostat are off by default. See sampling parameters.
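
To see why order matters, here is a minimal toy sketch in plain Python. It is not llama.cpp's sampler code: the logits, the helper names, and the simplified top-p rule (keep the smallest set of highest-probability tokens whose mass reaches `p`) are all illustrative assumptions. It compares running top-p before temperature against running temperature before top-p.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(logits, p):
    # Indices of the smallest high-probability set whose cumulative mass reaches p.
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def apply_temperature(logits, temp):
    # Temperature > 1 flattens the distribution, < 1 sharpens it.
    return [x / temp for x in logits]

logits = [3.0, 1.0, 0.5, 0.0]  # toy values, not taken from a real model
TEMP, TOP_P = 2.0, 0.8

# Order A: top-p sees the unscaled distribution (temperature would run afterwards).
kept_top_p_first = top_p_filter(logits, TOP_P)

# Order B: temperature runs first, so top-p sees a flatter distribution.
kept_temp_first = top_p_filter(apply_temperature(logits, TEMP), TOP_P)

print(len(kept_top_p_first), len(kept_temp_first))  # prints "2 3"
```

With the same `--temp` and `--top-p` values, moving temperature ahead of top-p lets a third candidate survive the cutoff, which is the kind of shift a reordered `--samplers` chain can cause.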

2. Structured / constrained output

GBNF grammars (gbnf grammars) constrain which tokens are allowed (--grammar, --json-schema, or the server response_format), and function calling builds on them to emit tool calls, but only with --jinja enabled. The grammar/schema is not injected into the prompt (the tool schema is the exception).
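
As an illustration of the request-side path, the sketch below builds a JSON body that carries a schema in the `json_schema` field. The schema, the prompt text, and the helper name `build_completion_request` are hypothetical. The body is only constructed, not sent, and the exact set of fields your server build accepts should be checked against its own documentation.

```python
import json

# Hypothetical schema, used only for illustration.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "unit"],
}

def build_completion_request(prompt, schema):
    # The schema constrains sampling but is not injected into the prompt,
    # so the prompt itself still has to describe the wanted output.
    return json.dumps({"prompt": prompt, "json_schema": schema})

body = build_completion_request(
    "Extract the city and temperature unit as JSON: 'weather in Oslo, metric'.",
    schema,
)
print(body)
```

Because the constraint guarantees form only, the prompt still does the work of telling the model what the fields mean.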

3. Context & memory

-c/--ctx-size sets the window; -ctk/-ctv set the KV cache data type (f16 default; q8_0/q4_0 shrink memory at a quality cost). RoPE/YaRN flags (--rope-scaling, --yarn-*) push context past the trained length. --cache-prompt/--cache-reuse reuse shared prefixes; --context-shift (off by default) slides the window when full. See kv cache and context.
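
A rough back-of-the-envelope estimate shows how -c and -ctk/-ctv interact. The sketch below assumes a plain attention model whose cache holds one K and one V vector per layer, per KV head, per position, and it ignores padding, per-slot overhead, and any model-specific cache layouts. The model dimensions in the example are hypothetical. The block sizes are the ggml ones: q8_0 packs 32 values into 34 bytes and q4_0 packs 32 values into 18 bytes.

```python
# (bytes per block, elements per block) for each cache type
CACHE_TYPES = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values + one fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values + one fp16 scale
}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    # Element count for K alone (V has the same count).
    elems = n_ctx * n_layers * n_kv_heads * head_dim
    total = 0
    for cache_type in (type_k, type_v):
        block_bytes, block_elems = CACHE_TYPES[cache_type]
        total += elems * block_bytes // block_elems
    return total

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16_bytes = kv_cache_bytes(8192, **shape)
q8_bytes = kv_cache_bytes(8192, type_k="q8_0", type_v="q8_0", **shape)

print(f16_bytes / 2**30, "GiB at f16")
print(q8_bytes / 2**30, "GiB at q8_0")
```

Under these assumptions an 8192-token window costs 1 GiB at f16 and about 0.53 GiB at q8_0, so the cache grows linearly with -c and roughly halves with q8_0.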

4. Speed / throughput

Speculative decoding (draft models via -md, or draft-free n-gram/MTP/Eagle3 via --spec-type) speeds up generation when the acceptance rate is high. Orthogonal speed levers: -ngl GPU offload, -fa flash attention, -np parallel slots with -cb continuous batching, and -b/-ub batch sizes. Quant choice (quantization) trades accuracy for size and speed.
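
The dependence on acceptance can be made concrete with the standard idealized model of speculative decoding. The sketch assumes each drafted token is accepted independently with probability `alpha`, and that one draft step costs a fixed fraction `draft_cost` of one target step. Both are simplifying assumptions, and the function names are illustrative rather than anything llama.cpp exposes.

```python
def expected_tokens_per_pass(alpha, n_draft):
    # Expected tokens produced per target-model pass when n_draft tokens are
    # drafted and each is accepted independently with probability alpha.
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1 - alpha ** (n_draft + 1)) / (1 - alpha)

def estimated_speedup(alpha, n_draft, draft_cost):
    # One round costs n_draft draft steps plus one target step,
    # measured in units of a single target step.
    return expected_tokens_per_pass(alpha, n_draft) / (n_draft * draft_cost + 1)

for alpha in (0.3, 0.6, 0.8):
    print(alpha, round(estimated_speedup(alpha, n_draft=4, draft_cost=0.2), 2))
```

In this model a draft that is accepted 30% of the time is slower than no draft at all (estimate below 1.0), while 80% acceptance gives a clear win. Measured acceptance on your own prompts is what decides which case you are in.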

5. Hardware / backend

Backends are chosen at build time (-DGGML_CUDA=ON, -DGGML_VULKAN=ON, etc. — see build and backends) and selected/split at runtime with -dev/--device, -sm/--split-mode, -ts/--tensor-split, and -cmoe/--cpu-moe for keeping MoE expert tensors on the CPU.

6. Behavior / serving

Prompt formatting is controlled via --chat-template/--jinja, and reasoning via -rea/--reasoning and --reasoning-budget. The system prompt and the server-exposure flags (--host, --port, --api-key) round out this family; see server api.

Analysis

Most of these knobs are not independent — tuning one often pushes against another:

KV-cache quant vs. quality. -ctk/-ctv q8_0 (and especially q4_0) buys context length and lets you raise -np, but it costs accuracy — and the cost is concentrated in precision-sensitive tasks like tool calling. Shrink the cache before cutting slots, but stop at q8_0 for quality-sensitive work.
Sampling diversity vs. determinism. Raising --temp, enabling XTC, or loosening --top-p/--min-p increases variety but undermines reproducibility. For repeatable output, lower --temp, tighten the cutoffs, and pin -s/--seed. Remember the CLI vs. server default mismatch: the server's /completion defaults repeat_penalty to 1.1 while the CLI default is 1.00 (off) — set it explicitly if you need parity.
Speculative decoding needs a good draft. The speedup only materializes when the target accepts most drafted tokens; a poorly matched draft model adds overhead. The --spec-* flag surface is also fast-moving (the legacy --draft* flags were removed), so verify against your build.
Grammars constrain form, not meaning. A schema guarantees valid JSON but cannot make the content correct, and pathological patterns (x? x? x?...) are slow — prefer bounded x{0,N}.
Throughput levers compete for the same VRAM. -c, -np, batch sizes, and -ngl all draw on the same memory budget; raising one may force another down. Flash attention (-fa) and continuous batching (-cb) are mostly free wins that ease this pressure.
Backend/offload is upstream of everything. Build-time backend choice and -ngl determine whether the GPU-side knobs (KV offload, flash attention) even apply.

Recommendations

Sane starting points. Begin from the defaults and change deliberately:

Sampling: keep defaults (--temp 0.80, --top-k 40, --top-p 0.95, --min-p 0.05). For factual/structured work, drop --temp to ~0.2–0.4 and set a fixed --seed. Reach for DRY or the repetition penalties only if you observe looping.
Context: size -c to your real prompts, not the maximum. Try KV cache at q8_0 only when memory-bound. Only use RoPE/YaRN when you genuinely exceed the trained context.
Speed: -ngl 99, -fa auto, -cb on; add -np to match real concurrency. Add speculative decoding only after measuring acceptance with a candidate draft model.
Structured output: prefer the JSON-Schema path (-j / json_schema / response_format) over hand-written GBNF; add --jinja for tool calling.
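
The starting points above can be collected into a small helper that assembles a llama-server argument list. This is a sketch, not an official launcher: the helper name and its parameters are hypothetical, `-m` is assumed to be the model-path flag, and the `--spec-*` flags are left out on purpose because they should be added only after measuring acceptance. Verify every flag against `--help` for your build.

```python
def starting_point_args(model_path, ctx_size, n_parallel,
                        memory_bound=False, tool_calling=False):
    # Baseline: full GPU offload, automatic flash attention, explicit context
    # size and slot count. Sampling flags are left at their defaults.
    args = [
        "llama-server",
        "-m", model_path,        # assumed model-path flag
        "-c", str(ctx_size),     # sized to real prompts, not the maximum
        "-ngl", "99",
        "-fa", "auto",
        "-np", str(n_parallel),  # match real concurrency
    ]
    if memory_bound:
        # Quantize the KV cache only when memory is the bottleneck.
        args += ["-ctk", "q8_0", "-ctv", "q8_0"]
    if tool_calling:
        args.append("--jinja")
    return args

args = starting_point_args("model.gguf", ctx_size=8192, n_parallel=2, tool_calling=True)
print(" ".join(args))
```

Keeping the optional levers behind explicit switches mirrors the advice to change one thing at a time: each deviation from the baseline is a visible, deliberate choice.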

Method. Change one thing at a time and measure. Use llama-bench for throughput/latency deltas and llama-perplexity for quality regressions (e.g. before/after a KV-cache or quant change). Trusting a knob that "feels" better without a measurement is how tuning sessions go in circles.