Customization & Tuning: The llama.cpp Knobs
This page is a cross-cutting map of (almost) everything you can customize or tune in llama.cpp, organized so that a user (or a video walkthrough) can see all the levers in one place. Rather than re-deriving each subsystem, it indexes the dedicated concept pages and groups their knobs into six families: output quality/style, structured output, context & memory, speed/throughput, hardware/backend, and behavior/serving. Where a knob lives in more than one binary, the table notes where it is set: on llama-cli (cli), on llama-server (server), or at build time (build).
Comparison
The knobs, by family
| Lever | What it controls | Key flags / fields | Where set |
|---|---|---|---|
| Sampling chain (sampling parameters) | Randomness, diversity, repetition of the chosen tokens | --temp, --top-k, --top-p, --min-p, --typical, --top-n-sigma, repeat/--presence-penalty/--frequency-penalty, --dry-multiplier (DRY), --xtc-probability (XTC), --mirostat, --samplers / --sampling-seq order, -s/--seed | cli + server |
| Grammars / schema (gbnf grammars) | Guarantees output form (valid JSON, enums, notations) | --grammar, --grammar-file, -j/--json-schema; request grammar, json_schema, response_format | cli + server |
| Function calling (function calling) | Structured tool_calls from a tools array | --jinja (required), tools, parallel_tool_calls, --chat-template-file | server |
| Context & KV cache (kv cache and context) | How much the model attends to, and the memory cost | -c/--ctx-size, -ctk/-ctv (q8_0/q4_0...), --rope-scaling/--rope-scale + --yarn-*, --cache-prompt + --cache-reuse, --context-shift | cli + server |
| Speculative decoding (speculative decoding) | Faster generation via a draft model or n-gram | -md/--spec-draft-model, --spec-type {...}, --spec-draft-n-max, --spec-ngram-*, --spec-default | cli + server |
| Offload & batching | Throughput / latency on a given device | -ngl/--n-gpu-layers, -fa/--flash-attn, -np/--parallel + -cb continuous batching, -b/-ub batch sizes | cli + server |
| Quantization (quantization) | Model size, speed, and accuracy floor | quant tag (:Q4_K_M, :Q8_0...) at model-pick time | model choice |
| Hardware / backend (build and backends) | Which processor runs inference, how the model is split | build flags (-DGGML_CUDA=ON...), -dev/--device, -sm/--split-mode, -ts/--tensor-split, -cmoe/--cpu-moe | build + runtime |
| Behavior / serving (server api) | Prompt format, reasoning, exposure | --chat-template/--jinja, -rea/--reasoning + --reasoning-budget, system prompt, --host/--port/--api-key | cli + server |
1. Output quality/style: the sampling chain
The default sampler chain (--samplers) is penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature (short form edskypmxt via --sampling-seq). Order matters: moving temperature to a different position in the chain changes the result. Defaults of note: --temp 0.80, --top-k 40, --top-p 0.95, --min-p 0.05. DRY, XTC, and mirostat are off by default. See sampling parameters.
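To make the ordering point concrete, the sketch below applies a top-p cutoff and a temperature scaling in both orders on a toy distribution. It is an illustrative pure-Python model, not llama.cpp's sampler code: the function names and the logit values are invented for the example.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_temperature(logits, temp):
    """Divide every logit by the temperature (temp > 1 flattens the distribution)."""
    return {tok: value / temp for tok, value in logits.items()}

def apply_top_p(logits, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    probs = softmax(logits)
    kept, cumulative = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = logits[tok]
        cumulative += probs[tok]
        if cumulative >= p:
            break
    return kept

# Toy logits, chosen only for illustration.
toy_logits = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0, "e": 0.0}

top_p_then_temp = apply_temperature(apply_top_p(toy_logits, 0.9), 2.0)
temp_then_top_p = apply_top_p(apply_temperature(toy_logits, 2.0), 0.9)

print(sorted(top_p_then_temp))  # candidates left when temperature runs last
print(sorted(temp_then_top_p))  # candidates left when temperature runs first
```

In this toy case, running a temperature above 1 first flattens the distribution, so the same top-p threshold lets one more candidate through than when temperature runs last.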
2. Structured / constrained output
Grammars (gbnf grammars) constrain which tokens are allowed (--grammar, --json-schema, or the server response_format), and function calling builds on that to emit tool calls, but only with --jinja enabled. The grammar/schema is not injected into the prompt (the tool schema is the exception).
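As a rough illustration of the schema path, the snippet below builds a request body that carries a JSON Schema in the json_schema field listed in the table. The schema, the prompt text, and the helper name are hypothetical, and the exact body shape accepted by a given server build is an assumption to check against its documentation.

```python
import json

# Hypothetical schema, used only for illustration.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

def build_completion_body(prompt, schema):
    """Serialize a request body with a prompt and a constraining schema.

    Assumption: the server reads the schema from a "json_schema" field
    and the text from a "prompt" field.
    """
    return json.dumps({"prompt": prompt, "json_schema": schema})

body = build_completion_body("Describe a fictional person as JSON.", person_schema)
print(body)
```

Because the schema is not injected into the prompt, the prompt in this sketch still asks for JSON explicitly; the constraint only limits which tokens can be emitted.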
3. Context & memory
-c/--ctx-size sets the window; -ctk/-ctv set the KV cache data type (f16 default; q8_0/q4_0 shrink memory at a quality cost). RoPE/YaRN flags (--rope-scaling, --yarn-*) push context past the trained length. --cache-prompt/--cache-reuse reuse shared prefixes; --context-shift (off by default) slides the window when full. See kv cache and context.
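For a feel of the memory trade-off, the sketch below estimates KV cache size from context length and cache type. It is a back-of-the-envelope model that ignores allocator overhead and assumes the same type for K and V; the model dimensions are hypothetical, and the quantized sizes assume blocks of 32 values plus a 2-byte scale.

```python
# Approximate bytes per cached element.
# Assumed block layouts: q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size; the factor 2 covers the K and the V tensors."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(elements * BYTES_PER_ELEMENT[cache_type])

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(8192, 32, 8, 128, cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

With these assumed dimensions the f16 cache comes to 1 GiB at 8192 tokens, and q8_0 needs a little over half of that, which is the headroom the -ctk/-ctv flags buy.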
4. Speed / throughput
Speculative decoding (draft models via -md, or draft-free n-gram/MTP/Eagle3 via --spec-type) speeds generation when acceptance is high. Orthogonal speed levers: -ngl GPU offload, -fa flash attention, -np parallel slots with -cb continuous batching, and -b/-ub batch sizes. Quant choice (quantization) trades accuracy for size and speed.
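Why acceptance matters can be sketched with a simplified cost model for draft-model speculation. It assumes every drafted token is accepted independently with the same probability and that a draft step costs a fixed fraction of a target step; both are idealizations, and the function names and numbers are illustrative rather than measured.

```python
def expected_tokens_per_pass(acceptance, n_draft):
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with the same
    probability; real acceptance varies with position and prompt.
    """
    if acceptance >= 1.0:
        return n_draft + 1.0
    return (1.0 - acceptance ** (n_draft + 1)) / (1.0 - acceptance)

def estimated_speedup(acceptance, n_draft, draft_cost_ratio):
    """Tokens per pass divided by the cost of a pass in target-step units."""
    pass_cost = 1.0 + n_draft * draft_cost_ratio
    return expected_tokens_per_pass(acceptance, n_draft) / pass_cost

# Hypothetical setup: 4 drafted tokens, draft step costing 10% of a target step.
for acceptance in (0.8, 0.5, 0.2):
    print(acceptance, round(estimated_speedup(acceptance, 4, 0.1), 2))
```

Under these assumptions a well-matched draft more than doubles throughput, while a poorly matched one drops the estimate below 1, meaning the drafting work costs more than it saves.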
5. Hardware / backend
Backends are chosen at build time (-DGGML_CUDA=ON, -DGGML_VULKAN=ON, etc.; see build and backends) and selected/split at runtime with -dev/--device, -sm/--split-mode, -ts/--tensor-split, and -cmoe/--cpu-moe for keeping MoE expert tensors on the CPU.
6. Behavior / serving
Prompt formatting is set via --chat-template/--jinja, and reasoning is controlled via -rea/--reasoning and --reasoning-budget. The system prompt and the server-exposure flags (--host/--port/--api-key; see server api) complete this family.
Analysis
Most of these knobs are not independent: tuning one often pushes against another.
- KV-cache quant vs. quality. -ctk/-ctv q8_0 (and especially q4_0) buys context length and lets you raise -np, but it costs accuracy, and the cost is concentrated in precision-sensitive tasks like tool calling. Shrink the cache before cutting slots, but stop at q8_0 for quality-sensitive work.
- Sampling diversity vs. determinism. Raising --temp, enabling XTC, or loosening --top-p/--min-p increases variety but undermines reproducibility. For repeatable output, lower --temp, tighten the cutoffs, and pin -s/--seed. Remember the CLI vs. server default mismatch: the server's /completion defaults repeat_penalty to 1.1 while the CLI default is 1.00 (off); set it explicitly if you need parity.
- Speculative decoding needs a good draft. The speedup only materializes when the target accepts most drafted tokens; a poorly matched draft model adds overhead. The --spec-* flag surface is also fast-moving (the legacy --draft* flags were removed), so verify against your build.
- Grammars constrain form, not meaning. A schema guarantees valid JSON but cannot make the content correct, and pathological patterns (x? x? x?...) are slow; prefer bounded x{0,N}.
- Throughput levers compete for the same VRAM. -c, -np, batch sizes, and -ngl all draw on the same memory budget; raising one may force another down. Flash attention (-fa) and continuous batching (-cb) are mostly free wins that ease this pressure.
- Backend/offload is upstream of everything. Build-time backend choice and -ngl determine whether the GPU-side knobs (KV offload, flash attention) even apply.
Recommendations
Sane starting points. Begin from the defaults and change deliberately:
- Sampling: keep defaults (--temp 0.80, --top-k 40, --top-p 0.95, --min-p 0.05). For factual/structured work, drop --temp to ~0.2 to 0.4 and set a fixed --seed. Reach for DRY or the repetition penalties only if you observe looping.
- Context: size -c to your real prompts, not the maximum. Try KV cache at q8_0 only when memory-bound. Only use RoPE/YaRN when you genuinely exceed the trained context.
- Speed: -ngl 99, -fa auto, -cb on; add -np to match real concurrency. Add speculative decoding only after measuring acceptance with a candidate draft model.
- Structured output: prefer the JSON-Schema path (-j / json_schema / response_format) over hand-written GBNF; add --jinja for tool calling.
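The starting points above can be collected into a launch command. The helper below is a hypothetical convenience, not part of llama.cpp: it only assembles an argument list from the flags named on this page, plus -m for the model path (an assumption, since this page does not list it), and the model filename and numeric values are placeholders. Flag spellings change between builds, so check the result against your binary's --help before running it.

```python
import shlex

def starting_point_args(model_path, ctx_size, n_parallel, structured=False):
    """Assemble a llama-server argument list from the recommended starting points."""
    args = [
        "llama-server",
        "-m", model_path,        # assumed model-path flag
        "-c", str(ctx_size),     # size to real prompts, not the maximum
        "-ngl", "99",
        "-fa", "auto",
        "-cb",
        "-np", str(n_parallel),  # match real concurrency
    ]
    if structured:
        # Lower temperature and a fixed seed for factual/structured work;
        # --jinja is needed for tool calling.
        args += ["--temp", "0.3", "--seed", "42", "--jinja"]
    return args

# Placeholder model file and sizes, for illustration only.
command = starting_point_args("model.gguf", 8192, 2, structured=True)
print(shlex.join(command))
```

Keeping the command as a list makes it easy to change one flag at a time between benchmark runs.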
Method. Change one thing at a time and measure. Use llama-bench for throughput/latency deltas and llama-perplexity for quality regressions (e.g. before/after a KV-cache or quant change). A knob that "feels" better without a measurement is how tuning sessions go in circles.
