llama-cli
Overview
llama-cli (in tools/cli) is llama.cpp's primary inference command-line tool.
It handles both one-shot prompting and interactive chat against a GGUF model. Its
--help text is auto-generated by llama-gen-docs and is grouped into Common,
Sampling, and CLI-specific sections.
Characteristics
- Two modes: one-shot completion and interactive conversation/chat.
- Conversation flags: -cnv/--conversation (auto-enabled when the model has a chat template, and turns on interactive mode), -no-cnv to disable, -st/--single-turn, -r/--reverse-prompt, -sys/--system-prompt (plus -sysf for a file), -mli/--multiline-input.
- Generation / offload flags (shared with the server): -n/--predict (-1 = infinite), -c/--ctx-size (0 = from model), -b 2048, -ub 512, -ngl/--n-gpu-layers (auto / number / all), -sm/--split-mode {none,layer,row,tensor} (default layer), -mg/--main-gpu (0), -fa/--flash-attn (auto), -dev/--device, --list-devices, -ts/--tensor-split.
- Grammar / structured output: --grammar, --grammar-file, -j/--json-schema, -jf.
- Multimodal: -mm/--mmproj, --image, --audio.
- LoRA / control vectors: --lora, --lora-scaled FNAME:SCALE.
- Templating: --jinja (on), --chat-template (large built-in list), --reasoning-format, -rea/--reasoning, --reasoning-budget.
- Quick presets: --gpt-oss-20b-default, --gpt-oss-120b-default, --vision-gemma-4b/12b-default, --spec-default.
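As a rough illustration of how the generation and offload flags listed above combine, the sketch below assembles a one-shot argument list in a Bash array and prints it one argument per line instead of running it. The helper name build_oneshot_cmd, the model path, and the values 128 and 4096 are placeholders chosen for this example, not defaults or recommendations from llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the arguments of a one-shot llama-cli call,
# one per line. It does not execute llama-cli.
build_oneshot_cmd() {
  local model="$1" prompt="$2"
  local -a cmd=(
    llama-cli
    -m "$model"
    -p "$prompt"
    -n 128      # --predict: stop after 128 tokens (-1 would mean infinite)
    -c 4096     # --ctx-size: 0 would take the size from the model
    -ngl all    # --n-gpu-layers: accepts auto, a number, or all
  )
  printf '%s\n' "${cmd[@]}"
}

build_oneshot_cmd model.gguf "Once upon a time"
```

Keeping the arguments in an array means a prompt containing spaces stays a single argument; to actually run the command, the printf line could be swapped for "${cmd[@]}".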
How to Use
One-shot prompt:
llama-cli -m model.gguf -p "prompt"
llama-cli -m model.gguf -p "Once upon a time"
Conversation / chat:
llama-cli -m model.gguf -cnv
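A chat session can also be given a system prompt through -sys, and limited to one exchange with -st, both of which appear in the conversation flags on this page. The following is a minimal sketch that only prints the resulting argument list; build_chat_cmd and its third parameter are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the arguments of a chat-mode llama-cli call,
# one per line. It does not execute llama-cli.
build_chat_cmd() {
  local model="$1" system_prompt="$2" single_turn="${3:-no}"
  local -a cmd=(llama-cli -m "$model" -cnv -sys "$system_prompt")
  # Append -st (--single-turn) only when the caller asks for it.
  if [ "$single_turn" = "yes" ]; then
    cmd+=(-st)
  fi
  printf '%s\n' "${cmd[@]}"
}

build_chat_cmd model.gguf "You are a concise assistant." yes
```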
Run a model straight from Hugging Face:
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
Related Entities
- sampling parameters: the Sampling group of flags this tool exposes.
- kv cache and context: the context size and KV cache flags above.
- speculative decoding: --spec-* / --spec-default drafting support.
- gbnf grammars: the --grammar / --json-schema options.
- multimodal mtmd: the -mm / --image / --audio options.
