wikis / llama.cpp / wiki / summaries / gguf-spec.md view as markdown

GGUF File Format Specification

type: summaryconfidence: highupdated: 2026-05-30llama_build: master (~2026-05)sources: 1

Key Points

GGUF is a binary, single-file, mmap-compatible model format for GGML-based inference; successor to GGML, GGMF, and GGJT. Its core difference from GGJT is a typed key-value metadata structure (instead of a list of untyped hyperparameters), making it extensible without breaking compatibility.
Models are little-endian by default; big-endian variants exist (introduced in v3). There is currently no in-file flag to detect endianness — assume little-endian if unspecified.
File layout: gguf_header_t (magic GGUF = bytes 0x47 0x47 0x55 0x46; version uint32; tensor_count uint64; metadata_kv_count uint64; then the KV pairs) → gguf_tensor_info_t[] → padding → tensor_data[].
Global alignment is set by general.alignment (uint32, must be a multiple of 8; default 32 if absent). Tensor data offsets must be multiples of ALIGNMENT; align_offset(offset) = offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT.
Metadata keys must be valid ASCII, hierarchical lower_snake_case segments separated by ., at most 65535 (2^16-1) bytes. Tensor names must be standard GGUF strings at most 64 bytes long. Tensors currently have at most 4 dimensions.
gguf_metadata_value_type enum (uint32): UINT8=0, INT8=1, UINT16=2, INT16=3, UINT32=4, INT32=5, FLOAT32=6, BOOL=7 (1-byte, 0=false/1=true), STRING=8, ARRAY=9, UINT64=10, INT64=11, FLOAT64=12. Strings are UTF-8, non-null-terminated, length-prepended (uint64 len). Counts/lengths are uint64 by convention; readers should also support uint32.
ggml_type enum (uint32) values include: F32=0, F16=1, Q4_0=2, Q4_1=3, Q5_0=6, Q5_1=7, Q8_0=8, Q8_1=9, Q2_K=10, Q3_K=11, Q4_K=12, Q5_K=13, Q6_K=14, Q8_K=15, IQ2_XXS=16, IQ2_XS=17, IQ3_XXS=18, IQ1_S=19, IQ4_NL=20, IQ3_S=21, IQ2_S=22, IQ4_XS=23, I8=24, I16=25, I32=26, I64=27, F64=28, IQ1_M=29, BF16=30, TQ1_0=34, TQ2_0=35, MXFP4=39, GGML_TYPE_COUNT=40. Removed: Q4_2=4, Q4_3=5; the Q4_0_4_4/4_8/8_8 (31–33) and IQ4_NL_4_4/4_8/8_8 (36–38) repacked types are no longer stored in GGUF files.
Required general metadata: general.architecture (string, [a-z0-9]+; known values llama, mpt, gptneox, gptj, gpt2, bloom, falcon, mamba, rwkv), general.quantization_version (uint32; required only if any tensor is quantized), general.alignment (uint32).
general.file_type (uint32) enumerates the majority tensor type: ALL_F32=0, MOSTLY_F16=1, MOSTLY_Q4_0=2, MOSTLY_Q4_1=3, MOSTLY_Q4_1_SOME_F16=4, MOSTLY_Q8_0=7, MOSTLY_Q5_0=8, MOSTLY_Q5_1=9, MOSTLY_Q2_K=10, MOSTLY_Q3_K_S=11, MOSTLY_Q3_K_M=12, MOSTLY_Q3_K_L=13, MOSTLY_Q4_K_S=14, MOSTLY_Q4_K_M=15, MOSTLY_Q5_K_S=16, MOSTLY_Q5_K_M=17, MOSTLY_Q6_K=18 (Q4_2=5/Q4_3=6 removed). This list visibly lags the ggml_type enum (no IQ-series, Q8_K, etc.).
Per-architecture LLM keys use an [llm]. prefix: context_length (n_ctx), embedding_length (n_embd), block_count, feed_forward_length (n_ff), attention.head_count (n_head), attention.head_count_kv (GQA), rope.dimension_count, rope.freq_base, rope.scaling.{type,factor,original_context_length,finetuned} (type ∈ none/linear/yarn), expert_count/expert_used_count (MoE), and SSM keys for Mamba (ssm.conv_kernel, ssm.inner_size, ssm.state_size, ssm.time_step_rank).
Tokenizer metadata under tokenizer.ggml.*: model (llama/replit/gpt2/rwkv), tokens, scores, token_type (1=normal, 2=unknown, 3=control, 4=user defined, 5=unused, 6=byte), merges, added_tokens, plus special-token IDs (bos_token_id, eos_token_id, unknown_token_id, separator_token_id, padding_token_id). Chat template via tokenizer.chat_template (Jinja); full HF tokenizer via tokenizer.huggingface.json.
GGUF filename naming convention: [<Sidecar>]<BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf, --delimited. Sidecars: mmproj (multimodal projector), mtp (multi-token-prediction draft module). Type values: LoRA, vocab (default = tensor model). Shards formatted <NNNNN>-of-<NNNNN>, zero-padded 5 digits, starting at 00001. Minimum valid set = BaseName + SizeLabel + Version. SizeLabel scale prefixes: Q(uadrillion), T(rillion), B(illion), M(illion), K(thousand).
Standardized tensor names: base layers token_embd, pos_embd, output_norm, output; per-block blk.N.BB with attn_norm, attn_q/k/v, attn_qkv, attn_output, ffn_norm, ffn_up/gate/down, plus MoE variants ffn_gate_inp, ffn_{gate,down,up}_exp, and SSM tensors ssm_in, ssm_conv1d, ssm_x, ssm_a, ssm_d, ssm_dt, ssm_out.
Format version history: v1 initial; v2 widened countable values from uint32 to uint64; v3 added big-endian support. Spec version field must be 3.

Relevant Concepts

gguf format — this document is the canonical GGUF binary layout and metadata spec
quantization — the ggml_type and general.file_type enums enumerate all quant types GGUF can carry
kv cache and context — RoPE/YaRN scaling and context-length metadata keys live here (rope.scaling.*, context_length)

Source Metadata

Type: official documentation (mirror)
Repo/path: ggml-org/ggml + docs/gguf.md
Fetched: 2026-05-30 from master
URL: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md