GGUF File Format Specification
Key Points
- GGUF is a binary, single-file, mmap-compatible model format for GGML-based inference; it succeeds GGML, GGMF, and GGJT. Its core difference from GGJT is a typed key-value metadata structure (instead of a list of untyped hyperparameters), which makes it extensible without breaking compatibility.
- Models are little-endian by default; big-endian variants exist (introduced in v3). There is currently no in-file flag to detect endianness, so assume little-endian if unspecified.
- File layout: `gguf_header_t` (`magic` `GGUF` = bytes `0x47 0x47 0x55 0x46`; `version` uint32; `tensor_count` uint64; `metadata_kv_count` uint64; then the KV pairs) → `gguf_tensor_info_t[]` → padding → `tensor_data[]`.
- Global alignment is set by `general.alignment` (uint32, must be a multiple of 8; default `32` if absent). Tensor data offsets must be multiples of `ALIGNMENT`; `align_offset(offset) = offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT`.
- Metadata keys must be valid ASCII, hierarchical `lower_snake_case` segments separated by `.`, at most 65535 (2^16-1) bytes. Tensor names must be standard GGUF strings at most 64 bytes long. Tensors currently have at most 4 dimensions.
- `gguf_metadata_value_type` enum (uint32): UINT8=0, INT8=1, UINT16=2, INT16=3, UINT32=4, INT32=5, FLOAT32=6, BOOL=7 (1-byte, 0=false/1=true), STRING=8, ARRAY=9, UINT64=10, INT64=11, FLOAT64=12. Strings are UTF-8, non-null-terminated, length-prepended (uint64 len). Counts/lengths are `uint64` by convention; readers should also support `uint32`.
- `ggml_type` enum (uint32) values include: F32=0, F16=1, Q4_0=2, Q4_1=3, Q5_0=6, Q5_1=7, Q8_0=8, Q8_1=9, Q2_K=10, Q3_K=11, Q4_K=12, Q5_K=13, Q6_K=14, Q8_K=15, IQ2_XXS=16, IQ2_XS=17, IQ3_XXS=18, IQ1_S=19, IQ4_NL=20, IQ3_S=21, IQ2_S=22, IQ4_XS=23, I8=24, I16=25, I32=26, I64=27, F64=28, IQ1_M=29, BF16=30, TQ1_0=34, TQ2_0=35, MXFP4=39, GGML_TYPE_COUNT=40. Removed: Q4_2=4, Q4_3=5; the Q4_0_4_4/4_8/8_8 (31–33) and IQ4_NL_4_4/4_8/8_8 (36–38) repacked types are no longer stored in GGUF files.
- Required general metadata: `general.architecture` (string, `[a-z0-9]+`; known values `llama`, `mpt`, `gptneox`, `gptj`, `gpt2`, `bloom`, `falcon`, `mamba`, `rwkv`), `general.quantization_version` (uint32; required only if any tensor is quantized), `general.alignment` (uint32).
- `general.file_type` (uint32) enumerates the majority tensor type: ALL_F32=0, MOSTLY_F16=1, MOSTLY_Q4_0=2, MOSTLY_Q4_1=3, MOSTLY_Q4_1_SOME_F16=4, MOSTLY_Q8_0=7, MOSTLY_Q5_0=8, MOSTLY_Q5_1=9, MOSTLY_Q2_K=10, MOSTLY_Q3_K_S=11, MOSTLY_Q3_K_M=12, MOSTLY_Q3_K_L=13, MOSTLY_Q4_K_S=14, MOSTLY_Q4_K_M=15, MOSTLY_Q5_K_S=16, MOSTLY_Q5_K_M=17, MOSTLY_Q6_K=18 (Q4_2=5/Q4_3=6 removed). This list visibly lags the `ggml_type` enum (no IQ-series, Q8_K, etc.).
- Per-architecture LLM keys use an `[llm].` prefix: `context_length` (n_ctx), `embedding_length` (n_embd), `block_count`, `feed_forward_length` (n_ff), `attention.head_count` (n_head), `attention.head_count_kv` (GQA), `rope.dimension_count`, `rope.freq_base`, `rope.scaling.{type,factor,original_context_length,finetuned}` (type: none/linear/yarn), `expert_count`/`expert_used_count` (MoE), and SSM keys for Mamba (`ssm.conv_kernel`, `ssm.inner_size`, `ssm.state_size`, `ssm.time_step_rank`).
- Tokenizer metadata under `tokenizer.ggml.*`: `model` (llama/replit/gpt2/rwkv), `tokens`, `scores`, `token_type` (1=normal, 2=unknown, 3=control, 4=user defined, 5=unused, 6=byte), `merges`, `added_tokens`, plus special-token IDs (`bos_token_id`, `eos_token_id`, `unknown_token_id`, `separator_token_id`, `padding_token_id`). Chat template via `tokenizer.chat_template` (Jinja); full HF tokenizer via `tokenizer.huggingface.json`.
- GGUF filename naming convention: `[<Sidecar>]<BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf`, `-`-delimited. Sidecars: `mmproj` (multimodal projector), `mtp` (multi-token-prediction draft module). Type values: `LoRA`, `vocab` (default = tensor model). Shards formatted `<NNNNN>-of-<NNNNN>`, zero-padded 5 digits, starting at `00001`. Minimum valid set = BaseName + SizeLabel + Version. SizeLabel scale prefixes: Q(uadrillion), T(rillion), B(illion), M(illion), K(thousand).
- Standardized tensor names: base layers `token_embd`, `pos_embd`, `output_norm`, `output`; per-block `blk.N.BB` with `attn_norm`, `attn_q/k/v`, `attn_qkv`, `attn_output`, `ffn_norm`, `ffn_up/gate/down`, plus MoE variants `ffn_gate_inp`, `ffn_{gate,down,up}_exp`, and SSM tensors `ssm_in`, `ssm_conv1d`, `ssm_x`, `ssm_a`, `ssm_d`, `ssm_dt`, `ssm_out`.
- Format version history: v1 initial; v2 widened countable values from uint32 to uint64; v3 added big-endian support. Spec version field must be `3`.
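The header layout and string encoding described in the key points can be sketched in Python as follows. This is a minimal illustration, not the reference reader: the helper names `read_gguf_header` and `read_gguf_string` are hypothetical, and the sketch assumes a little-endian file and a stream that returns all requested bytes.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"  # bytes 0x47 0x47 0x55 0x46


def read_gguf_header(stream):
    """Read magic, version, tensor_count and metadata_kv_count (hypothetical helper)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # Assumes little-endian: uint32 version, uint64 tensor_count, uint64 metadata_kv_count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_string(stream):
    """Read a length-prepended (uint64), non-null-terminated UTF-8 string."""
    (length,) = struct.unpack("<Q", stream.read(8))
    return stream.read(length).decode("utf-8")


# Build a tiny in-memory example: a header followed by one key string.
key = b"general.architecture"
blob = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 1) + struct.pack("<Q", len(key)) + key
example_stream = io.BytesIO(blob)
example_header = read_gguf_header(example_stream)
example_key = read_gguf_string(example_stream)
```

Since the file carries no endianness flag, a reader that wants to support big-endian files would need the byte order supplied from outside; the `"<"` prefix above hard-codes the default.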
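The `align_offset` formula from the alignment bullet translates directly into code. The sketch below assumes the default alignment of 32 when `general.alignment` is absent; the validation of the alignment value is an added safeguard, not something the formula itself requires.

```python
def align_offset(offset, alignment=32):
    """Round offset up to the next multiple of alignment (no change if already aligned)."""
    # general.alignment must be a multiple of 8; reject anything else.
    if alignment <= 0 or alignment % 8 != 0:
        raise ValueError("alignment must be a positive multiple of 8")
    return offset + (alignment - (offset % alignment)) % alignment
```

For example, with the default alignment an offset of 33 rounds up to 64, while an offset of 32 stays at 32. The outer `% alignment` is what keeps already-aligned offsets unchanged.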
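To illustrate how the `gguf_metadata_value_type` codes map to on-disk scalars, here is a hedged sketch that reads a single scalar value. The enum values come from the list above; the names `GGUFValueType` and `read_scalar` and the `byte_order` parameter are assumptions made for this example. STRING and ARRAY are deliberately left out because they are not fixed-size.

```python
import io
import struct
from enum import IntEnum


class GGUFValueType(IntEnum):
    UINT8 = 0
    INT8 = 1
    UINT16 = 2
    INT16 = 3
    UINT32 = 4
    INT32 = 5
    FLOAT32 = 6
    BOOL = 7
    STRING = 8
    ARRAY = 9
    UINT64 = 10
    INT64 = 11
    FLOAT64 = 12


# struct format codes for the fixed-size scalar types only.
_SCALAR_FORMATS = {
    GGUFValueType.UINT8: "B", GGUFValueType.INT8: "b",
    GGUFValueType.UINT16: "H", GGUFValueType.INT16: "h",
    GGUFValueType.UINT32: "I", GGUFValueType.INT32: "i",
    GGUFValueType.FLOAT32: "f", GGUFValueType.BOOL: "B",
    GGUFValueType.UINT64: "Q", GGUFValueType.INT64: "q",
    GGUFValueType.FLOAT64: "d",
}


def read_scalar(stream, value_type, byte_order="<"):
    """Read one scalar metadata value; byte_order is '<' (default) or '>'."""
    code = _SCALAR_FORMATS.get(value_type)
    if code is None:
        raise ValueError(f"not a scalar type: {value_type!r}")
    fmt = byte_order + code
    (value,) = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    if value_type == GGUFValueType.BOOL:
        # BOOL is one byte and must be 0 (false) or 1 (true).
        if value not in (0, 1):
            raise ValueError(f"invalid bool byte: {value}")
        return bool(value)
    return value
```

A `general.alignment` value of 32 stored as UINT32 would be read with `read_scalar(io.BytesIO(struct.pack("<I", 32)), GGUFValueType.UINT32)`.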
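The shard suffix rule in the filename convention (five zero-padded digits, counting from `00001`) is easy to get wrong by hand, so a small sketch may help. The function name `shard_suffix` and the base name `Example-7B-v1.0` are hypothetical; the base name only shows the minimum BaseName + SizeLabel + Version set.

```python
def shard_suffix(index, total):
    """Format a shard label as <NNNNN>-of-<NNNNN>, with shards numbered from 1."""
    if total > 99999 or not 1 <= index <= total:
        raise ValueError("shard index must satisfy 1 <= index <= total <= 99999")
    return f"{index:05d}-of-{total:05d}"


# Hypothetical base name: BaseName "Example", SizeLabel "7B", Version "v1.0".
example_filename = f"Example-7B-v1.0-{shard_suffix(2, 3)}.gguf"
```

Here `example_filename` evaluates to `Example-7B-v1.0-00002-of-00003.gguf`.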
Relevant Concepts
- gguf format: this document is the canonical GGUF binary layout and metadata spec
- quantization: the `ggml_type` and `general.file_type` enums enumerate all quant types GGUF can carry
- kv cache and context: RoPE/YaRN scaling and context-length metadata keys live here (`rope.scaling.*`, `context_length`)
Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/ggml + `docs/gguf.md`
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
