ggml (Tensor Library)
Overview
ggml is a tensor library written in C by Georgi Gerganov: a minimal, dependency-free machine-learning library built for efficient model inference. It is the low-level compute engine underneath llama.cpp. Where llama.cpp handles model architectures, tokenizers, sampling, and the user-facing binaries, ggml does the actual math: defining tensors, building the compute graph, executing operations on each hardware backend, and providing the integer quantization (see quantization) that shrinks models. The name combines the author's initials (GG) with "ML" (machine learning).
llama.cpp is described in its own README as "the main playground for developing features of the ggml library"; the two are co-developed. The same library also powers whisper.cpp (speech-to-text) and other inference projects, so ggml is a general-purpose foundation, not Llama-specific.
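The "define tensors, build the compute graph, then execute" flow described above can be illustrated with a toy model. This is a minimal sketch in Python, not ggml's C API: the names `Node`, `leaf`, `add`, `mul`, `build_forward`, and `compute` are hypothetical and exist only for this illustration. The point it shows is deferred execution: operations only record graph nodes, and arithmetic happens in a separate pass over the ordered graph.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy graph node (hypothetical, not a ggml type): a leaf with data or an op."""
    op: str                                   # "leaf", "add", or "mul"
    inputs: List["Node"] = field(default_factory=list)
    data: Optional[List[float]] = None        # filled in for leaves, or after compute()


def leaf(values):
    """Create an input tensor that already holds its data."""
    return Node("leaf", data=list(values))


def add(a, b):
    """Record an element-wise addition; nothing is computed here."""
    return Node("add", [a, b])


def mul(a, b):
    """Record an element-wise multiplication; nothing is computed here."""
    return Node("mul", [a, b])


def build_forward(root):
    """Return all nodes reachable from root, inputs before their consumers."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def compute(order):
    """Execute the recorded operations in graph order."""
    for node in order:
        if node.op == "leaf":
            continue
        a, b = (parent.data for parent in node.inputs)
        if node.op == "add":
            node.data = [x + y for x, y in zip(a, b)]
        elif node.op == "mul":
            node.data = [x * y for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown op: {node.op}")


x = leaf([1.0, 2.0, 3.0])
w = leaf([2.0, 2.0, 2.0])
b = leaf([0.5, 0.5, 0.5])
y = add(mul(x, w), b)          # only recorded at this point; y.data is still None
graph = build_forward(y)       # 3 leaves + 2 ops
compute(graph)
result = y.data                # [2.5, 4.5, 6.5]
```

Separating graph construction from execution is what lets a library of this kind plan memory up front and hand the same graph to different hardware backends.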
Characteristics
- Plain C, minimal dependencies: designed to compile and run almost anywhere.
- The backend system lives here. ggml implements the compute backends that llama.cpp exposes: CPU (with SIMD: AVX/NEON/etc.), CUDA, Metal, Vulkan, ROCm/HIP, SYCL, and more. See build and backends.
- Quantization: ggml defines the quantized tensor types (the `ggml_type` enum: Q4_K, Q8_0, the IQ-series, etc.) used by quantization. File-level presets such as Q4_K_M are mixes of these tensor types rather than enum values themselves.
- GGUF is ggml's file format. GGUF (the "GG" stands for ggml) is the single-file container that stores a model's weights, metadata, and tokenizer for ggml/llama.cpp to load. The spec lives in the `ggml-org/ggml` repo, not the llama.cpp repo.
- Compute graph + memory efficiency: ggml builds an explicit graph of operations and is optimized for low memory use (mmap loading, no heavy framework runtime).
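To make the quantization bullet concrete, here is a minimal sketch of block-wise 8-bit quantization in the style of Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. It is an illustration under stated assumptions, not ggml's implementation: the function names are hypothetical, and the scale is kept as a Python float, whereas the real format stores it as a 16-bit float.

```python
BLOCK_SIZE = 32  # values per block, matching the Q8_0 block size


def quantize_block(values):
    """Quantize one block to (scale, int8-range quants). Hypothetical helper."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0                       # largest magnitude maps to +/-127
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE           # all-zero block
    # Round to the nearest integer step and clamp to the signed 8-bit range used here.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Example block: 32 evenly spaced values from -1.0 up to 0.9375.
block = [(i - 16) / 16.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)

# Rounding to the nearest step bounds the error by half a step per value.
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

With a 2-byte scale, a block in this layout takes 34 bytes for 32 weights (about 8.5 bits per weight) instead of 128 bytes at 32-bit float precision. The lower-bit types follow the same per-block idea with more elaborate packing.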
How to Use
Most users never touch ggml directly; they use it through llama.cpp (or a wrapper like Ollama/LM Studio). You encounter ggml in three practical ways:
- At build time: choosing a backend selects which ggml backend is compiled in (`-DGGML_CUDA=ON`, `-DGGML_METAL=ON`, `-DGGML_VULKAN=ON`, …); these are ggml flags. See build and backends.
- As a model format: every GGUF file you download is a ggml-format container (gguf format).
- As a C library: developers can build directly against ggml's C API for custom inference; the `ggml-org/ggml` repo is the standalone home of the library.
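As an illustration of the "model format" point, the fixed-size header at the start of a GGUF file can be inspected without ggml at all. The sketch below assumes the little-endian layout used by GGUF version 2 and later: a 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` and the sample values are hypothetical; the specification in the `ggml-org/ggml` repo is the authority on the layout.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(buf):
    """Parse the fixed-size GGUF header from the first bytes of a file. Hypothetical helper."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# A synthetic header for demonstration; the counts are made-up values.
sample = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(sample)
```

For a real model, the same function could be given the first `HEADER.size` bytes read from the file in binary mode. The metadata key/value pairs and tensor descriptions follow the header and are not parsed by this sketch.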
Affiliation note (confidence: medium): in 2026, Georgi Gerganov's ggml.ai joined Hugging Face (per a llama.cpp Discussions announcement). ggml and llama.cpp remain open-source under ggml-org.
Related Entities
- project llama cpp: the inference project built on ggml
- gguf format: ggml's model file format
- quantization: the quantized tensor types ggml defines
- build and backends: the hardware backends ggml implements
