
ggml (Tensor Library)

type: entity | confidence: high | updated: 2026-05-30 | llama_build: master (~2026-05) | sources: 2

Overview

ggml is a tensor library written in C by Georgi Gerganov: a minimal machine-learning library, with no required third-party dependencies, built for efficient model inference. It is the low-level compute engine underneath llama.cpp. Where llama.cpp handles model architectures, tokenizers, sampling, and the user-facing binaries, ggml does the actual math: defining tensors, building the compute graph, executing operations on each hardware backend, and providing the quantized tensor formats (see quantization) that shrink models. The name combines the author's initials (GG) with "ML" (machine learning).
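The "build the graph first, compute later" idea can be illustrated with a toy example. This is a minimal sketch in plain C, not ggml's API: the names node_sketch, leaf_sketch, binary_sketch, and eval_sketch are hypothetical, the nodes hold scalars instead of tensors, and evaluation is a simple recursive walk rather than the scheduled, backend-aware execution a real library performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; none of these names come from ggml. */
typedef enum { OP_LEAF, OP_ADD, OP_MUL } op_sketch;

typedef struct node_sketch {
    op_sketch op;
    float value;                 /* input for leaves, result for ops */
    struct node_sketch *a, *b;   /* operands (NULL for leaves) */
} node_sketch;

/* Create an input node holding a value. */
static node_sketch leaf_sketch(float v) {
    node_sketch n = { OP_LEAF, v, NULL, NULL };
    return n;
}

/* Record an operation. Nothing is computed here; value stays 0. */
static node_sketch binary_sketch(op_sketch op, node_sketch *a, node_sketch *b) {
    node_sketch n = { op, 0.0f, a, b };
    return n;
}

/* Evaluate the graph rooted at n by visiting operands first. */
static float eval_sketch(node_sketch *n) {
    assert(n != NULL);
    switch (n->op) {
    case OP_ADD: n->value = eval_sketch(n->a) + eval_sketch(n->b); break;
    case OP_MUL: n->value = eval_sketch(n->a) * eval_sketch(n->b); break;
    case OP_LEAF: break;  /* leaves already hold their value */
    }
    return n->value;
}
```

Building y = w * x + b with these helpers only records the structure. Calling eval_sketch on the root does the arithmetic, and the same graph can be re-evaluated after changing a leaf's value, which is the property that lets a graph-based library plan memory and dispatch work before any computation runs.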

llama.cpp is described in its own README as "the main playground for developing new features for the ggml library"; the two are co-developed. The same library also powers whisper.cpp (speech-to-text) and other inference projects, so ggml is a general-purpose foundation, not Llama-specific.

Characteristics

C core and C API, minimal dependencies: designed to compile and run almost anywhere.
The backend system lives here. ggml implements the compute backends that llama.cpp exposes: CPU (with SIMD — AVX/NEON/etc.), CUDA, Metal, Vulkan, ROCm/HIP, SYCL, and more. See build and backends.
Quantization: ggml defines the quantized tensor types (the ggml_type enum: GGML_TYPE_Q4_K, GGML_TYPE_Q8_0, the IQ-series, etc.) used by quantization. Names such as Q4_K_M are llama.cpp file-level presets that mix several of these tensor types; they are not ggml_type values themselves.
GGUF is ggml's file format. GGUF (the "GG" prefix is shared with ggml) is the single-file container that stores a model's weights, metadata, and tokenizer for ggml/llama.cpp to load. The spec lives in the ggml-org/ggml repo, not the llama.cpp repo.
Compute graph + memory efficiency — builds an explicit graph of operations and is optimized for low memory use (mmap loading, no heavy framework runtime).
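To make the quantized tensor types above more concrete, here is a minimal sketch of block quantization in the style of Q8_0: values are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. The names block_q8_sketch, quantize_q8_sketch, and dequantize_q8_sketch are hypothetical, and the scale is kept as a 32-bit float for simplicity (ggml's own block layout stores it in half precision), so treat this as an illustration of the idea rather than ggml's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define QK_SKETCH 32  /* values per block (Q8_0 also groups 32 values) */

/* Hypothetical block layout: one shared scale plus 32 int8 values.
 * Simplification: the scale is a float here, not fp16. */
typedef struct {
    float  d;               /* scale: original ~= d * qs[i] */
    int8_t qs[QK_SKETCH];   /* quantized values in [-127, 127] */
} block_q8_sketch;

/* Quantize n floats (n must be a multiple of QK_SKETCH) into n/QK_SKETCH blocks. */
static void quantize_q8_sketch(const float *x, block_q8_sketch *y, int n) {
    assert(n % QK_SKETCH == 0);
    for (int b = 0; b < n / QK_SKETCH; b++) {
        const float *xb = x + b * QK_SKETCH;

        /* Largest magnitude in the block determines the scale. */
        float amax = 0.0f;
        for (int i = 0; i < QK_SKETCH; i++) {
            float a = xb[i] < 0.0f ? -xb[i] : xb[i];
            if (a > amax) amax = a;
        }
        float d  = amax / 127.0f;
        float id = d != 0.0f ? 1.0f / d : 0.0f;  /* avoid dividing by zero */
        y[b].d = d;

        /* Scale and round to nearest (half away from zero). */
        for (int i = 0; i < QK_SKETCH; i++) {
            float v = xb[i] * id;
            y[b].qs[i] = (int8_t)(int)(v + (v >= 0.0f ? 0.5f : -0.5f));
        }
    }
}

/* Reconstruct approximate floats from the blocks. */
static void dequantize_q8_sketch(const block_q8_sketch *y, float *x, int n) {
    assert(n % QK_SKETCH == 0);
    for (int b = 0; b < n / QK_SKETCH; b++)
        for (int i = 0; i < QK_SKETCH; i++)
            x[b * QK_SKETCH + i] = y[b].d * (float)y[b].qs[i];
}
```

With this layout each block costs 32 bytes of integers plus one scale instead of 128 bytes of 32-bit floats, and the round-trip error per value is bounded by roughly half the scale. Lower-bit types follow the same per-block pattern with fewer bits per value and more elaborate scaling.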

How to Use

Most users never touch ggml directly — they use it through llama.cpp (or a wrapper like Ollama/LM Studio). You encounter ggml in three practical ways:

Choosing a backend at build time selects which ggml backend is compiled in (-DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON, …); these are ggml CMake options. See build and backends.
As a model format — every GGUF file you download is a ggml-format container (gguf format).
As a C library — developers can build directly against ggml's C API for custom inference; the ggml-org/ggml repo is the standalone home of the library.
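As a small illustration of the "model format" point above, the sketch below checks whether a byte buffer starts with a GGUF header and reads its fixed fields. It assumes the layout used by GGUF version 2 and later (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian). The names gguf_header_sketch and parse_gguf_header_sketch are hypothetical and are not part of ggml's API; real code would normally use the GGUF functions that ggml itself provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct holding the fixed-size header fields
 * (assumes GGUF v2+ layout with 64-bit counts). */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header_sketch;

/* Read little-endian integers byte by byte, so the result does not
 * depend on the host's endianness or alignment rules. */
static uint32_t read_u32_le_sketch(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

static uint64_t read_u64_le_sketch(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short or the magic
 * bytes are not "GGUF". */
static int parse_gguf_header_sketch(const uint8_t *buf, size_t len,
                                    gguf_header_sketch *out) {
    assert(buf != NULL && out != NULL);
    if (len < 24) return -1;                     /* 4 + 4 + 8 + 8 bytes */
    if (memcmp(buf, "GGUF", 4) != 0) return -1;  /* magic mismatch */
    out->version           = read_u32_le_sketch(buf + 4);
    out->tensor_count      = read_u64_le_sketch(buf + 8);
    out->metadata_kv_count = read_u64_le_sketch(buf + 16);
    return 0;
}
```

In practice the first 24 bytes of a model file would be read into the buffer before calling the parser. The metadata key/value pairs and tensor descriptions follow the header and are variable-length, so they are left out of this sketch.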

Affiliation note (confidence: medium): in 2026, Georgi Gerganov's ggml.ai joined Hugging Face (per a llama.cpp Discussions announcement). ggml and llama.cpp remain open-source under ggml-org.

Related Entities

project llama cpp — the inference project built on ggml
gguf format — ggml's model file format
quantization — the quantized tensor types ggml defines
build and backends — the hardware backends ggml implements