
Importance Matrix (imatrix)

type: concept | confidence: high | updated: 2026-05-30 | llama_build: master (~2026-05) | sources: 2

Definition

An importance matrix (imatrix) is a set of per-weight importance statistics. It is gathered by running the unquantized (f16) model over a body of calibration text and recording which weights matter most. During quantization, the llama-quantize binary uses these statistics to preserve the most important weights, which markedly improves quality at low bit widths, especially for the IQ (i-quant) types.
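To make the idea of "preserving the most important weights" concrete, here is a minimal sketch of importance-weighted rounding. It is an illustration only, not llama.cpp's quantization kernels: the helper names (`quantize_row`, `weighted_error`, `best_scale`), the grid search over scales, and the sample numbers are all assumptions made for this example.

```python
def quantize_row(x, scale, levels=7):
    """Round each value to the nearest integer step, clamped to [-levels, levels]."""
    return [max(-levels, min(levels, round(v / scale))) for v in x]


def weighted_error(x, q, scale, importance):
    """Squared reconstruction error, with each weight's error scaled by its importance."""
    return sum(w * (v - scale * s) ** 2 for v, s, w in zip(x, q, importance))


def best_scale(x, importance, levels=7, candidates=64):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_abs = max(abs(v) for v in x)
    if max_abs == 0.0:
        return 1.0  # all-zero row: any scale reconstructs it exactly
    best, best_err = None, float("inf")
    for i in range(1, candidates + 1):
        # Candidate scales range from roughly 0.5x to 1.5x of max_abs / levels.
        scale = (max_abs / levels) * (0.5 + i / candidates)
        err = weighted_error(x, quantize_row(x, scale, levels), scale, importance)
        if err < best_err:
            best, best_err = scale, err
    return best


# Hypothetical row of weights and hypothetical importance values.
row = [0.9, -0.05, 0.31, -0.62, 0.12, 0.48]
importance = [0.01, 4.0, 0.02, 0.01, 3.5, 0.02]
uniform = [1.0] * len(row)

s_plain = best_scale(row, uniform, levels=3)     # ignores importance
s_imat = best_scale(row, importance, levels=3)   # uses importance

err_plain = weighted_error(row, quantize_row(row, s_plain, 3), s_plain, importance)
err_imat = weighted_error(row, quantize_row(row, s_imat, 3), s_imat, importance)
print(err_imat <= err_plain)
```

Both scales come from the same candidate grid, so the importance-aware choice can never have a larger importance-weighted error than the uniform one. With only a few levels available, the difference is where the remaining error lands: on weights the calibration data rarely exercises rather than on those it exercises heavily.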

How It Works

The matrix is produced by the imatrix binary (llama-imatrix in current builds) running over a calibration corpus, and is later consumed by the llama-quantize binary via --imatrix.
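The produce-then-consume flow can be sketched as two command lines. The snippet below only builds and prints them. The file names and the IQ4_XS target type are placeholders, and it assumes the tools are installed as `llama-imatrix` and `llama-quantize` with the usual `-m`, `-f`, and `-o` options; check `--help` on your build before relying on it.

```python
import shlex


def imatrix_command(model, calibration, output="imatrix.gguf"):
    """Command that computes an imatrix from a model and a calibration text file."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", output]


def quantize_command(imatrix, src, dst, qtype):
    """Command that quantizes src to dst using a previously computed imatrix."""
    return ["llama-quantize", "--imatrix", imatrix, src, dst, qtype]


# Placeholder paths: substitute your own model and calibration corpus.
steps = [
    imatrix_command("model-f16.gguf", "calibration.txt"),
    quantize_command("imatrix.gguf", "model-f16.gguf", "model-iq4_xs.gguf", "IQ4_XS"),
]

for argv in steps:
    print(shlex.join(argv))
```

To run the steps rather than print them, each list can be passed to `subprocess.run(argv, check=True)`.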

Statistics are computed on squared activations. Reported quantities (available via --show-statistics) include:

Sum(Act^2): sum of squared activations.
%Active: fraction of activations above a 1e-5 threshold.
Entropy / E(norm): entropy of the squared-activation distribution, raw and normalized.
ZD Score: see arXiv 2406.17415.
CosSim: cosine similarity with the corresponding tensor in the previous layer.
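As a rough illustration of what these quantities measure, the sketch below computes them for a plain list of squared-activation values. The exact definitions used by --show-statistics may differ: the function names, the use of base-2 entropy, and normalization by log2(n) are assumptions for this example, and ZD Score is omitted because its definition lives in the cited paper.

```python
import math

ACTIVE_THRESHOLD = 1e-5  # threshold quoted for %Active


def summarize(sq_acts):
    """Summary statistics for one tensor's squared activations (illustrative only)."""
    n = len(sq_acts)
    total = sum(sq_acts)
    active = sum(1 for v in sq_acts if v > ACTIVE_THRESHOLD) / n

    # Treat the squared activations as a probability distribution and
    # compute its Shannon entropy in bits.
    entropy = 0.0
    if total > 0.0:
        for v in sq_acts:
            if v > 0.0:
                p = v / total
                entropy -= p * math.log2(p)

    # Normalize by the maximum possible entropy for n elements.
    entropy_norm = entropy / math.log2(n) if n > 1 else 0.0
    return {
        "sum_sq": total,
        "active": active,
        "entropy": entropy,
        "entropy_norm": entropy_norm,
    }


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


stats = summarize([1.0, 1.0, 0.0, 0.0])
print(stats)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In this toy input, half of the elements carry all of the activation energy, so the active fraction is 0.5 and the normalized entropy falls halfway between fully concentrated (0) and fully uniform (1).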

The default output format is GGUF. A legacy dat format is available via --output-format dat or by using a non-.gguf extension, and conversion is bidirectional. Multiple matrices can be merged by passing --in-file repeatedly.
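Merging and format conversion both come down to passing existing files via --in-file and choosing an output. The sketch below only assembles the argument list. It assumes the binary is named `llama-imatrix` and accepts `-o` for the output path without needing a model for a pure merge or conversion; the file names are placeholders.

```python
import shlex


def merge_command(inputs, output, output_format=None):
    """Build a command that loads one or more imatrix files and writes a combined one."""
    if not inputs:
        raise ValueError("at least one input imatrix file is required")
    if output_format not in (None, "gguf", "dat"):
        raise ValueError("output_format must be 'gguf' or 'dat'")

    argv = ["llama-imatrix"]
    for path in inputs:
        argv += ["--in-file", path]  # the flag is repeated once per input
    argv += ["-o", output]
    if output_format is not None:
        argv += ["--output-format", output_format]
    return argv


# Merge two matrices into one GGUF file (the default format).
print(shlex.join(merge_command(["imatrix-a.gguf", "imatrix-b.gguf"], "imatrix-merged.gguf")))

# Convert a GGUF matrix to the legacy dat format.
print(shlex.join(merge_command(["imatrix-merged.gguf"], "imatrix-merged.dat", "dat")))
```

Passing the format explicitly in the conversion case avoids depending on the file extension to select dat.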

Key Parameters

--imatrix FILE (on llama-quantize): the imatrix file that llama-quantize reads during quantization.
--process-output (default false): whether to apply the imatrix to output.weight. It is usually better not to, hence the default.
--output-format {gguf,dat}: format of the generated imatrix file; gguf is the default.
--in-file: an existing imatrix file to load; repeat the flag to merge multiple matrices.

When To Use

Compute an imatrix before quantizing to a low-bit type — it is effectively required for good IQ results and helps minimize both Perplexity (ppl) and KL-Divergence (kld). Larger, more representative calibration data yields a better matrix; a few hundred KB of varied text is a common choice.

Risks & Pitfalls

A small or unrepresentative calibration corpus produces a weaker matrix — use varied text.
Applying the imatrix to output.weight is usually counterproductive; leave --process-output at its default of false unless you have a reason not to.

Related Concepts

quantization — the process that consumes the imatrix.
binary imatrix — the tool that produces the matrix.
binary llama quantize — the tool that applies it via --imatrix.