Importance Matrix (imatrix)
Definition
An importance matrix (imatrix) is a set of importance statistics for a model's weights. It is gathered by running the unquantized (typically f16) model over a body of calibration text and recording which weights matter most. During quantization, the llama-quantize binary uses these statistics to preserve the most important weights, which markedly improves quality at low bit widths, especially for the IQ (i-quant) types.
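To make "preserve the most important weights" concrete, here is a toy sketch, not llama.cpp's actual quantizer. It uses a symmetric round-to-nearest scheme and picks, from a small grid of candidate scales, the one that minimizes the importance-weighted squared error Σ_j imp_j · (w_j − s · q_j)². The function names, the scale grid, and the [-7, 7] integer range are illustrative assumptions.

```python
def quantize_block(weights, scale, qmax=7):
    """Symmetric round-to-nearest quantization to integers in [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, importance, scale, qmax=7):
    """Importance-weighted squared reconstruction error for one scale."""
    q = quantize_block(weights, scale, qmax)
    return sum(imp * (w - scale * qi) ** 2
               for w, imp, qi in zip(weights, importance, q))


def best_scale(weights, importance, qmax=7, steps=20):
    """Try scales around max|w|/qmax and keep the one with the lowest weighted error."""
    base = max(abs(w) for w in weights) / qmax
    if base == 0.0:
        return 1.0  # all-zero block: any scale reconstructs it exactly
    candidates = [base * (0.7 + 0.6 * i / steps) for i in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(weights, importance, s, qmax))


w = [0.9, -0.31, 0.12, 0.5]
uniform = [1.0, 1.0, 1.0, 1.0]     # no imatrix: every column counts equally
focused = [1.0, 100.0, 1.0, 1.0]   # hypothetical imatrix: column 1 saw large activations
s_uniform = best_scale(w, uniform)
s_focused = best_scale(w, focused)
```

With the focused importance, the chosen scale can trade a little extra error on lightly used columns for a closer fit on column 1. llama.cpp's quantizers are understood to apply the same weighting idea with more elaborate searches.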
How It Works
The matrix is produced by the llama-imatrix binary running over a calibration corpus and is later consumed by llama-quantize via --imatrix.
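The collection step can be sketched as follows, assuming the activations that feed one weight tensor arrive as a list of per-token row vectors. For each input column j the sketch accumulates Σ x_j² and counts the rows seen. The class and method names are hypothetical; llama.cpp's real collector is C++ code hooked into the compute graph.

```python
class ImatrixAccumulator:
    """Hypothetical per-tensor accumulator of squared input activations."""

    def __init__(self, n_cols):
        self.sum_sq = [0.0] * n_cols  # running sum of x_j**2 per input column
        self.count = 0                # number of activation rows (tokens) seen

    def add_batch(self, rows):
        for row in rows:
            if len(row) != len(self.sum_sq):
                raise ValueError("row width does not match the tensor's input dimension")
            for j, x in enumerate(row):
                self.sum_sq[j] += x * x
            self.count += 1

    def importance(self):
        """Mean squared activation per column (zeros if nothing was collected)."""
        if self.count == 0:
            return [0.0] * len(self.sum_sq)
        return [s / self.count for s in self.sum_sq]


acc = ImatrixAccumulator(3)
acc.add_batch([[1.0, 0.0, -2.0], [3.0, 0.5, 0.0]])
print(acc.importance())  # [5.0, 0.125, 2.0]
```

Keeping raw sums and a count, rather than a running mean, makes it cheap to add more text later or to pool results from separate runs.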
Statistics are computed on squared activations. Reported quantities (available via --show-statistics) include:
- Sum(Act^2): sum of squared activations.
- %Active: fraction of activations above a 1e-5 threshold.
- Entropy / E(norm): activation entropy.
- ZD Score: see arXiv 2406.17415.
- CosSim: cosine similarity versus the prior layer.
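These quantities can be approximated from one tensor's per-column squared-activation values. This is a minimal sketch under stated assumptions, not llama.cpp's code: E(norm) is taken as entropy divided by log2 of the column count, the ZD score as the fraction of columns whose z-score exceeds 1, and CosSim as the cosine similarity between the vectors of same-named tensors in adjacent layers. These are readings of the labels, not definitions verified against the source.

```python
import math


def tensor_stats(values, threshold=1e-5):
    """Summary statistics for one tensor's per-column squared-activation values."""
    if not values:
        raise ValueError("empty tensor")
    n = len(values)
    total = sum(values)                                # Sum(Act^2)
    active = sum(1 for v in values if v > threshold) / n  # %Active as a fraction
    # Entropy of the values treated as a probability distribution.
    probs = [v / total for v in values if v > 0] if total > 0 else []
    entropy = -sum(p * math.log2(p) for p in probs)
    e_norm = entropy / math.log2(n) if n > 1 else 0.0  # assumed normalization
    # Assumed ZD definition: share of columns with a z-score above 1.
    mean = total / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    zd = sum(1 for v in values if std > 0 and (v - mean) / std > 1) / n
    return {"sum": total, "active": active, "entropy": entropy,
            "e_norm": e_norm, "zd": zd}


def cos_sim(a, b):
    """Cosine similarity between two equally sized vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0
```

A flat vector such as [1.0, 1.0, 1.0, 1.0] gives maximal normalized entropy and no outliers. A spiky one such as [0.0, 0.0, 0.0, 8.0] gives zero entropy, 25% active columns, and one column flagged by the z-score test.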
The default output format is GGUF. The legacy dat format is selected with --output-format dat or by giving the output file a non-.gguf extension, and files can be converted between the two formats in either direction. Multiple matrices can be merged by passing --in-file repeatedly.
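Merging can be pictured as pooling per-tensor sums and counts. A minimal sketch, assuming each matrix is a dict mapping a tensor name to a pair (per-column sum of squared activations, count). That layout is an assumption chosen for illustration, and the real merge code may treat mismatched or missing tensors differently.

```python
def merge_imatrices(*matrices):
    """Pool per-tensor (sum_sq, count) pairs from several imatrix dicts."""
    merged = {}
    for m in matrices:
        for name, (sum_sq, count) in m.items():
            if name not in merged:
                merged[name] = (list(sum_sq), count)
                continue
            acc_sum, acc_count = merged[name]
            if len(acc_sum) != len(sum_sq):
                raise ValueError(f"column count mismatch for {name}")
            merged[name] = ([a + b for a, b in zip(acc_sum, sum_sq)],
                            acc_count + count)
    return merged


a = {"blk.0.attn_q.weight": ([10.0, 0.25], 2)}
b = {"blk.0.attn_q.weight": ([2.0, 0.75], 2),
     "blk.0.ffn_up.weight": ([1.0], 1)}
merged = merge_imatrices(a, b)
```

Summing raw sums and counts, instead of averaging two averages, weights each source by how much text it actually covered.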
Key Parameters
- --imatrix FILE (on llama-quantize): the imatrix file to use during quantization.
- --process-output (on llama-imatrix, default false): whether to also collect statistics for output.weight so the imatrix is applied to it. It is usually better not to, hence the default.
- --output-format {gguf,dat}: output format selection; GGUF is the default.
- --in-file FILE: repeatable; merges multiple matrices.
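The two binaries combine into a two-step pipeline. The sketch below only builds argument lists and runs nothing. The file names are hypothetical, and the -m, -f and -o flags on llama-imatrix plus the positional input, output, type order of llama-quantize reflect common usage; check --help for your build.

```python
def imatrix_cmd(model, calib_text, out_file, process_output=False):
    """Argument list for llama-imatrix (flag spellings assumed; check --help)."""
    cmd = ["llama-imatrix", "-m", model, "-f", calib_text, "-o", out_file]
    if process_output:
        cmd.append("--process-output")  # off by default; usually best left off
    return cmd


def quantize_cmd(imatrix_file, model_in, model_out, qtype):
    """Argument list for llama-quantize using an importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix_file, model_in, model_out, qtype]


steps = [
    imatrix_cmd("model-f16.gguf", "calibration.txt", "imatrix.gguf"),
    quantize_cmd("imatrix.gguf", "model-f16.gguf", "model-IQ4_XS.gguf", "IQ4_XS"),
]
# Each list could be handed to subprocess.run(cmd, check=True) in order.
```

Building argument lists rather than shell strings avoids quoting problems with paths that contain spaces.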
When To Use
Compute an imatrix before quantizing to a low-bit type. It is effectively required for good IQ results and helps minimize the quantized model's perplexity (PPL) and its KL divergence (KLD) from the original model. Larger, more representative calibration data yields a better matrix; a few hundred KB of varied text is a common choice.
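Both metrics are usually measured by running the original and quantized models on the same evaluation text. A minimal sketch of the textbook definitions with hypothetical inputs: perplexity is exp of the mean negative log-likelihood of the observed tokens, and KLD is the per-token KL(P_original ‖ Q_quantized) between next-token distributions, averaged over tokens. This is not a reimplementation of llama.cpp's own perplexity tool.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p_ref, q_quant, eps=1e-12):
    """KL(P || Q) in nats for two probability vectors over the same vocabulary."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(p_ref, q_quant) if p > 0)


def mean_kld(ref_dists, quant_dists):
    """Average per-token KL divergence between reference and quantized outputs."""
    return sum(kl_divergence(p, q)
               for p, q in zip(ref_dists, quant_dists)) / len(ref_dists)
```

A KLD of zero means the quantized model reproduces the original's distributions exactly; perplexity alone can hide shifts that KLD exposes, which is why both are tracked.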
Risks & Pitfalls
- A small or unrepresentative calibration corpus produces a weaker matrix โ use varied text.
- Applying the imatrix to output.weight is usually counterproductive; leave --process-output at its default of false unless you have a specific reason to enable it.
Related Concepts
- quantization: the process that consumes the imatrix.
- binary imatrix: the tool that produces the matrix.
- binary llama quantize: the tool that applies it via --imatrix.
