Quantisation, Glossary, Textbook of AI

Quantisation maps high-precision floating-point tensors to low-bit representations (INT8, INT4, FP8, NF4) to shrink memory footprint and increase arithmetic throughput. A 70B-parameter model in BF16 occupies 140 GB of weights; the same model in INT4 fits in about 35 GB and runs on a single 48 GB GPU. The mathematics is a simple affine map, but the engineering (choosing scales, handling outliers, preserving accuracy) is delicate.
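The memory figures above follow from simple arithmetic, sketched here as a check. The function name `weight_memory_gb` and its parameters are illustrative, not part of any library; the estimate covers weights only (no KV cache or activations) and, when a group size is given, assumes one 16-bit scale per group and ignores zero points.

```python
from typing import Optional


def weight_memory_gb(n_params: float, bits_per_weight: float,
                     group_size: Optional[int] = None,
                     scale_bits: int = 16) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, one scale of `scale_bits` bits is added per group;
    zero points and other metadata are ignored (an assumption of this sketch).
    """
    bits = n_params * bits_per_weight
    if group_size is not None:
        bits += (n_params / group_size) * scale_bits
    return bits / 8 / 1e9


bf16_gb = weight_memory_gb(70e9, 16)                      # 140 GB
int4_gb = weight_memory_gb(70e9, 4)                       # 35 GB
int4_g128_gb = weight_memory_gb(70e9, 4, group_size=128)  # slightly above 35 GB
```

Under these assumptions, per-group scales at a group size of 128 add roughly 1 GB to the 4-bit figure, which is the metadata overhead that finer granularity pays for.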

The basic operation is uniform affine quantisation:

$$q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_\mathrm{min},\; q_\mathrm{max}\right), \qquad \hat{x} = s \cdot (q - z),$$

where $s$ (scale) and $z$ (integer zero point) are chosen so that the integer range $[q_\mathrm{min}, q_\mathrm{max}]$ covers the value range of $x$. Symmetric quantisation sets $z = 0$ and $s = \max|x| / q_\mathrm{max}$; asymmetric quantisation uses $s = (\max(x) - \min(x)) / (q_\mathrm{max} - q_\mathrm{min})$ and $z = q_\mathrm{min} - \mathrm{round}(\min(x)/s)$, which reduces to $z = \mathrm{round}(-\min(x)/s)$ for an unsigned range with $q_\mathrm{min} = 0$, to handle skewed distributions like ReLU activations. Quantisation granularity ranges from per-tensor (one scale for the whole tensor) to per-channel (one scale per output channel) to per-group (one scale per 64- or 128-element block); finer granularity preserves more accuracy at the cost of metadata overhead.
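A minimal NumPy sketch of the affine map, using the common integer zero-point convention $q = \mathrm{round}(x/s) + z$, $\hat{x} = s\,(q - z)$. The helper names (`quant_params`, `quantise`, `dequantise`, `roundtrip_error`) are hypothetical, the input is assumed to be non-constant so that the scale is non-zero, and the two experiments (ReLU-like activations, a weight vector with one injected outlier) are synthetic illustrations rather than measurements on a real model.

```python
import numpy as np


def quant_params(x, n_bits=8, symmetric=True):
    """Return (scale, zero_point, qmin, qmax) for a non-constant tensor x."""
    if symmetric:
        qmax = 2 ** (n_bits - 1) - 1
        qmin = -qmax
        scale = np.max(np.abs(x)) / qmax
        zero_point = 0
    else:
        qmin, qmax = 0, 2 ** n_bits - 1
        scale = (np.max(x) - np.min(x)) / (qmax - qmin)
        zero_point = int(qmin - np.round(np.min(x) / scale))
    return scale, zero_point, qmin, qmax


def quantise(x, scale, zero_point, qmin, qmax):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)


def dequantise(q, scale, zero_point):
    return scale * (q.astype(np.float64) - zero_point)


def roundtrip_error(x, **kwargs):
    """Mean squared error after quantising and dequantising x."""
    scale, zero_point, qmin, qmax = quant_params(x, **kwargs)
    x_hat = dequantise(quantise(x, scale, zero_point, qmin, qmax), scale, zero_point)
    return float(np.mean((x - x_hat) ** 2))


rng = np.random.default_rng(0)

# Skewed, ReLU-like activations: the asymmetric map spends all 256 levels on
# [0, max], while the symmetric map leaves the negative half unused.
act = np.maximum(rng.normal(size=4096), 0.0)
err_sym = roundtrip_error(act, n_bits=8, symmetric=True)
err_asym = roundtrip_error(act, n_bits=8, symmetric=False)

# 4-bit weights with one large outlier: a single per-tensor scale is dominated
# by the outlier, whereas per-group scales confine the damage to one group.
w = rng.normal(size=1024)
w[0] = 50.0
err_per_tensor = roundtrip_error(w, n_bits=4, symmetric=True)
err_per_group = float(np.mean(
    [roundtrip_error(g, n_bits=4, symmetric=True) for g in w.reshape(-1, 128)]
))
```

In this toy setting the asymmetric map gives the lower error on the skewed activations, and per-group scaling gives the lower error on the weights with an outlier; the size of the gap depends on the data.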

PTQ vs QAT. Post-training quantisation (PTQ) takes a trained FP16 model, calibrates scales on a small dataset, and converts the weights without further training. Quantisation-aware training (QAT) inserts fake-quantise nodes into the graph during training, so the model learns to be robust to the rounding error. PTQ is the default for LLM inference because retraining a 70B model is prohibitively expensive; QAT is preferred for smaller mobile models where every accuracy point matters.
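A fake-quantise node can be sketched in NumPy as a forward pass that quantises and immediately dequantises, paired with a backward pass that uses the straight-through estimator. The straight-through estimator is not named in the text above; it is assumed here as the usual way to pass gradients through the non-differentiable rounding step. Function names and the example values are illustrative.

```python
import numpy as np


def fake_quant_forward(x, scale, qmin=-127, qmax=127):
    """Quantise then immediately dequantise; the output stays in floating point."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return scale * q


def fake_quant_backward(x, grad_out, scale, qmin=-127, qmax=127):
    """Straight-through estimator: treat round() as the identity, so the
    gradient passes unchanged where x lies inside the representable range
    and is zero where the forward pass clipped."""
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return grad_out * inside


x = np.array([-2.0, -0.26, 0.0, 0.31, 5.0])
scale = 0.1

# 4-bit signed range [-8, 7]: the first and last entries are clipped.
y = fake_quant_forward(x, scale, qmin=-8, qmax=7)
g = fake_quant_backward(x, np.ones_like(x), scale, qmin=-8, qmax=7)
```

Here `y` is approximately `[-0.8, -0.3, 0.0, 0.3, 0.7]` and `g` is `[0, 1, 1, 1, 0]`: the training loss sees the rounding error in the forward pass, while gradients still reach the unclipped weights.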

Modern LLM quantisation methods address the central pathology that LLM activations contain rare but large outliers in a few channels, which expand the per-tensor scale and crush precision on the typical channels:

GPTQ (Frantar et al., 2022): layer-wise weight-only quantisation that uses second-order information from the layer's Hessian to round each weight in the direction that minimises output error. See the GPTQ entry for details.
AWQ (Activation-aware Weight Quantisation, Lin et al., 2023): observes that not all weights matter equally: the weights in channels with large activation magnitude contribute most to the output and should be protected. AWQ scales these weight channels up before quantisation and scales the matching activations down by the same factor, which lowers their relative rounding error without storing any weights at higher precision.
SmoothQuant (Xiao et al., 2022): redistributes the quantisation difficulty between weights and activations by applying a per-channel scaling $s$ that smooths activation outliers while compensating in the weights, enabling INT8 quantisation of both.
NF4 (NormalFloat-4, used in QLoRA): a non-uniform 4-bit format whose levels are spaced as the quantiles of a standard normal distribution, optimal for weights that are themselves approximately Gaussian.
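The SmoothQuant rescaling described above can be sketched as follows, using the per-channel factor from the paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$. The shapes, the synthetic outlier channel, the helper names and the choice $\alpha = 0.5$ are assumptions made for illustration.

```python
import numpy as np


def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    x has shape (tokens, c_in); w has shape (c_in, c_out).
    """
    act_max = np.max(np.abs(x), axis=0)  # (c_in,)
    w_max = np.max(np.abs(w), axis=1)    # (c_in,)
    return act_max ** alpha / w_max ** (1.0 - alpha)


def channel_spread(per_channel_max):
    """Ratio between the largest and smallest per-channel magnitude."""
    return float(per_channel_max.max() / per_channel_max.min())


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
x[:, 3] *= 40.0                      # one synthetic outlier channel
w = rng.normal(size=(16, 8))

s = smooth_scales(x, w, alpha=0.5)
x_smooth = x / s                     # activations divided per input channel
w_smooth = w * s[:, None]            # weights multiplied by the same factors

same_output = np.allclose(x @ w, x_smooth @ w_smooth)
spread_before = channel_spread(np.max(np.abs(x), axis=0))
spread_after = channel_spread(np.max(np.abs(x_smooth), axis=0))
```

The product is unchanged because the factors cancel, while the outlier channel's magnitude is partly moved into the weights, so a single per-tensor activation scale wastes less of its range. With $\alpha = 0.5$ the per-channel maxima of the smoothed activations and weights coincide.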

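For inference the quantised matmul $y = W x$ is implemented as $y = s_W (q_W x)$ for weight-only quantisation, or as $y = s_W s_x (q_W q_x)$ when activations are also quantised. In the second case the integer matmul runs on dedicated hardware (NVIDIA's INT8 tensor cores on A100 and H100 deliver roughly $2\times$ the dense FP16 tensor-core throughput); in the weight-only case the gain comes mainly from reading fewer bytes of weights from memory. Dequantisation is typically fused into the matmul kernel or into the next layer's input scaling.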
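Both forms of the quantised matmul can be emulated in NumPy. This is a sketch under stated assumptions: symmetric INT8, one scale per output channel for the weights, a single per-tensor scale for the activations, and random data. `quantise_symmetric` is a hypothetical helper, and real kernels fuse these steps rather than running them separately.

```python
import numpy as np


def quantise_symmetric(a, axis=None, n_bits=8):
    """Symmetric quantisation; returns (int8 values, scale)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(a), axis=axis, keepdims=axis is not None) / qmax
    q = np.clip(np.round(a / scale), -qmax, qmax).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))   # (out_features, in_features)
x = rng.normal(size=64)

q_W, s_W = quantise_symmetric(W, axis=1)   # s_W has shape (32, 1)
q_x, s_x = quantise_symmetric(x)           # s_x is a scalar

# Weight-only: y = s_W * (q_W x), with the matmul done in floating point.
y_weight_only = s_W[:, 0] * (q_W.astype(np.float64) @ x)

# Weights and activations: integer matmul accumulated in int32,
# followed by a single rescale by s_W * s_x.
acc = q_W.astype(np.int32) @ q_x.astype(np.int32)
y_w8a8 = s_W[:, 0] * s_x * acc

y_ref = W @ x


def rel_error(y):
    return float(np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref))
```

On this random example both `rel_error(y_weight_only)` and `rel_error(y_w8a8)` come out small; the accumulation in `int32` matters because a sum of 64 products of 8-bit values does not fit in 8 or 16 bits.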

The accuracy degradation from 4-bit quantisation on well-trained 70B+ models is typically under one perplexity point or a few percent on downstream benchmarks, small enough that 4-bit has become a common default for open-weight LLM inference. Below 4 bits, accuracy degrades sharply unless QAT or specialised methods (AQLM, QuIP#) are used.

Related terms: GPTQ, Mixed Precision Training, KV Cache, vLLM, Knowledge Distillation

Discussed in:

Chapter 15: Modern AI, Engineering at Scale
