Glossary

Quantisation

Quantisation reduces the numerical precision of a neural network's weights and activations, trading a small amount of accuracy for substantial reductions in memory and compute cost. Models trained in 32-bit floating point (FP32) can often be converted to 16-bit (FP16 or BF16) with negligible quality loss—indeed, most modern training uses mixed precision from the start. Post-training quantisation to 8-bit integers (INT8) typically preserves most quality while halving memory compared to FP16.
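The conversion to INT8 can be sketched in a few lines. The example below is a minimal illustration of symmetric post-training quantisation, assuming a single scale per tensor; the weight values and shapes are made up for demonstration and do not come from any real model.

```python
import numpy as np

# Illustrative FP32 weight tensor (values and shape are arbitrary).
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4, 8)).astype(np.float32)

# Symmetric quantisation: map [-max|w|, +max|w|] onto the INT8 range [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantise back to FP32 when the weights are needed in a matmul.
w_deq = w_int8.astype(np.float32) * scale

# Rounding bounds the per-weight error by scale / 2, and INT8 storage
# is a quarter the size of FP32 (half the size of FP16).
print("max abs error:", np.abs(w - w_deq).max())
print("memory vs FP32:", w_int8.nbytes / w.nbytes)  # 0.25
```

Real toolkits add per-channel scales, calibration data for activations, and careful handling of outliers, but the scale-round-clip core is the same.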

More aggressive quantisation to 4 bits or lower (using techniques such as GPTQ and AWQ, supported by libraries and formats such as GGML and bitsandbytes) reduces memory by a further factor of two relative to INT8, or four relative to FP16. The key insight is that the distribution of weights and activations in transformer models contains many values near zero, which can be represented with few bits, while a small number of outliers require higher precision. Mixed-precision and group-wise quantisation schemes exploit this structure to minimise quality loss.
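Group-wise quantisation limits outlier damage by giving each small group of weights its own scale, so one extreme value only inflates the scale of its own group. The sketch below illustrates the idea for symmetric 4-bit quantisation; the group size, weight values, and function names are illustrative assumptions, not the actual GPTQ or AWQ algorithms (which additionally use calibration data to choose the quantised values).

```python
import numpy as np

def quantise_groupwise(w, group_size=64, bits=4):
    """Symmetric group-wise quantisation: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantise_groupwise(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, size=(8, 128)).astype(np.float32)
w.flat[5] = 1.0  # a single outlier weight

q, scales = quantise_groupwise(w)
w_deq = dequantise_groupwise(q, scales, w.shape)

# The error in any group is bounded by that group's scale / 2; only the
# group containing the outlier pays for it, not the whole tensor.
print("max abs error:", np.abs(w - w_deq).max())
```

With a single per-tensor scale, the outlier would force a scale of 1.0 / 7 over all 1024 weights; here 1023 of them keep fine-grained scales near 0.02 / 7.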

Quantisation is essential for deploying large models on resource-constrained hardware. A 70-billion-parameter model in FP32 requires 280GB of memory; in INT4 it fits in 35GB, making it possible to serve on a single high-end GPU or on multiple consumer GPUs. Quantisation also speeds up inference: lower-precision operations consume less memory bandwidth and can often be computed faster with specialised hardware instructions. Along with distillation (training a smaller student to mimic a larger teacher) and pruning (removing unnecessary parameters), quantisation is a core technique in the efficient AI toolkit that makes modern LLMs practical to deploy.
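The memory figures above follow directly from bytes per parameter, and the same back-of-the-envelope arithmetic applies to any model size:

```python
# Weight-memory estimate for a 70B-parameter model at common precisions.
# This counts parameters only; KV cache and activations add further memory.
params = 70e9
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for name, b in bytes_per_param.items():
    print(f"{name}: {params * b / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```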

Related terms: Knowledge Distillation, Pruning

Also defined in: Textbook of AI