Inference quantisation replaces high-precision tensors (BF16, FP32) with lower-precision representations to save HBM bandwidth, HBM capacity, and, with hardware support, matmul compute time.
Formats:
- INT8: 8-bit integer with a per-tensor or per-channel scale factor $s$ such that $w \approx s \cdot q$ where $q \in [-128, 127]$. Hardware-accelerated on every NVIDIA GPU since Turing, and broadly supported on other accelerators (round-trip sketch after this list).
- FP8: two variants, E4M3 (4-bit exponent, 3-bit mantissa, range ±448, used for weights/activations) and E5M2 (5-bit exponent, 2-bit mantissa, range ±57344, used for gradients). Hopper and later.
- INT4: 4-bit integer, range $[-8, 7]$. GPTQ and AWQ target this format (SmoothQuant, often grouped with them, is an INT8 W8A8 method; see the outlier problem below). Used heavily for weight-only quantisation of open-weight LLMs.
- FP4: E2M1 4-bit floating point. Blackwell hardware support; emerging algorithm work (NVFP4).
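All of these formats share the affine pattern $w \approx s \cdot q$. A minimal sketch of the symmetric per-tensor INT8 case (NumPy; function names and shapes are illustrative, not from any particular library):

```python
# Symmetric per-tensor INT8 quantisation: w ≈ s · q with q in [-128, 127].
# Illustrative sketch; real kernels fuse dequantisation into the matmul.
import numpy as np

def quantize_int8(w: np.ndarray):
    s = np.abs(w).max() / 127.0                        # one scale for the tensor
    q = np.clip(np.round(w / s), -128, 127).astype(np.int8)
    return q, s

def dequantize_int8(q: np.ndarray, s: float) -> np.ndarray:
    return q.astype(np.float32) * s                    # w ≈ s · q

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print(f"max abs error = {np.abs(dequantize_int8(q, s) - w).max():.5f}")  # ≤ s/2
```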
Granularity of scaling (contrasted in the sketch after this list):
- Per-tensor: one scale per tensor. Cheapest, worst quality on activations.
- Per-channel (per row of $W$ or per output feature): one scale per output channel. Standard for weights.
- Per-group (e.g. groups of 128 elements): one scale per group, used by GPTQ/AWQ; trades a small bit-overhead for big quality gains at INT4.
- Per-token for activations: one scale per token along the sequence dimension.
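A sketch contrasting these four granularities on a weight matrix and an activation matrix (shapes, the group size of 128, and variable names are illustrative):

```python
# Scale granularities for W (out_features × in_features) and X (tokens × features).
import numpy as np

W = np.random.randn(1024, 1024).astype(np.float32)
X = np.random.randn(256, 1024).astype(np.float32)
G = 128  # group size, as in GPTQ/AWQ

s_tensor  = np.abs(W).max() / 127.0                              # 1 scale
s_channel = np.abs(W).max(axis=1) / 127.0                        # (1024,): per output row
s_group   = np.abs(W.reshape(1024, -1, G)).max(axis=2) / 127.0   # (1024, 8): per 128 weights
s_token   = np.abs(X).max(axis=1) / 127.0                        # (256,): per token

print(f"scales: tensor=1, channel={s_channel.size}, "
      f"group={s_group.size}, token={s_token.size}")
```

The overhead of finer granularity is small: a 16-bit scale per 128-element group adds 0.125 bits per weight, which is why per-group scaling is the standard trade at INT4.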
What gets quantised:
- Weight-only quantisation (INT4 weights, BF16 activations): keeps activations in BF16 and dequantises weights on the fly. Near-lossless quality with a large bandwidth saving, since weights dominate HBM traffic in batch-1 decode. Popular for consumer LLM serving (llama.cpp, AWQ, GPTQ).
- Weight-and-activation quantisation (W8A8, W4A8, FP8): both quantised, matmul executes in low precision on tensor cores. Required for compute-bound prefill and high-batch decode.
- KV-cache quantisation: KV cache stored in INT8, INT4 or FP8. Huge memory savings at long context: a 128 k-token Llama 3 70B KV cache in BF16 is ~40 GB; INT4 brings it to ~10 GB (arithmetic after this list). Per-channel scaling along the head dimension preserves quality.
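The KV-cache figure follows directly from Llama 3 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128); a worked version of the arithmetic:

```python
# KV-cache size: 2 (K and V) × layers × kv_heads × head_dim values per token.
layers, kv_heads, head_dim = 80, 8, 128        # Llama 3 70B (GQA)
tokens = 128 * 1024

values_per_token = 2 * layers * kv_heads * head_dim
for fmt, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = values_per_token * tokens * bits / 8 / 2**30
    print(f"{fmt}: {gib:.0f} GiB")             # BF16: 40, INT8: 20, INT4: 10
```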
Throughput gains:
- INT8 vs BF16: 2× tensor-core throughput, 2× HBM bandwidth saving, typical end-to-end speedup 1.5–1.8×.
- FP8 vs BF16: 2× throughput on Hopper, ~1.7× end-to-end.
- INT4/FP4 vs BF16: 4× tensor-core throughput, 4× memory saving; weight-only INT4 gives ~2.5× end-to-end on H100 (back-of-envelope after this list).
- 2:4 sparsity + quantisation: another 2×, so combined ~4–8× throughput vs dense BF16.
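Why weight-only INT4 lands at ~2.5× rather than the ideal 4×: batch-1 decode is bandwidth-bound, so an upper bound on the speedup comes from counting weight bytes moved per token. A back-of-envelope sketch (the 70B parameter count and ~3.35 TB/s H100 HBM bandwidth are illustrative assumptions):

```python
# Bandwidth-bound decode model: time per token ≈ weight bytes / HBM bandwidth.
params = 70e9          # weights are read once per decoded token
bw = 3.35e12           # H100 SXM HBM3, bytes/s

t_bf16 = params * 2.0 / bw
for fmt, bytes_per_weight in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    t = params * bytes_per_weight / bw
    print(f"{fmt}: {t*1e3:.1f} ms/token, {t_bf16/t:.0f}x vs BF16 (ideal)")
# Observed ~2.5x end-to-end sits below the 4x ideal because activations, the
# KV cache, attention, and dequantisation overhead are unchanged.
```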
Quality preservation: with care (a calibration set of ≥ 512 samples, per-channel weight scales, per-token activation scales, outlier handling; a minimal calibration loop follows), INT8 W8A8 typically loses <0.5 % on standard LLM benchmarks; FP8 loses <0.3 %; INT4 weight-only loses 0.5–1 %; INT4 W4A4 loses 2–4 % and is rarely worth it.
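A minimal sketch of the calibration step named above: stream a few hundred samples, keep a running per-channel absmax, and derive static scales (the observer class and shapes are our own illustration, not any library's API):

```python
# Static activation calibration with a running per-channel absmax observer.
import numpy as np

class AbsMaxObserver:
    def __init__(self, channels: int):
        self.absmax = np.zeros(channels, dtype=np.float32)

    def update(self, x: np.ndarray):               # x: (tokens, channels)
        self.absmax = np.maximum(self.absmax, np.abs(x).max(axis=0))

    def scales(self) -> np.ndarray:                # symmetric INT8 scales
        return self.absmax / 127.0

obs = AbsMaxObserver(channels=4096)
for _ in range(512):                               # ≥ 512 calibration samples
    acts = np.random.randn(64, 4096).astype(np.float32)  # stand-in activations
    obs.update(acts)
print(obs.scales()[:4])
```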
Outlier problem: a few activation channels in transformers carry magnitudes up to 100× larger than the rest (Dettmers et al., LLM.int8()). Naive per-tensor INT8 collapses on these, because a single scale must cover the outliers and the bulk of values simultaneously. SmoothQuant migrates the difficulty from activations to weights via per-channel scales: $X' = X \cdot \mathrm{diag}(s)^{-1}$, $W' = \mathrm{diag}(s) \cdot W$, leaving the product $X'W' = XW$ unchanged (sketch below). AWQ (Activation-aware Weight Quantisation) identifies the ~1 % of salient weight channels by activation magnitude and protects them by scaling before quantisation rather than keeping them at higher precision. GPTQ uses second-order (Hessian) information to round weights so as to minimise layer output error.
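A sketch of the SmoothQuant migration, using the paper's per-channel scale $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha \approx 0.5$ (shapes and names are illustrative):

```python
# SmoothQuant: X' = X · diag(s)^-1, W' = diag(s) · W, so X'W' = XW exactly,
# but outlier magnitude moves from activations (hard to quantise) to weights.
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    # X: (tokens, in_features); W: (in_features, out_features)
    ax = np.abs(X).max(axis=0)                     # per-channel activation absmax
    aw = np.abs(W).max(axis=1)                     # per-channel weight absmax
    s = ax**alpha / aw**(1 - alpha)
    return X / s, W * s[:, None]

X = np.random.randn(16, 512).astype(np.float32)
X[:, 0] *= 100.0                                   # inject an outlier channel
W = np.random.randn(512, 256).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws, rtol=1e-4, atol=1e-2))  # True: product preserved
print(np.abs(X).max(), np.abs(Xs).max())           # activation outlier shrinks
```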
Production systems: TensorRT-LLM, vLLM, SGLang, llama.cpp all support FP8/INT8/INT4 weight loading. Open-weight model releases now ship official quantised variants alongside BF16 (e.g. Llama 3.1 in FP8 from Meta, Qwen2 in INT4 GPTQ).
Related terms: Tensor Cores, KV Cache, vLLM, PagedAttention, Mixed Precision Training, Quantisation
Discussed in:
- Chapter 15: Modern AI