Inference quantisation replaces high-precision tensors (BF16, FP32) with lower-precision representations to save HBM bandwidth, HBM capacity, and, with hardware support, matmul compute time.
Formats:
- INT8: 8-bit integer with a per-tensor or per-channel scale factor $s$ such that $w \approx s \cdot q$ where $q \in [-128, 127]$. Hardware-accelerated on every NVIDIA GPU since Turing, and broadly supported on other accelerators (round-trip sketch after this list).
- FP8: two variants, E4M3 (4-bit exponent, 3-bit mantissa, range ±448, used for weights/activations) and E5M2 (5-bit exponent, 2-bit mantissa, range ±57344, used for gradients). Hopper and later.
- INT4: 4-bit integer, range $[-8, 7]$. GPTQ and AWQ target this format (SmoothQuant, often grouped with them, is an INT8 W8A8 method; see the outlier problem below). Used heavily for weight-only quantisation of open-weight LLMs.
- FP4: E2M1 4-bit floating point. Blackwell hardware support; emerging algorithm work (NVFP4).
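All of these formats share the affine pattern $w \approx s \cdot q$. A minimal sketch of the symmetric per-tensor INT8 case (NumPy; function names and shapes are illustrative, not from any particular library):

```python
# Symmetric per-tensor INT8 quantisation: w ≈ s · q with q in [-128, 127].
# Illustrative sketch; real kernels fuse dequantisation into the matmul.
import numpy as np

def quantize_int8(w: np.ndarray):
    s = np.abs(w).max() / 127.0                        # one scale for the tensor
    q = np.clip(np.round(w / s), -128, 127).astype(np.int8)
    return q, s

def dequantize_int8(q: np.ndarray, s: float) -> np.ndarray:
    return q.astype(np.float32) * s                    # w ≈ s · q

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print(f"max abs error = {np.abs(dequantize_int8(q, s) - w).max():.5f}")  # ≤ s/2
```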
Granularity of scaling (contrasted in the sketch after this list):
- Per-tensor: one scale per tensor. Cheapest, worst quality on activations.
- Per-channel (per row of $W$ or per output feature): one scale per output channel. Standard for weights.
- Per-group (e.g. groups of 128 elements): one scale per group, used by GPTQ/AWQ; trades a small bit-overhead for big quality gains at INT4.
- Per-token for activations: one scale per token along the sequence dimension.
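A sketch contrasting these four granularities on a weight matrix and an activation matrix (shapes, the group size of 128, and variable names are illustrative):

```python
# Scale granularities for W (out_features × in_features) and X (tokens × features).
import numpy as np

W = np.random.randn(1024, 1024).astype(np.float32)
X = np.random.randn(256, 1024).astype(np.float32)
G = 128  # group size, as in GPTQ/AWQ

s_tensor  = np.abs(W).max() / 127.0                              # 1 scale
s_channel = np.abs(W).max(axis=1) / 127.0                        # (1024,): per output row
s_group   = np.abs(W.reshape(1024, -1, G)).max(axis=2) / 127.0   # (1024, 8): per 128 weights
s_token   = np.abs(X).max(axis=1) / 127.0                        # (256,): per token

print(f"scales: tensor=1, channel={s_channel.size}, "
      f"group={s_group.size}, token={s_token.size}")
```

The overhead of finer granularity is small: a 16-bit scale per 128-element group adds 0.125 bits per weight, which is why per-group scaling is the standard trade at INT4.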
What gets quantised:
- Weight-only quantisation (INT4 weights, BF16 activations): keeps activations in BF16 and dequantises weights on the fly. Near-lossless quality with a large bandwidth saving, since weights dominate HBM traffic in batch-1 decode. Popular for consumer LLM serving (llama.cpp, AWQ, GPTQ).
- Weight-and-activation quantisation (W8A8, W4A8, FP8): both quantised, matmul executes in low precision on tensor cores. Required for compute-bound prefill and high-batch decode.
- KV-cache quantisation: KV cache stored in INT8, INT4 or FP8. Huge memory savings at long context: a 128 k-token Llama 3 70B KV cache in BF16 is ~40 GB; INT4 brings it to ~10 GB (arithmetic after this list). Per-channel scaling along the head dimension preserves quality.
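The KV-cache figure follows directly from Llama 3 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128); a worked version of the arithmetic:

```python
# KV-cache size: 2 (K and V) × layers × kv_heads × head_dim values per token.
layers, kv_heads, head_dim = 80, 8, 128        # Llama 3 70B (GQA)
tokens = 128 * 1024

values_per_token = 2 * layers * kv_heads * head_dim
for fmt, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = values_per_token * tokens * bits / 8 / 2**30
    print(f"{fmt}: {gib:.0f} GiB")             # BF16: 40, INT8: 20, INT4: 10
```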
Throughput gains:
- INT8 vs BF16: 2× tensor-core throughput, 2× HBM bandwidth saving, typical end-to-end speedup 1.5–1.8×.
- FP8 vs BF16: 2× throughput on Hopper, ~1.7× end-to-end.
- INT4/FP4 vs BF16: 4× tensor-core throughput, 4× memory saving; weight-only INT4 gives ~2.5× end-to-end on H100 (back-of-envelope after this list).
- 2:4 sparsity + quantisation: another 2×, so combined ~4–8× throughput vs dense BF16.
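Why weight-only INT4 lands at ~2.5× rather than the ideal 4×: batch-1 decode is bandwidth-bound, so an upper bound on the speedup comes from counting weight bytes moved per token. A back-of-envelope sketch (the 70B parameter count and ~3.35 TB/s H100 HBM bandwidth are illustrative assumptions):

```python
# Bandwidth-bound decode model: time per token ≈ weight bytes / HBM bandwidth.
params = 70e9          # weights are read once per decoded token
bw = 3.35e12           # H100 SXM HBM3, bytes/s

t_bf16 = params * 2.0 / bw
for fmt, bytes_per_weight in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    t = params * bytes_per_weight / bw
    print(f"{fmt}: {t*1e3:.1f} ms/token, {t_bf16/t:.0f}x vs BF16 (ideal)")
# Observed ~2.5x end-to-end sits below the 4x ideal because activations, the
# KV cache, attention, and dequantisation overhead are unchanged.
```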
Quality preservation: with care (a calibration set of ≥ 512 samples, per-channel weight scales, per-token activation scales, outlier handling; a minimal calibration loop follows), INT8 W8A8 typically loses <0.5 % on standard LLM benchmarks; FP8 loses <0.3 %; INT4 weight-only loses 0.5–1 %; INT4 W4A4 loses 2–4 % and is rarely worth it.
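A minimal sketch of the calibration step named above: stream a few hundred samples, keep a running per-channel absmax, and derive static scales (the observer class and shapes are our own illustration, not any library's API):

```python
# Static activation calibration with a running per-channel absmax observer.
import numpy as np

class AbsMaxObserver:
    def __init__(self, channels: int):
        self.absmax = np.zeros(channels, dtype=np.float32)

    def update(self, x: np.ndarray):               # x: (tokens, channels)
        self.absmax = np.maximum(self.absmax, np.abs(x).max(axis=0))

    def scales(self) -> np.ndarray:                # symmetric INT8 scales
        return self.absmax / 127.0

obs = AbsMaxObserver(channels=4096)
for _ in range(512):                               # ≥ 512 calibration samples
    acts = np.random.randn(64, 4096).astype(np.float32)  # stand-in activations
    obs.update(acts)
print(obs.scales()[:4])
```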
Outlier problem: a few activation channels in transformers carry magnitudes up to 100× larger than the rest (Dettmers et al., LLM.int8()). Naive per-tensor INT8 collapses on these, because a single scale must cover the outliers and the bulk of values simultaneously. SmoothQuant migrates the difficulty from activations to weights via per-channel scales: $X' = X \cdot \mathrm{diag}(s)^{-1}$, $W' = \mathrm{diag}(s) \cdot W$, leaving the product $X'W' = XW$ unchanged (sketch below). AWQ (Activation-aware Weight Quantisation) identifies the ~1 % of salient weight channels by activation magnitude and protects them by scaling before quantisation rather than keeping them at higher precision. GPTQ uses second-order (Hessian) information to round weights so as to minimise layer output error.
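A sketch of the SmoothQuant migration, using the paper's per-channel scale $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha \approx 0.5$ (shapes and names are illustrative):

```python
# SmoothQuant: X' = X · diag(s)^-1, W' = diag(s) · W, so X'W' = XW exactly,
# but outlier magnitude moves from activations (hard to quantise) to weights.
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    # X: (tokens, in_features); W: (in_features, out_features)
    ax = np.abs(X).max(axis=0)                     # per-channel activation absmax
    aw = np.abs(W).max(axis=1)                     # per-channel weight absmax
    s = ax**alpha / aw**(1 - alpha)
    return X / s, W * s[:, None]

X = np.random.randn(16, 512).astype(np.float32)
X[:, 0] *= 100.0                                   # inject an outlier channel
W = np.random.randn(512, 256).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws, rtol=1e-4, atol=1e-2))  # True: product preserved
print(np.abs(X).max(), np.abs(Xs).max())           # activation outlier shrinks
```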
Production systems: TensorRT-LLM, vLLM, SGLang, llama.cpp all support FP8/INT8/INT4 weight loading. Open-weight model releases now ship official quantised variants alongside BF16 (e.g. Llama 3.1 in FP8 from Meta, Qwen2 in INT4 GPTQ).
Related terms: Tensor Cores, KV Cache, vLLM, PagedAttention, Mixed Precision Training, Quantisation
Discussed in:
- Chapter 15: Modern AI