Glossary

GPTQ

GPTQ (Generative Pre-trained Transformer Quantisation; Frantar, Ashkboos, Hoefler & Alistarh, 2022) is a post-training quantisation algorithm tailored to large language models. It quantises the weights of each linear layer in a transformer to 3 or 4 bits per parameter while losing very little accuracy, typically under one point of WikiText-2 perplexity for models ranging from OPT-175B to Llama-2-70B. GPTQ runs in a single pass over a small calibration set, with no fine-tuning or retraining, and it made open-source LLM inference on consumer GPUs practical: the weights of a 4-bit GPTQ Llama-2-70B occupy roughly 35 GB, so the model runs interactively on a single 48 GB GPU.

The mathematical foundation is the Optimal Brain Quantisation (OBQ) framework, itself descended from Hassibi and Stork's Optimal Brain Surgeon (which in turn builds on LeCun's Optimal Brain Damage). Consider a single linear layer $y = W x$ and a set of calibration inputs $X = [x_1, \dots, x_N]$. Given a quantised approximation $\hat{W}$, the layer-output reconstruction error is

$$\mathcal{L}(\hat{W}) = \|W X - \hat{W} X\|_F^2.$$

The Hessian of this objective with respect to any one row of $\hat{W}$ is $H = 2 X X^\top$, a $d_\mathrm{in} \times d_\mathrm{in}$ matrix that captures the input statistics and is shared across all rows. If we round one weight $W_{ij}$ to its quantised value $\hat{W}_{ij}$, the optimal update to all other weights in the same row $i$ that minimises $\mathcal{L}$ is
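As a minimal numerical sketch (NumPy, toy dimensions, all names mine), the Hessian is just twice the Gram matrix of the calibration inputs, and because the loss is quadratic a second difference recovers it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_samples = 8, 4, 256

X = rng.standard_normal((d_in, n_samples))   # calibration inputs, one column per sample
W = rng.standard_normal((d_out, d_in))       # original layer weights

# Reconstruction error ||W X - W_hat X||_F^2 for a perturbed W_hat.
W_hat = W + 0.01 * rng.standard_normal(W.shape)
loss = np.sum((W @ X - W_hat @ X) ** 2)

# Hessian w.r.t. any one row of W_hat: depends only on X, shared across rows.
H = 2.0 * X @ X.T
assert H.shape == (d_in, d_in)
assert np.allclose(H, H.T)                   # symmetric PSD by construction

# Check: for a quadratic objective, the second difference equals d^T H d exactly.
r = W_hat[0]
d = rng.standard_normal(d_in)
L = lambda v: np.sum((W[0] @ X - v @ X) ** 2)
assert np.isclose(L(r + d) + L(r - d) - 2 * L(r), d @ H @ d)
```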

$$\delta W_{i,k} = -\frac{W_{ij} - \hat{W}_{ij}}{[H^{-1}]_{jj}} \cdot [H^{-1}]_{jk} \quad \text{for } k \neq j.$$

OBQ applied this update one weight at a time, each time picking the weight whose rounding error contributes least to the residual loss. For a $d \times d$ weight matrix this requires $O(d^4)$ work per layer, infeasible at LLM scale.
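The compensation formula can be verified numerically. In this sketch (NumPy, toy sizes, names mine), one weight of a row is rounded to a grid point, the remaining weights receive the $H^{-1}$-based update, and the result is compared with rounding that same weight without compensation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, n = 6, 512
X = rng.standard_normal((d_in, n))
H = 2.0 * X @ X.T
Hinv = np.linalg.inv(H)

w = rng.standard_normal(d_in)            # one row of W
j = 0                                    # index of the weight being quantised
scale = 0.5
w_q = scale * np.round(w[j] / scale)     # round weight j onto the grid

# Naive: round weight j, leave the rest untouched.
w_naive = w.copy(); w_naive[j] = w_q

# OBQ: same rounding, plus the compensating update to the other weights.
w_obq = w.copy(); w_obq[j] = w_q
delta = -(w[j] - w_q) / Hinv[j, j] * Hinv[j]
delta[j] = 0.0                           # weight j itself stays fixed at w_q
w_obq += delta

# Layer-output error; the compensated row can never do worse.
err = lambda v: np.sum(((w - v) @ X) ** 2)
assert err(w_obq) <= err(w_naive) + 1e-9
```

Because the update is the exact minimiser of the quadratic loss subject to the quantised weight being fixed, the compensated error is provably no larger than the naive one.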

GPTQ's three innovations make this practical:

  1. Arbitrary order. OBQ chose the next weight to quantise greedily based on Hessian information; GPTQ shows that for large enough models, a fixed left-to-right column order is essentially as good. This eliminates the per-step Hessian-based selection.
  2. Lazy batched updates. Rather than updating the entire weight matrix after every individual quantisation, GPTQ processes columns in blocks of 128 and accumulates updates across the block, reducing memory traffic by two orders of magnitude.
  3. Cholesky reformulation. The required $H^{-1}$ entries can be obtained from the Cholesky factor $L$ of $H$, which is numerically stable and can be computed once per layer. The per-block update reduces to a single triangular solve.

The algorithm's complexity thus drops to $O(d^3)$ per layer; quantising a 70B-parameter model takes a few hours on one GPU.
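A minimal, unblocked sketch of the resulting column-by-column procedure (NumPy; the function name, sizes, and damping factor are mine; for clarity it maintains the running inverse with the OBQ rank-one downdate rather than the Cholesky factor used by the real implementation):

```python
import numpy as np

def gptq_quantise(W, X, scale, damp=0.01):
    """Toy GPTQ: quantise columns of W left to right; after each column,
    spread its rounding error over the not-yet-quantised columns."""
    d_in = W.shape[1]
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)   # dampening for stability
    Hinv = np.linalg.inv(H)
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    for j in range(d_in):
        Q[:, j] = scale * np.round(W[:, j] / scale)  # round column j
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W -= np.outer(err, Hinv[j])                  # compensating update
        # Remove row/column j from the running inverse (rank-one downdate);
        # this zeroes Hinv[j], so quantised columns are never touched again.
        Hinv -= np.outer(Hinv[:, j], Hinv[j]) / Hinv[j, j]
    return Q

rng = np.random.default_rng(0)
d_out, d_in, n = 32, 16, 64
X = rng.standard_normal((d_in, d_in)) @ rng.standard_normal((d_in, n))  # correlated inputs
W = rng.standard_normal((d_out, d_in))
Q = gptq_quantise(W, X, scale=0.25)
Q_rtn = 0.25 * np.round(W / 0.25)                    # plain round-to-nearest
err = lambda A: np.sum(((W - A) @ X) ** 2)
# With correlated inputs, error compensation typically beats plain rounding.
assert err(Q) <= err(Q_rtn)
```

The real algorithm gets the same per-step quantities from a single Cholesky factorisation of $H^{-1}$ and batches the updates over blocks of 128 columns; this toy trades that efficiency for readability.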

The quantisation grid itself is standard symmetric uniform quantisation per output channel:

$$q = \mathrm{round}\!\left(\frac{w}{s}\right), \qquad s = \frac{\max|w|}{2^{b-1} - 1},$$

with $b = 4$ bits the typical setting. GPTQ's contribution is not the grid itself but the joint optimisation of the rounding direction and the compensating updates to as-yet-unquantised weights, ensuring that each rounding error is partially absorbed by neighbouring weights rather than left to corrupt the layer output.
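The grid itself fits in a few lines (NumPy; function name and toy data are mine), with the standard half-step error bound following directly from the rounding:

```python
import numpy as np

def quantise_symmetric(w, bits=4):
    """Symmetric uniform quantisation with one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4 bits
    s = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-row scale
    q = np.round(w / s)                              # integer codes in [-qmax, qmax]
    return q, s

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))
q, s = quantise_symmetric(w)
w_hat = q * s                                        # dequantised weights
assert np.all(np.abs(q) <= 7)
assert np.all(np.abs(w - w_hat) <= s / 2 + 1e-12)    # rounding error <= half a step
```

In GPTQ the `np.round` step is applied column by column with the compensating updates described above, rather than to the whole matrix at once.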

GPTQ's practical impact has been enormous. The reference implementation and the AutoGPTQ library quantise any Hugging Face transformer with a few lines of Python; quantised checkpoints for essentially every open LLM (Llama, Qwen, Mistral, Gemma) ship as *-GPTQ repositories. ExLlamaV2 and vLLM both ship GPTQ-aware kernels that fuse dequantisation into the matrix multiply, achieving close to BF16 throughput.

Competing methods include AWQ (which scales activation-significant channels before quantising), SmoothQuant (which redistributes outliers between weights and activations), and NF4 / QLoRA (which uses a 4-bit non-uniform grid for fine-tuning). For pure inference, GPTQ and AWQ are roughly tied on accuracy; GPTQ tends to win on 3-bit and below where its second-order compensation matters most.

Related terms: Quantisation, Pruning, Knowledge Distillation, vLLM, Mixed Precision Training, Transformer
