Glossary

Pruning

Pruning removes parameters from a trained neural network, exploiting the empirical fact that large networks are heavily over-parameterised: most weights can be set to zero with little impact on accuracy if done carefully. The technique dates to LeCun et al.'s Optimal Brain Damage (1990), saw a renaissance with Han et al.'s Deep Compression (2015), and has become a standard tool in production LLM deployment alongside quantisation and knowledge distillation.

The simplest prescription is magnitude pruning: zero out all weights with absolute value below a threshold $\tau$,

$$W_{ij}' = \begin{cases} W_{ij} & \text{if } |W_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases},$$

where $\tau$ is chosen to achieve a target sparsity (e.g. the threshold producing 90% zeros). The justification is that small weights contribute less to the network's outputs and so are safest to remove. Pruning is followed by fine-tuning on the original training data to let the surviving weights compensate for those removed. Without fine-tuning, accuracy typically collapses once sparsity exceeds 20–30%.
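The thresholding rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `magnitude_prune` and the choice of $\tau$ as the k-th smallest absolute value are assumptions made for the example.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude entries of W so that about `sparsity` of them are zero.

    Returns the pruned copy and the boolean keep-mask. With tied magnitudes,
    slightly more than the requested fraction may be removed.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * W.size))          # number of weights to remove
    if k == 0:
        return W.copy(), np.ones(W.shape, dtype=bool)
    # tau is the k-th smallest absolute value; entries with |w| <= tau are zeroed
    tau = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > tau
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print("fraction of zeros:", np.mean(W_pruned == 0))
```

In a real pipeline the mask would be kept and re-applied after every fine-tuning step so that pruned weights stay at zero.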

Pruning methods fall into two broad families:

  • Unstructured pruning zeros individual weights, producing a sparse matrix that retains the original shape. Maximal flexibility and minimal accuracy loss, but only useful if the inference runtime supports sparse matmul. A hardware-friendly compromise is NVIDIA Ampere's 2:4 semi-structured sparsity, where two of every four consecutive weights are zero, which gives up to a 2× speedup on sparse tensor cores.
  • Structured pruning removes entire neurons, attention heads, or transformer layers, producing a smaller dense network. Less flexible (accuracy drops faster) but immediately deployable on standard hardware with no special kernel.
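The two families can be contrasted with a short NumPy sketch. Both functions are illustrative assumptions: `prune_2_of_4` builds a 2:4 mask by magnitude within each group of four, and `prune_neurons` ranks the hidden neurons of a hypothetical two-layer MLP by the L2 norm of their incoming weights, which is only one of several possible importance criteria.

```python
import numpy as np

def prune_2_of_4(W):
    """Keep the 2 largest-magnitude weights in every group of 4 along each row.

    The output has the same shape as W, with exactly two zeros per group.
    """
    rows, cols = W.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be a multiple of 4")
    groups = np.abs(W).reshape(rows, cols // 4, 4)
    drop = np.argsort(groups, axis=-1)[..., :2]     # 2 smallest in each group
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return W * mask.reshape(rows, cols)

def prune_neurons(W1, b1, W2, keep_fraction):
    """Remove whole hidden neurons from a two-layer MLP.

    W1: (hidden, in), b1: (hidden,), W2: (out, hidden).
    Returns smaller dense arrays; no sparse kernel is needed to run them.
    """
    n_keep = max(1, int(round(keep_fraction * W1.shape[0])))
    norms = np.linalg.norm(W1, axis=1)              # assumed importance score
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
print(prune_2_of_4(W).shape)                        # shape unchanged, half the entries zero
W1, b1, W2 = rng.standard_normal((10, 6)), rng.standard_normal(10), rng.standard_normal((3, 10))
print([a.shape for a in prune_neurons(W1, b1, W2, keep_fraction=0.5)])
```

The first function leaves the matrix shape untouched and relies on the runtime to exploit the zeros; the second returns genuinely smaller matrices.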

Iterative magnitude pruning alternates between small pruning steps and fine-tuning rather than pruning to the target sparsity in one step. This typically reaches much higher final sparsity at the same accuracy: the network has time to redistribute representations across the surviving weights.
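The alternation of pruning and fine-tuning can be sketched as a loop over a flat weight vector. Everything here is an assumption for illustration: the geometric sparsity schedule, the `fine_tune(w, mask)` callback signature, and the toy least-squares problem standing in for a real network.

```python
import numpy as np

def iterative_magnitude_prune(w, fine_tune, target_sparsity, rounds):
    """Alternate small pruning steps with fine-tuning on a 1-D weight vector.

    The surviving fraction shrinks geometrically so that the target
    sparsity is reached after `rounds` rounds.
    """
    mask = np.ones(w.shape, dtype=bool)
    keep_per_round = (1.0 - target_sparsity) ** (1.0 / rounds)
    for r in range(1, rounds + 1):
        n_keep = max(1, int(round(w.size * keep_per_round ** r)))
        # rank only current survivors; weights already pruned stay pruned
        scores = np.where(mask, np.abs(w), -np.inf)
        keep_idx = np.argsort(scores)[-n_keep:]
        mask = np.zeros(w.shape, dtype=bool)
        mask[keep_idx] = True
        w = fine_tune(w * mask, mask)               # let survivors compensate
    return w, mask

# Toy problem: linear regression where only 4 of 20 inputs matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ w_true

def fine_tune(w, mask, steps=200, lr=0.05):
    """Gradient descent on mean squared error, holding pruned weights at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w_dense = fine_tune(rng.standard_normal(20), np.ones(20, dtype=bool))
w_sparse, mask = iterative_magnitude_prune(w_dense, fine_tune, target_sparsity=0.8, rounds=4)
print("surviving weights:", int(mask.sum()))
```

One-shot pruning corresponds to `rounds=1`; increasing `rounds` trades extra fine-tuning compute for a gentler path to the same sparsity.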

The lottery ticket hypothesis (Frankle & Carbin, 2019) is the best-known modern result. Frankle and Carbin showed that a randomly initialised dense network contains a small sparse subnetwork, the winning ticket, which, when trained in isolation from the original initialisation, matches the dense network's accuracy. Finding the ticket requires the iterative pruning procedure: train the dense network, prune the smallest weights, reset the survivors to their original initial values, retrain. Tickets can reach 90%+ sparsity on vision benchmarks. The hypothesis reframes training as a search over subnetworks rather than over weights, and it has driven much of the modern interest in pruning.
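The train, prune, reset, retrain loop can be sketched as follows. This shows only the control flow: the model is a toy linear regression (an assumption chosen to keep the example short), so it does not demonstrate the lottery ticket phenomenon itself, and the names `train`, `find_ticket`, and `prune_frac` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X[:, :4] @ np.array([3.0, -2.0, 1.5, 1.0])      # only 4 inputs matter

def train(w, mask, steps=300, lr=0.05):
    """Gradient descent on mean squared error, with pruned weights held at zero."""
    w = w * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def find_ticket(w_init, rounds=4, prune_frac=0.33):
    """Iteratively prune, always retraining from the original initial values."""
    mask = np.ones(w_init.shape, dtype=bool)
    for _ in range(rounds):
        w_trained = train(w_init, mask)             # 1. train from the original init
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        order = np.argsort(np.abs(w_trained[alive]))
        mask[alive[order[:n_prune]]] = False        # 2. prune the smallest survivors
        # 3. reset: w_trained is discarded, the next round restarts from w_init
    return mask

w_init = 0.1 * rng.standard_normal(20)
ticket_mask = find_ticket(w_init)
w_ticket = train(w_init, ticket_mask)               # final retraining of the ticket
print("ticket size:", int(ticket_mask.sum()), "of", w_init.size)
```

The difference from ordinary iterative magnitude pruning is step 3: the trained values are used only to decide which weights to remove, never as the starting point for the next round.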

For LLMs specifically, magnitude pruning struggles. Frantar & Alistarh's SparseGPT (2023) addresses this with a layer-wise method analogous to GPTQ: for each linear layer, the optimal mask and the optimal weight update for the surviving entries are computed jointly using approximate second-order information, minimising the layer-output reconstruction error. SparseGPT achieves 50% unstructured sparsity on OPT-175B with negligible perplexity increase, in one-shot fashion (no fine-tuning needed). Wanda (Sun et al., 2023) is a simpler variant that scores each weight by $|W_{ij}| \cdot \|x_j\|_2$, its magnitude weighted by the norm of the corresponding input activation, and achieves comparable results without any second-order solve.
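The Wanda score is simple enough to sketch directly. The function below is an illustrative assumption rather than the authors' code: `X` stands for a small batch of calibration activations, and the choice to rank weights within each output row (rather than across the whole layer) follows my reading of the Wanda paper and is not stated in the text above.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Prune a linear layer using the score |W_ij| * ||x_j||_2.

    W: (out_features, in_features) weight matrix.
    X: (n_tokens, in_features) calibration inputs to this layer.
    """
    act_norm = np.linalg.norm(X, axis=0)            # ||x_j||_2 for each input feature j
    score = np.abs(W) * act_norm[None, :]
    n_prune = int(W.shape[1] * sparsity)
    # remove the lowest-scoring weights within each output row
    drop = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones(W.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
X = rng.standard_normal((32, 8))
X[:, 0] *= 100.0                                    # one input feature with large activations
W_sparse = wanda_prune(W, X, sparsity=0.5)
print((W_sparse != 0).sum(axis=1))                  # non-zeros remaining per output row
```

Scaling column 0 of `X` makes weights attached to that feature much harder to prune, even when their magnitudes are small, which is the behaviour plain magnitude pruning lacks.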

Structured pruning of LLMs is harder. Removing whole heads, FFN columns, or layers tends to lose accuracy quickly because each transformer block is already efficient. Sheared LLaMA (Xia et al., 2023) and LLM-Pruner (Ma et al., 2023) succeed by combining structured pruning with a post-pruning recovery phase: substantial continued pretraining in Sheared LLaMA, and a lighter LoRA-based fine-tuning stage in LLM-Pruner. In both cases the surviving structure serves as the initialisation for a smaller dense model that is then trained to recover the dense model's behaviour.

The practical pipeline for production LLM deployment now typically chains knowledge distillation (large dense teacher → smaller dense student), pruning (student → sparse student), and quantisation (sparse student → 4-bit), achieving 8–20× compression with under one perplexity point of degradation.
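To see how the stages of such a pipeline multiply, the following back-of-the-envelope calculation may help. All inputs are illustrative assumptions (a 16-bit baseline, a hypothetical `distill_factor`, and an optional per-weight overhead for storing the sparsity pattern); it shows one way a figure in the quoted range could arise, not how any particular deployment was measured.

```python
def compression_ratio(distill_factor, sparsity, bits_before=16, bits_after=4, mask_bits=0.0):
    """Storage ratio of the original model to the compressed one.

    distill_factor: teacher parameters / student parameters (1.0 = no distillation).
    sparsity:       fraction of student weights set to zero.
    mask_bits:      assumed overhead per student weight for recording which
                    weights survive (0.0 ignores it, 1.0 models a bitmap).
    """
    bits_per_student_weight = (1.0 - sparsity) * bits_after + mask_bits
    return bits_before * distill_factor / bits_per_student_weight

# 50% sparsity plus 16-bit -> 4-bit quantisation, no distillation: 8.0
print(compression_ratio(distill_factor=1.0, sparsity=0.5))
# The same with a student 2.5x smaller than the teacher: 20.0
print(compression_ratio(distill_factor=2.5, sparsity=0.5))
# Accounting for a 1-bit-per-weight sparsity bitmap lowers the first figure to about 5.3
print(compression_ratio(distill_factor=1.0, sparsity=0.5, mask_bits=1.0))
```

The last line is a reminder that unstructured sparsity is not free to store: the positions of the surviving weights must be encoded somewhere.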

Related terms: Knowledge Distillation, Quantisation, GPTQ, Transformer
