Glossary

Pruning

Pruning removes unnecessary parameters or structures from a trained neural network to reduce its size and computational cost. Unstructured pruning sets individual weights to zero based on magnitude or gradient, producing sparse weight matrices. Structured pruning removes entire neurons, attention heads, channels, or layers, producing models that run faster on standard hardware without requiring sparse-matrix support.
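The two pruning styles can be sketched in a few lines of NumPy on a toy weight matrix; the 50% prune ratio and the matrix shape here are arbitrary illustrations, not recommended settings:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # toy weight matrix

# Unstructured: zero individual weights below a magnitude threshold,
# leaving a sparse matrix of the same shape.
threshold = np.quantile(np.abs(W), 0.5)        # prune the smallest 50%
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)

# Structured: remove entire rows (neurons) ranked by their L2 norm,
# leaving a smaller dense matrix that runs on standard hardware.
norms = np.linalg.norm(W, axis=1)
keep = norms >= np.quantile(norms, 0.5)        # keep the strongest half
W_structured = W[keep]

print(W_unstructured.shape, W_structured.shape)
```

Note the difference in the result: the unstructured variant keeps its original shape and needs sparse-matrix support to realise a speed-up, while the structured variant is simply a smaller dense matrix.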

The simplest pruning heuristic is magnitude-based: remove the weights with the smallest absolute values. This often works surprisingly well because trained networks typically contain many near-zero weights. More sophisticated methods use gradient information or Hessian approximations, as in the Optimal Brain Damage and Optimal Brain Surgeon techniques, to estimate the importance of each weight. Iterative pruning alternates pruning with fine-tuning: remove some weights, continue training to recover accuracy, and repeat. This often removes 90% or more of a network's weights with minimal accuracy loss.
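The iterative prune-and-finetune loop can be sketched as follows; the 20%-per-round schedule and matrix size are illustrative assumptions, and the fine-tuning step is left as a placeholder comment since it requires a real training loop:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))       # toy weight matrix
mask = np.ones_like(W, dtype=bool)  # True = weight survives

# Ten rounds, each pruning 20% of the *remaining* weights,
# reaches roughly 1 - 0.8**10 ≈ 89% overall sparsity.
for _ in range(10):
    surviving = np.abs(W[mask])
    cutoff = np.quantile(surviving, 0.2)   # smallest 20% of survivors
    mask &= np.abs(W) >= cutoff
    W *= mask                              # zero out pruned weights
    # ...fine-tune the surviving weights here to recover accuracy...

sparsity = 1 - mask.mean()
print(f"final sparsity: {sparsity:.2%}")
```

Pruning a fixed fraction of the remaining weights each round, rather than all at once, is what gives the network a chance to recover accuracy between rounds.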

The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) observed that large networks contain small subnetworks ("winning tickets") that, if identified and trained in isolation from the same initialisation, can match the full network's performance. This offers an explanation for why pruning works and has inspired practical algorithms for finding efficient subnetworks. Along with quantisation and distillation, pruning is a core technique for deploying AI models under resource constraints. Combined approaches (prune, then quantise, then distil) can dramatically reduce model footprint while preserving most of the capabilities of the original large model.
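The lottery-ticket procedure of train, prune, and rewind can be sketched as follows; the toy matrix, the stand-in "training" update, and the 90% prune ratio are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
W_init = rng.normal(size=(32, 32))   # record the original initialisation

# (1) "Train" the dense network -- a stand-in for a real training loop.
W_trained = W_init + 0.1 * rng.normal(size=W_init.shape)

# (2) Prune: keep only the largest-magnitude 10% of the *trained* weights.
cutoff = np.quantile(np.abs(W_trained), 0.9)
mask = np.abs(W_trained) >= cutoff

# (3) Rewind: the "winning ticket" is the surviving mask applied to the
#     original initial values, which is then retrained in isolation.
ticket = W_init * mask
```

The key point is step (3): the ticket starts from the original initialisation, not from the trained weights, and the hypothesis is that this sparse subnetwork can be trained to match the dense network.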

Related terms: Quantisation, Knowledge Distillation

Discussed in:

Also defined in: Textbook of AI