The lottery ticket hypothesis was proposed by Frankle and Carbin (2019): a randomly-initialised, dense neural network contains a sparse subnetwork that, when trained from the same initialisation, achieves test accuracy comparable to the original dense network in the same number of training iterations.
Formally, let $f(x; \theta_0 \odot m)$ denote the network with mask $m \in \{0, 1\}^p$ applied to initial parameters $\theta_0$. The hypothesis asserts that there exists a mask $m^*$ with $\|m^*\|_0 \ll p$ such that training $f(x; \theta_0 \odot m^*)$ for $T$ steps yields accuracy on par with training $f(x; \theta_0)$ for $T$ steps.
Iterative magnitude pruning (IMP). The authors' algorithm for finding winning tickets (a code sketch follows the steps):
1. Initialise $\theta_0$ randomly; set $m = \mathbf{1}$.
2. Train $f(x; \theta_0 \odot m)$ for $T$ steps to obtain $\theta_T$.
3. Prune the smallest $k\%$ of surviving weights by magnitude, updating $m$ accordingly.
4. Reset the surviving weights to their values in $\theta_0$.
5. Repeat from step 2 until the desired sparsity is reached.
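A minimal sketch of this loop in PyTorch is shown below. The toy model, random data, and hyper-parameters are illustrative assumptions, not the paper's experimental setup; only the train, prune, reset structure mirrors the algorithm above.

```python
# Sketch of iterative magnitude pruning (IMP): train, prune by magnitude, reset to theta_0.
# Model, data, and hyper-parameters are illustrative, not Frankle & Carbin's setup.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
theta0 = copy.deepcopy(model.state_dict())                  # keep the initialisation
masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
         if p.dim() > 1}                                    # prune weight matrices only

x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))   # toy data
loss_fn = nn.CrossEntropyLoss()

def train(model, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        # keep pruned weights at zero by zeroing their gradients
        for n, p in model.named_parameters():
            if n in masks:
                p.grad *= masks[n]
        opt.step()

prune_fraction = 0.2              # remove 20% of surviving weights per round
for _round in range(5):           # repeat until the desired sparsity
    train(model)                  # step 2: train the masked network
    # step 3: prune the smallest surviving weights by magnitude (global threshold)
    surviving = torch.cat([p[masks[n].bool()].abs().flatten()
                           for n, p in model.named_parameters() if n in masks])
    threshold = surviving.kthvalue(int(prune_fraction * surviving.numel())).values
    for n, p in model.named_parameters():
        if n in masks:
            masks[n] *= (p.abs() > threshold).float()
    # step 4: reset surviving weights to their values in theta_0
    model.load_state_dict(theta0)
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p *= masks[n]
```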
The crucial step is the reset to $\theta_0$: pruning + reset reveals the ticket; pruning + random re-init does not. This implies the initialisation of the surviving weights, not just the mask structure, is essential.
Empirical findings.
- On MNIST, CIFAR-10 and small ImageNet variants, IMP finds winning tickets at 10-20% of original size that match dense accuracy.
- Linear mode connectivity (Frankle et al., 2020): for larger networks such as ResNet-50, winning tickets only emerge after a brief warm-up of $k$ steps; resetting to $\theta_k$ rather than $\theta_0$ ("late rewinding") yields the strongest results. Linear mode connectivity itself is formalised just after this list.
- Strong lottery tickets (Ramanujan et al., 2020): an even bolder claim: sufficiently overparameterised networks contain subnetworks that match dense performance at initialisation, with no training of the surviving weights. Ramanujan et al. demonstrated this empirically; proofs for ReLU networks followed (see the theoretical results below).
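As a rough formalisation of the linear mode connectivity criterion (notation here is mine, not the paper's): take two copies $\theta^{(1)}_T$ and $\theta^{(2)}_T$ trained from the same rewound point $\theta_k$ under different SGD noise (data order, augmentation), and measure the error barrier along the straight line between them,

$$
\mathrm{barrier}\big(\theta^{(1)}_T, \theta^{(2)}_T\big) \;=\; \max_{\alpha \in [0,1]} \mathcal{E}\big(\alpha\,\theta^{(1)}_T + (1-\alpha)\,\theta^{(2)}_T\big) \;-\; \tfrac{1}{2}\Big(\mathcal{E}\big(\theta^{(1)}_T\big) + \mathcal{E}\big(\theta^{(2)}_T\big)\Big),
$$

where $\mathcal{E}$ denotes test error. The network is linearly mode connected (stable) at step $k$ when this barrier is approximately zero, and IMP with rewinding to $\theta_k$ succeeds in roughly the regime where stability holds.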
Theoretical results. Malach et al. (2020) proved a version of the strong hypothesis: any target network of width $n$ can be approximated to arbitrary accuracy by a subnetwork of a randomly-initialised network whose width is polynomial in $n$; pruning alone is enough, with no training. Pensia et al. (2020) tightened the polynomial overhead to logarithmic.
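Stated schematically, with the width bound left abstract rather than quoting the papers' exact constants: for any target network $f(\cdot;\theta)$ and tolerance $\epsilon > 0$, a sufficiently wide randomly-initialised network $g(\cdot;\theta_0)$ admits a mask $m$ over its weights such that

$$
\sup_{x \in \mathcal{X}} \big\| f(x; \theta) - g(x; \theta_0 \odot m) \big\| \;\le\; \epsilon,
$$

where $\mathcal{X}$ is a bounded input domain; the required width is polynomial in the target's size in Malach et al. (2020) and only logarithmically larger in Pensia et al. (2020).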
Implications.
- Trainability depends on initialisation. Sparse networks pruned at random, or winning tickets given a fresh random re-initialisation, typically train markedly worse; the lottery ticket finding shows that specific combinations of structure and initial weights are unusually trainable.
- Most parameters are redundant. Modern networks contain orders-of-magnitude more parameters than needed for the function actually learned, but only a small fraction are critical and they cannot be identified without training.
- Pruning research direction. Modern compression methods (early-bird tickets, supermasks, edge-popup) build on lottery ticket ideas to compress networks during or before training.
- Theoretical puzzle. Why does the winning ticket need its specific initial values? Linear mode connectivity suggests the answer involves loss-landscape geometry: tickets identified after warm-up lie in the same linearly connected basin of the loss landscape as the dense solution.
The lottery ticket hypothesis reshaped thinking about overparameterisation, suggesting it functions less as direct capacity and more as a search space within which sparse, trainable structures can be discovered.
Related terms: Implicit Regularisation, Regularisation, Double Descent, Gradient Descent
Discussed in:
- Chapter 6: ML Fundamentals, Theory of Deep Learning