9.12 Regularisation

A neural network with a million parameters and only ten thousand training examples is a memorising machine waiting to happen. Given enough capacity, gradient descent will find a parameter setting that drives the training loss almost to zero, every example fitted, every quirk absorbed, every stray pixel learnt by heart. The problem is that "fitting the training set" and "understanding the underlying regularity" are not the same thing. A model that has merely memorised will perform brilliantly on the data it has seen and woefully on anything new. This phenomenon is overfitting, and regularisation is the family of techniques that pushes back against it. The intuition is the same one any teacher recognises: there is a difference between knowing arithmetic and knowing the multiplication tables by rote. The first generalises to numbers you have never multiplied before; the second collapses the moment you step outside the test the pupil has memorised.

Regularisation sits between two neighbouring topics in this chapter. Section 9.9 chose the loss function, the quantity the optimiser tries to minimise. Section 9.12 (this section) modifies that quantity by adding penalty terms or imposing other constraints, all designed to discourage the network from solutions that fit the seen examples too snugly. Section 9.13 covers normalisation, which is a related but distinct family: rather than penalising parameters, normalisation rescales activations layer-by-layer to keep training stable. The two families often work together, but they answer different questions.

Symbols Used Here

  • $\mathcal{L}_{\text{data}}$ : data loss (e.g. MSE or cross-entropy)
  • $\mathcal{L}_{\text{total}}$ : total loss (data loss + regularisation)
  • $\theta$ : model parameters
  • $\mathbf{w}$ : vector of all weights (biases are treated separately by convention)
  • $\lambda$ : regularisation strength (a non-negative real hyperparameter)
  • $p$ : dropout rate, a probability in $[0, 1)$
  • $\|\mathbf{w}\|_p$ : $L^p$ norm of the weight vector
  • $\mathbf{m}$ : dropout mask (binary vector, multiplied element-wise with activations)
  • $\varepsilon$ : label-smoothing parameter

The bias–variance tradeoff in plain words

Two distinct kinds of error eat away at any predictive model, and the deepest lesson in machine learning is that you usually have to trade them off against each other. Imagine drawing a different training set from the same population a hundred times, fitting your model to each set, and looking at the family of fitted models that results.

Bias is the error that comes from the model class itself being wrong. If the true relationship between $x$ and $y$ is a smooth quadratic curve and you insist on fitting it with a straight line, no quantity of data will help. Every fitted line will, on average, miss the same way. The model class is too rigid to express the truth. Bias is "how far off, on average, is the centre of the family of fitted models from the truth?"

Variance is the error that comes from the model class being too flexible relative to the data. A model with a million parameters fitted to a hundred examples will twist itself through every example perfectly, but it will twist in a different direction for every fresh draw of training data. Each fitted model is precise; the family of fitted models is wildly inconsistent. Variance is "how much do the fitted models disagree with each other across different training sets?"

There is a third source of error, noise, that is irreducible. Even the perfect model cannot predict a measurement that contains genuine random fluctuation. The famous decomposition (informal but useful) says that

$$\text{expected test error} \approx \text{bias}^2 + \text{variance} + \text{noise}.$$

A linear model fitting a quadratic curve has high bias and low variance: it underfits. A million-parameter network fitting a hundred examples has low bias and astronomical variance: it overfits. The art is to find the model class whose bias and variance jointly minimise the total. As model capacity grows, bias falls and variance rises, and the curve of total error against capacity is a U-shape with a sweet spot in the middle. (The deep-learning era has complicated this picture: see "double descent" in §9.20 for the surprise that, with enough capacity, the U-shape continues into a second descent. But for ordinary practical work the U-shape is the right mental model.)

Regularisation acts on the variance side of this tradeoff. It deliberately restricts what solutions the optimiser can find, accepting a slight increase in bias as the price of a much larger reduction in variance. The hope is that the total error falls.
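
The decomposition can be watched happening. Below is a minimal NumPy sketch, not taken from any library: it fits polynomials of three degrees to many fresh draws of a small noisy dataset whose truth is quadratic, then estimates bias² and variance at a grid of test points. The constants (200 draws, 20 training points, noise 0.5) are illustrative choices, not recommendations.

```python
import numpy as np

# Estimating bias^2 and variance empirically: refit polynomials of three
# degrees to many fresh draws of a small noisy dataset whose truth is
# quadratic, then measure the family of fits at a grid of test points.
rng = np.random.default_rng(0)
true_f = lambda x: 1.0 - 2.0 * x + 3.0 * x ** 2
x_test = np.linspace(-1.0, 1.0, 50)
n_draws, n_train, noise_sd = 200, 20, 0.5

for degree in (1, 2, 9):
    preds = np.empty((n_draws, x_test.size))
    for d in range(n_draws):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        preds[d] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

Degree 1 (a straight line through a quadratic) shows high bias and low variance, degree 9 the reverse, and degree 2 comes close to minimising the sum: the U-shape in miniature.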

L2 regularisation (weight decay)

The classic regulariser adds a quadratic penalty on the weights. Define the total loss as

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \cdot \tfrac{1}{2} \|\mathbf{w}\|_2^2 = \mathcal{L}_{\text{data}}(\theta) + \tfrac{\lambda}{2} \sum_i w_i^2.$$

Here $\lambda \geq 0$ is the regularisation strength, a hyperparameter you tune on a validation set. The factor of one half is conventional; it cancels the 2 that appears when you differentiate $w_i^2$. Biases are usually exempt from the penalty because they shift outputs uniformly and shrinking them serves no useful invariance.

The gradient picks up a term proportional to the weight itself:

$$\frac{\partial \mathcal{L}_{\text{total}}}{\partial w_i} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + \lambda w_i.$$

Plug this into the SGD update $w_i \leftarrow w_i - \eta \, \partial \mathcal{L}_{\text{total}} / \partial w_i$ and rearrange:

$$w_i \leftarrow w_i - \eta \left( \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + \lambda w_i \right) = (1 - \eta \lambda) \, w_i - \eta \, \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i}.$$

The first factor is the weight decay: every step, before the data-driven update is applied, every weight is multiplied by the constant shrink factor $(1 - \eta \lambda)$. This is why L2 regularisation and weight decay are usually used as synonyms (with one important caveat about adaptive optimisers, below).

A worked numerical example fixes the idea. Suppose $\eta = 0.01$ and $\lambda = 0.001$, so the per-step shrink factor is $1 - \eta\lambda = 1 - 10^{-5} = 0.99999$. After one step a weight of $1.0$ becomes $0.99999$, a tiny change. After $10000$ steps it becomes $0.99999^{10000} \approx 0.905$. After $69315$ steps it would have halved (since $\ln 2 / 10^{-5} \approx 69315$). The decay is gentle but relentless: in the absence of any data signal pushing the weight outwards, every weight drifts towards zero.
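
A few lines of Python confirm the arithmetic:

```python
import numpy as np

# Verifying the worked example: eta = 0.01 and lambda = 0.001 give a
# per-step shrink factor of 1 - eta * lam = 0.99999.
eta, lam = 0.01, 0.001
shrink = 1.0 - eta * lam

print(shrink ** 10_000)          # ~0.905: gentle but relentless
print(np.log(2) / (eta * lam))   # ~69315 steps for a weight to halve
```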

A Bayesian reading makes the prior explicit. Place a Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \sigma_w^2 I)$ on the weights. The negative log posterior is the negative log likelihood plus the negative log prior; the latter is $\|\mathbf{w}\|_2^2 / (2 \sigma_w^2)$ plus a constant. Comparing with the L2 penalty $(\lambda / 2) \|\mathbf{w}\|_2^2$, we read off $\lambda = 1 / \sigma_w^2$. Strong regularisation corresponds to a tight prior, a strong belief that weights should be small.

Typical values of $\lambda$ in modern deep learning span $10^{-4}$ to $10^{-2}$. Smaller models and bigger datasets need less regularisation; larger models and smaller datasets need more. There is no universal optimum: tune it.

A subtlety arises with adaptive optimisers. Adam normalises each gradient component by an estimate of its second moment, $\hat{v}_t$. If you fold the L2 term into the gradient before this normalisation, parameters with large running second moments receive less effective decay, the opposite of what you usually want. AdamW (Loshchilov and Hutter, 2019) fixes this by applying the decay step $\mathbf{w} \leftarrow \mathbf{w} - \eta \lambda \mathbf{w}$ separately from the gradient update. AdamW is now the standard for training Transformers, and the distinction between "L2 regularisation" and "decoupled weight decay" matters whenever you switch from plain SGD to an adaptive optimiser.
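
The difference is easy to demonstrate on a single parameter. The following is a simplified sketch, not the full Adam algorithm as any framework ships it; the large stand-in gradient is chosen to make the effect visible.

```python
import numpy as np

# Contrasting L2-in-the-gradient with decoupled decay (AdamW-style) for a
# single parameter, using a simplified Adam step with the usual defaults.
def adam_step(w, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                  # bias-corrected second moment
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

w0, lam, eta, g = 1.0, 0.01, 1e-3, 10.0        # g = stand-in data gradient

# (a) L2 folded into the gradient: the lam*w term is divided by sqrt(v_hat),
#     so a parameter with large gradients receives almost no effective decay.
w_l2, _, _ = adam_step(w0, g + lam * w0, 0.0, 0.0, t=1)

# (b) Decoupled (AdamW): the decay is applied outside the normalisation and
#     keeps its full strength eta * lam regardless of the gradient history.
w_dec, _, _ = adam_step(w0, g, 0.0, 0.0, t=1)
w_dec -= eta * lam * w0

print(w_l2, w_dec)   # the decoupled version has decayed more
```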

L1 regularisation

The L1 regulariser uses the sum of absolute values rather than squares:

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \|\mathbf{w}\|_1 = \mathcal{L}_{\text{data}}(\theta) + \lambda \sum_i |w_i|.$$

Differentiating $|w_i|$ gives $\mathrm{sign}(w_i)$ (with the gradient at zero conventionally set to zero or treated by a subgradient), so the gradient contribution is

$$\frac{\partial}{\partial w_i} \big( \lambda |w_i| \big) = \lambda \, \mathrm{sign}(w_i).$$

The SGD update therefore subtracts a constant magnitude from each weight at every step, with sign matching the weight's own:

$$w_i \leftarrow w_i - \eta \lambda \, \mathrm{sign}(w_i) - \eta \, \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i}.$$

Compare with L2: there, the shrink was proportional to the weight, so a weight of $0.001$ shrank by a thousandth of what a weight of $1.0$ shrank. With L1, both shrink by the same fixed amount each step. The consequence is that small weights are driven to exactly zero rather than merely close to zero. The optimum for L1 lives at a corner of the diamond $\|\mathbf{w}\|_1 = c$, and corners are sparse.

A worked example: take $\eta = 0.01$ and $\lambda = 0.001$, so each step subtracts $\eta \lambda = 10^{-5}$ from $|w_i|$ (whatever its sign). A weight starting at $5 \times 10^{-4}$ that receives no contribution from the data gradient will reach zero in $5 \times 10^{-4} / 10^{-5} = 50$ steps. A weight starting at $0.5$ would take fifty thousand steps to reach zero, long enough that the data signal will probably keep it alive. The result is that weakly informative weights are pruned and only weights with persistent gradient pressure survive. The fitted model has many exact zeros: it is sparse.
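
In code, the constant-magnitude shrink looks like the NumPy sketch below (data gradient omitted). The clamp at zero is the standard soft-thresholding refinement, which stops the penalty from flipping a weight's sign on the step that crosses zero.

```python
import numpy as np

# The constant-magnitude L1 shrink, with the data gradient omitted. The
# clamp at zero (soft-thresholding) keeps the penalty from flipping signs.
eta, lam = 0.01, 0.001
step = eta * lam                    # 1e-5 subtracted from |w_i| per step

w = np.array([5e-4, 0.5, -3e-4])
for _ in range(100):
    w = np.sign(w) * np.maximum(np.abs(w) - step, 0.0)

print(w)  # the small weights are exactly zero; 0.5 has barely moved
```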

The Bayesian interpretation: an L1 penalty corresponds to a Laplace prior on the weights, $p(w_i) \propto \exp(-|w_i|/b)$, which has a sharp peak at zero and so encourages exact-zero solutions. In the linear-model world this is famous as the LASSO (Tibshirani, 1996), where it is the workhorse for variable selection. In deep learning L1 is much less common than L2: the goal of training a deep network is rarely to identify which inputs matter (the answer is "nearly all of them, in some combination") and the optimisation behaves less smoothly. L1 reappears in pruning recipes and in some mixed L1+L2 formulations (Elastic Net), and it remains a useful tool when sparsity is genuinely the goal.

Dropout

Srivastava et al. (2014) proposed dropout: at training time, each unit's activation is multiplied by an independent Bernoulli random variable, so each unit is "dropped" with probability $p$ and kept with probability $1-p$. Concretely, write the layer's pre-scaled output vector as $\mathbf{a}$ and let $\mathbf{m}$ be a binary mask vector of the same shape with $\mathbb{P}(m_i = 1) = 1-p$. The dropped activation is then $\mathbf{a} \odot \mathbf{m}$ where $\odot$ is element-wise multiplication.

There is a small bookkeeping detail. If you simply zero out a fraction $p$ of the activations, then on average the layer outputs are scaled by $(1-p)$ relative to test time, and the network must compensate. Two equivalent conventions handle this. In the original convention, training-time activations are not rescaled, but at test time every activation is multiplied by $(1-p)$ so the expected magnitudes match. In the more common modern convention, inverted dropout, training-time activations after the mask are divided by $(1-p)$, and test time is left untouched. Most deep-learning frameworks implement the second form; the test-time forward pass becomes the deterministic identity.

Why does dropout work? Two intuitions help. First, dropout breaks co-adaptation: a unit cannot rely on any single neighbour, since that neighbour will be missing on a fraction $p$ of the training steps. Each unit must therefore learn features that are useful on their own, alongside many different combinations of partners. The result is a more robust internal representation. Second, dropout trains an exponentially large ensemble of subnetworks (one for every possible mask) with shared weights, and the test-time deterministic pass approximates an average of all those subnetworks. Ensembling reduces variance; that is the whole point.

A worked numerical example. Suppose a layer outputs the pre-scaled vector $\mathbf{a} = (0.7, 0.4, 0.9, 0.2)$ and we use $p = 0.5$. A draw of the mask might give $\mathbf{m} = (1, 0, 0, 1)$. The masked activation is $(0.7, 0, 0, 0.2)$. Inverted dropout then divides by $1-p = 0.5$, giving $(1.4, 0, 0, 0.4)$. The expected value of each entry, averaging over many mask draws, is the original $(0.7, 0.4, 0.9, 0.2)$: the rescaling preserves the mean.
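
A minimal inverted-dropout function in NumPy, replaying the example; averaging many draws confirms that the rescaling preserves the expected activation.

```python
import numpy as np

# Inverted dropout: mask with keep-probability 1 - p, then divide by 1 - p
# so the expected activation is unchanged. At test time the layer is the
# identity function.
def dropout(a, p, rng, training=True):
    if not training or p == 0.0:
        return a
    keep = 1.0 - p
    mask = rng.random(a.shape) < keep       # each entry kept with prob 1 - p
    return a * mask / keep

rng = np.random.default_rng(0)
a = np.array([0.7, 0.4, 0.9, 0.2])
print(dropout(a, p=0.5, rng=rng))           # one noisy draw
print(np.mean([dropout(a, 0.5, rng) for _ in range(100_000)], axis=0))
# the average over many draws recovers ~(0.7, 0.4, 0.9, 0.2)
```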

The Bayesian interpretation (Gal and Ghahramani, 2016) is illuminating: dropout in a deep network with appropriate weight decay corresponds to variational inference in a deep Gaussian process. Running the trained network with dropout still active at test time, sampling many forward passes, and averaging the outputs gives Monte Carlo dropout, a cheap predictive-uncertainty estimate.
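
A toy sketch of the recipe, with random placeholder weights rather than a trained network; the spread of the sampled outputs serves as the uncertainty estimate.

```python
import numpy as np

# Monte Carlo dropout on a toy two-layer network. The weights here are
# random placeholders, not a trained model; the point is the recipe:
# keep dropout active, sample several forward passes, read the spread.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))

def stochastic_forward(x, p=0.2):
    h = np.maximum(W1 @ x, 0.0)                        # ReLU hidden layer
    h = h * (rng.random(h.shape) < 1 - p) / (1 - p)    # dropout stays ON
    return (W2 @ h).item()

x = np.array([0.5, -0.1, 0.3, 0.9])
samples = np.array([stochastic_forward(x) for _ in range(100)])
print(samples.mean())   # the prediction
print(samples.std())    # a rough predictive-uncertainty estimate
```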

When to use dropout. Dropout is a strong regulariser for fully-connected layers and was the workhorse of pre-2017 image and speech models. In modern Transformers it is applied at three locations: after the attention softmax, after the feed-forward block, and on the residual sums, typically at rate 0.1. Very large language models often set dropout to zero, on the theory that the implicit regularisation of a billion training tokens is regularisation enough. Dropout interacts badly with batch normalisation (the noise injected by dropout is double-counted by batch norm's statistics), and the modern best practice is to use one or the other, not both, in any given block. Convolutional layers are mostly regularised by other means (data augmentation, weight decay); dropout is rarer in CNN feature extractors but common on the final classifier head.

Early stopping

The cheapest form of regularisation that actually works is to stop training before the model has time to memorise. Plot two curves on the same axes as you train: the training loss and the validation loss, evaluated at every epoch (or every few hundred steps). The training loss falls more or less monotonically; the optimiser is doing its job. The validation loss falls at first, alongside the training loss, but eventually turns around and starts to rise even as the training loss keeps falling. The U-shaped trough in the validation curve marks the moment the model began to fit noise rather than signal.

Early stopping simply terminates training near the bottom of that trough. In code this means: keep a checkpoint of the weights at the lowest validation loss seen so far, and stop training when the validation loss has failed to improve for some patience window, typically 3 to 10 evaluations. At the end, restore the best-checkpoint weights rather than the final-step weights.

A worked example, sketched in prose. Suppose validation loss across epochs reads 1.50, 1.20, 0.95, 0.78, 0.66, 0.59, 0.55, 0.54, 0.555, 0.561, 0.572. The minimum is 0.54 at epoch 8. With patience = 3, training stops after epoch 11 (three consecutive epochs of no improvement) and reverts to the epoch-8 checkpoint.
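
The same example in code, with the checkpointing reduced to a comment:

```python
# Replaying the worked example: validation losses per epoch, patience = 3.
val_losses = [1.50, 1.20, 0.95, 0.78, 0.66, 0.59, 0.55, 0.54,
              0.555, 0.561, 0.572]

best_loss, best_epoch, bad = float("inf"), None, 0
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:                 # improvement: reset the counter
        best_loss, best_epoch, bad = loss, epoch, 0
        # in a real loop: best_weights = copy.deepcopy(model_state)
    else:
        bad += 1
        if bad >= 3:                     # patience exhausted
            break

print(f"stopped after epoch {epoch}; best loss {best_loss} at epoch {best_epoch}")
# stopped after epoch 11; best loss 0.54 at epoch 8
```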

Why is this a regulariser? Because the longer you train, the further the parameters travel from their (small, near-zero) initial values, and so the more capacity the optimiser has effectively used. Stopping early limits the effective travel and so limits the effective complexity of the function. There is even a formal connection: for linear models with squared loss and SGD initialised at zero, early stopping is equivalent to a form of L2 regularisation, with the equivalent $\lambda$ a decreasing function of the number of steps. The intuition extends qualitatively to the nonlinear case.

Early stopping is essentially free: it requires only a held-out validation set you would have wanted anyway. It composes cleanly with every other regularisation technique. The only downside is that it ties the regularisation strength to the optimisation schedule, which makes hyperparameter sweeps slightly noisier.

Data augmentation

If you cannot afford more training data, you can manufacture variations of the data you already have. Data augmentation applies random transformations to the training inputs that should not change the label, and trains on the resulting flood of synthetic variants. The model is told: "a dog photographed from the left and a dog photographed from the right are both still a dog; please learn an internal representation that recognises this."

For images, the standard menu includes random horizontal flips, random crops with rescaling, small rotations, colour jitter (changes to brightness, contrast, saturation, hue), and Gaussian noise. ImageNet training without augmentation will overfit somewhere around 50 epochs; with random crops and flips it can train for 100 or more epochs without the validation curve turning up. Modern recipes layer many augmentations together: AutoAugment (Cubuk et al., 2019) and RandAugment (Cubuk et al., 2020) automate the choice of policy. MixUp (Zhang et al., 2018) trains on convex combinations of two inputs and their labels; CutMix (Yun et al., 2019) replaces a rectangular region of one image with another and weights the labels by the area ratio. Both encourage smoother decision boundaries between classes.
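
As a concrete illustration, here is a minimal MixUp sketch in NumPy; $\alpha = 0.2$ is a common choice for the Beta distribution's concentration parameter, and the toy arrays stand in for images and one-hot labels.

```python
import numpy as np

# A minimal MixUp sketch: blend two examples and their one-hot labels with
# a Beta-distributed coefficient.
rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = rng.beta(alpha, alpha)     # usually near 0 or 1 for small alpha
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, x2 = rng.random((4, 4)), rng.random((4, 4))   # two toy "images"
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
print(y_mix)   # a soft label: mostly one class with a trace of the other
```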

For text, augmentation is trickier because most transformations risk changing the meaning. Synonym replacement, random masking (the foundation of BERT pretraining), and back-translation (translate to French and back) are the standard tools, with span masking dominant in modern pretraining. For audio, SpecAugment masks bands of frequency and time in the spectrogram, plus pitch shift and time warp. For tabular data, augmentation is hardest of all: small Gaussian noise on continuous features and MixUp-style interpolation are about the only reliable choices.

The Bayesian reading is that augmentation injects a prior on invariances. By telling the model that flipping an image left-to-right does not change the label, you are encoding the prior belief that the underlying class-membership function is invariant to that transformation. The transformation budget becomes part of the model. Augmentation is one of the most cost-effective regularisers known, especially when augmentations matched to the domain are available.

Label smoothing

A one-hot target tells the network to assign all of the probability mass to one class and zero to every other. This is rarely realistic: even a perfectly correct model should retain some uncertainty about whether the image really is a Persian cat versus a Maine Coon. Label smoothing (Szegedy et al., 2016) replaces the one-hot vector $\mathbf{y}$ with a softened distribution

$$\mathbf{y}^\varepsilon = (1-\varepsilon) \mathbf{y} + \frac{\varepsilon}{K} \mathbf{1},$$

where $K$ is the number of classes and $\mathbf{1}$ is the all-ones vector. With $\varepsilon = 0.1$ and $K = 3$, the original target $(1, 0, 0)$ becomes $(1 - 0.1 + 0.1/3, \, 0.1/3, \, 0.1/3) \approx (0.933, 0.033, 0.033)$.
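
The formula really is one line of code:

```python
import numpy as np

# The smoothing formula above, applied to the worked example.
def smooth_labels(y_onehot, eps=0.1):
    K = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / K

print(smooth_labels(np.array([1.0, 0.0, 0.0])))  # [0.9333 0.0333 0.0333]
```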

The effect on training is that the cross-entropy loss now penalises overconfident predictions. With a one-hot target the loss only reaches zero in the limit as the predicted probability for the true class approaches one, which requires the corresponding logit to grow without bound. With smoothed targets the optimal logit configuration places a finite gap between the true class and the others, the gap determined by $\varepsilon$. Logits stay bounded, the network stays calibrated, and accuracy often nudges up by a fraction of a percent.

Label smoothing is standard in ImageNet classification (Inception v3 introduced it), in machine translation, and in every Transformer trained with cross-entropy on a multiclass target. The hyperparameter $\varepsilon$ is almost always set to 0.1 and almost never tuned. The cost is one line of code at the loss layer.

Stochastic weight averaging and EMA

After training, the trajectory of the optimiser through parameter space rarely settles on a single point; it oscillates around a region. Different points along that trajectory are distinct fitted models, each slightly different from its neighbours. Averaging them often produces a model that generalises better than any single point.

Stochastic weight averaging (SWA, Izmailov et al., 2018) makes this concrete. After an initial training period at a normal learning rate, switch to a constant or cyclic learning rate and collect $K$ checkpoints. Take the simple mean of those checkpoints' weights (one mean per parameter) and use the averaged weights at inference. SWA tends to find flatter minima than the optimiser's final point, and flat minima generalise better because small perturbations to the input or the parameters change the output less.

Exponential moving average (EMA) maintains a running average of the weights during training rather than after. Maintain a shadow copy $\bar\theta$ initialised to the starting weights and update at every step:

$$\bar\theta_t = \alpha \, \bar\theta_{t-1} + (1 - \alpha) \, \theta_t,$$

with $\alpha$ typically around $0.999$. With $\alpha = 0.999$ the EMA has an effective averaging window of about $1 / (1 - \alpha) = 1000$ steps. At inference, swap the live weights for the EMA shadow weights. EMA is standard practice in modern image generation (every diffusion model uses it), self-supervised pretraining (the teacher network in BYOL, DINO, MoCo is an EMA of the student), and large-language-model post-training. The cost is one extra parameter copy in memory and one cheap update per step.
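
A minimal sketch of the shadow-copy bookkeeping, with a toy noisy trajectory standing in for real optimiser steps:

```python
import numpy as np

# EMA of the weights: one shadow copy, one cheap update per step.
# The random walk below stands in for a real optimiser trajectory.
rng = np.random.default_rng(0)
alpha = 0.999

theta = np.zeros(3)                        # live weights
ema = theta.copy()                         # shadow, initialised to the start
for t in range(5000):
    theta = theta + rng.normal(0.0, 0.01, size=3)   # stand-in "SGD step"
    ema = alpha * ema + (1 - alpha) * theta

print(theta)  # the noisy live weights
print(ema)    # the smoothed shadow you would swap in at inference
```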

Architectural regularisation

Some choices of architecture act as regularisers without any explicit penalty term. Skip connections (He et al., 2016) make the identity easy to represent, biasing the network towards small per-layer modifications of its input, a strong inductive bias against overfitting. Stochastic depth (Huang et al., 2016) drops entire residual blocks at random during training, a structured form of dropout that reduces the expected depth of the network and accelerates training. DropPath applies the same idea to the parallel branches of multi-branch architectures.
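
As a sketch of the mechanism on a single residual block, the following uses the train-time rescaling convention common in modern implementations (the original stochastic-depth paper instead scales the branch at test time by its survival probability):

```python
import numpy as np

# Stochastic depth / DropPath on one residual block: during training the
# whole residual branch is dropped with probability p; when kept it is
# rescaled by 1/(1 - p) so the expected output matches test time.
rng = np.random.default_rng(0)

def residual_block(x, branch, p=0.2, training=True):
    if training and rng.random() < p:
        return x                           # block reduces to the identity
    scale = 1.0 / (1.0 - p) if training else 1.0
    return x + scale * branch(x)

x = np.ones(4)
print(residual_block(x, branch=lambda h: 0.1 * h))
# with this seed the branch is kept: 1 + 0.1/0.8 = 1.125 per entry
```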

Width matters too. A wider layer with stronger regularisation often outperforms a narrower layer with weaker regularisation, because wider networks have a smoother loss landscape. Mixture-of-Experts gates partition the parameter space across many experts and route each token to only a few of them, which yields enormous capacity while sharply restricting effective per-example sharing, an implicit regulariser. Sparsely-activated networks more generally fit fewer parameters per example than their parameter count would suggest. None of these architectural choices look like the L2 penalty mathematically, but they all reduce variance and so belong in the regularisation toolbox.

Choosing what to combine

A practical recipe for a fresh problem.

  • Start with L2 weight decay at $\lambda = 10^{-4}$. If you are using Adam, switch to AdamW so the decay is decoupled from the gradient normalisation.
  • Add the strongest data augmentation that is appropriate to the domain. For images, RandAugment plus random crops and flips. For text, span masking. For audio, SpecAugment. Augmentation is often the largest single contributor to generalisation.
  • Add label smoothing with $\varepsilon = 0.1$ for any classification task.
  • Hold out a validation set and use early stopping with patience 3 to 10 evaluations.
  • For small datasets (say, fewer than ten thousand examples per class) add dropout at rate 0.1 to 0.5 in fully-connected layers.
  • For models you intend to deploy, consider an EMA of the weights; it costs almost nothing and tends to help.
  • Tune $\lambda$ and the dropout rate on the validation set. A factor-of-ten grid search is usually enough.

These techniques are largely complementary: dropout, weight decay, augmentation, and early stopping all reduce variance through different mechanisms, and stacking them rarely hurts as long as each is at a reasonable strength. The only common interaction worth flagging is dropout + batch norm, which interact poorly enough that using one of the two is usually preferable to using both.

What you should take away

  1. Overfitting is the failure mode where a model memorises its training set and fails on new data. Regularisation is the family of techniques that combats it by reducing variance at a small cost in bias.
  2. L2 regularisation (weight decay) adds $(\lambda / 2) \|\mathbf{w}\|_2^2$ to the loss; the SGD update becomes $w \leftarrow (1 - \eta\lambda) w - \eta \, \partial \mathcal{L}_{\text{data}} / \partial w$, multiplicatively shrinking every weight at every step.
  3. L1 regularisation produces sparse solutions because each weight receives a constant-magnitude shrink per step rather than a proportional one; small weights are driven exactly to zero. It is the basis of LASSO and is rarer in deep learning than L2.
  4. Dropout zeroes activations at random during training and rescales by $1/(1-p)$, training an implicit ensemble of subnetworks with shared weights. Use 0.1 in modern Transformers, 0.5 in older fully-connected nets, zero with very large data.
  5. Early stopping, data augmentation, label smoothing ($\varepsilon = 0.1$), and weight averaging (SWA / EMA) are all near-free additions to any training pipeline; combine them rather than choosing between them.
  6. The practical default recipe is AdamW at $\lambda = 10^{-4}$, strong domain-appropriate data augmentation, label smoothing $\varepsilon = 0.1$, early stopping on a validation set, and an EMA of the weights at inference time.
