Regularisation modifies a learning objective by adding a penalty term that discourages model complexity:
$$\mathcal{L}_\mathrm{reg}(\theta) = \mathcal{L}_\mathrm{data}(\theta) + \lambda R(\theta)$$
The data-fitting loss $\mathcal{L}_\mathrm{data}$ measures how well the model fits the training data; the regulariser $R(\theta)$ penalises some notion of complexity; $\lambda > 0$ trades the two off.
L2 regularisation (ridge / weight decay):
$$R(\theta) = \frac{1}{2} \|\theta\|_2^2$$
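Equivalent to a Gaussian prior $\theta \sim \mathcal{N}(0, \lambda^{-1} I)$ in the Bayesian view: when $\mathcal{L}_\mathrm{data}$ is a negative log-likelihood, the regularised loss is the negative log posterior up to an additive constant, so its minimiser is the MAP estimate. Shrinks all parameters toward zero without setting them exactly to zero. The penalty is differentiable and convex, and linear regression with it (ridge regression) has a closed-form solution. It is the default regulariser for most neural networks.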
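The closed-form solution for a linear model can be sketched as follows. This is a minimal illustration that assumes a squared-error data loss $\frac{1}{2}\|X\theta - y\|_2^2$, for which the minimiser is $(X^\top X + \lambda I)^{-1} X^\top y$; the function name `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise 0.5*||X @ theta - y||^2 + lam * 0.5*||theta||^2 in closed form."""
    n_features = X.shape[1]
    # Normal equations with lam added to the diagonal of X^T X.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

theta_weak = ridge_fit(X, y, lam=1e-3)    # close to ordinary least squares
theta_strong = ridge_fit(X, y, lam=100.0) # shrunk toward zero
```

Increasing `lam` shrinks the norm of the solution, which is the "shrinks all parameters toward zero" behaviour described above.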
L1 regularisation (lasso):
$$R(\theta) = \|\theta\|_1 = \sum_i |\theta_i|$$
Equivalent to an independent Laplace prior with scale $\lambda^{-1}$ on each parameter, $\theta_i \sim \mathrm{Laplace}(0, \lambda^{-1})$. Induces sparsity: many parameters are driven exactly to zero. Useful for feature selection and interpretable models. The penalty is non-differentiable at zero, so it is typically minimised by coordinate descent or proximal methods rather than plain gradient descent. The lasso in linear regression is the canonical example.
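One proximal method for the lasso is proximal gradient descent (ISTA), sketched below under the assumption of a squared-error data loss. The soft-thresholding step is what produces exact zeros. The function names and the fixed iteration count are illustrative choices, not a reference implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimise 0.5*||X @ theta - y||^2 + lam*||theta||_1 by proximal gradient descent."""
    theta = np.zeros(X.shape[1])
    # Step size 1/L, where L = sigma_max(X)^2 is the Lipschitz constant
    # of the gradient of the smooth (squared-error) part.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

Entries whose gradient step lands within `step * lam` of zero are set exactly to zero, which a plain gradient step on a smooth penalty would not do.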
Elastic net: combines the two penalties as $\lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2} \|\theta\|_2^2$. It tends to select correlated features together as a group, which addresses the lasso's instability when features are correlated.
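For intuition, the proximal operator of the elastic-net penalty has a simple closed form: soft-threshold for the L1 part, then rescale for the L2 part. The sketch below assumes the penalty is used inside a proximal gradient method with step size `step`; the function name is hypothetical.

```python
import numpy as np

def elastic_net_prox(v, step, lam1, lam2):
    """Prox of step * (lam1*||x||_1 + 0.5*lam2*||x||_2^2) evaluated at v."""
    # L1 part: soft-thresholding sets small entries exactly to zero.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)
    # L2 part: uniform multiplicative shrinkage of the surviving entries.
    return shrunk / (1.0 + step * lam2)
```

With `lam2 = 0` this reduces to the lasso's soft-thresholding, and with `lam1 = 0` to pure multiplicative shrinkage.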
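Early stopping is implicit regularisation: stop training before the model has fully fit the training data, typically when the validation loss stops improving. For linear models with a quadratic loss trained by gradient descent from zero, it is approximately equivalent to L2 regularisation, with fewer training steps corresponding to a larger $\lambda$.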
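A common way to implement early stopping is to monitor a validation loss and keep the best parameters seen so far. The sketch below does this for gradient descent on a linear model; the `patience` rule, the learning rate, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_steps=5000, patience=20):
    """Gradient descent on mean squared error, stopped when validation loss stalls."""
    theta = np.zeros(X_tr.shape[1])
    best_theta, best_val, best_step, since_best = theta.copy(), np.inf, -1, 0
    for step in range(max_steps):
        grad = X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr)
        theta = theta - lr * grad
        val_loss = np.mean((X_val @ theta - y_val) ** 2)
        if val_loss < best_val:
            # New best: remember these parameters and reset the counter.
            best_theta, best_val, best_step, since_best = theta.copy(), val_loss, step, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # no improvement for `patience` consecutive steps
    return best_theta, best_val, best_step

# Small, noisy training set so that fitting to convergence can overfit.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=20)
X_tr, X_val = rng.normal(size=(30, 20)), rng.normal(size=(200, 20))
y_tr = X_tr @ true_theta + rng.normal(size=30)
y_val = X_val @ true_theta + rng.normal(size=200)

best_theta, best_val, best_step = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
```

Returning the best parameters rather than the final ones means the `patience` extra steps do not degrade the result.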
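Dropout randomly zeros activations during training: each unit is kept independently with probability $p$, and the surviving activations are rescaled so that expected values match between training and inference. It can be interpreted as approximate Bayesian inference (Gal & Ghahramani 2016) and discourages co-adaptation of features. Standard in fully-connected layers; used more sparingly in modern Transformer architectures, especially when training on very large datasets.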
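A minimal sketch of "inverted" dropout follows, where the rescaling by $1/p$ is applied during training so that nothing needs to change at inference. Here `keep_prob` plays the role of $p$ above; note that some libraries (for example PyTorch's `nn.Dropout`) instead parameterise dropout by the probability of zeroing a unit.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: zero each entry with probability 1 - keep_prob, rescale the rest."""
    if not training:
        return x  # identity at inference time
    mask = rng.random(x.shape) < keep_prob  # True where the unit is kept
    return x * mask / keep_prob             # rescale so E[output] == x

rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = dropout(activations, keep_prob=0.8, rng=rng)
```

Each entry of `dropped` is either `0` or `1 / 0.8`, and the mean stays close to 1.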
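Batch normalisation has a mild regularising side effect: each example is normalised with statistics computed from its mini-batch, so its activations depend on which other examples happen to be in the batch, and this injects noise during training.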
Data augmentation is a form of regularisation: training on augmented versions of the data (rotations, crops, mixup, cutout, MixMatch) effectively imposes invariances that the model might otherwise fail to learn, so it is less able to overfit to nuisance variation in the training set.
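As one concrete example, mixup trains on convex combinations of pairs of examples and their labels. The sketch below assumes one-hot labels and a single mixing coefficient per batch drawn from $\mathrm{Beta}(\alpha, \alpha)$; the function name and the batch layout are illustrative assumptions.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each example with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))    # partner index for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

The mixed labels remain valid probability vectors, so the usual cross-entropy loss can be applied unchanged.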
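Label smoothing: replace each one-hot target $y$ with $(1 - \epsilon)\, y + \epsilon / K$, where $K$ is the number of classes, so that a fraction $\epsilon$ of the probability mass is spread uniformly over all classes. This discourages over-confident predictions; it is common in image classification and in sequence models such as machine translation.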
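The smoothed targets can be built directly from integer class labels, as in this small sketch (the function name is hypothetical):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps):
    """Return (1 - eps) * one_hot + eps / K for integer class labels."""
    onehot = np.eye(num_classes)[labels]
    return (1.0 - eps) * onehot + eps / num_classes

targets = smooth_labels(np.array([0, 2]), num_classes=4, eps=0.1)
# Row 0 is [0.925, 0.025, 0.025, 0.025]; every row still sums to 1.
```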
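Spectral normalisation: constrain the spectral norm (largest singular value) of each weight matrix, usually by dividing the matrix by an estimate of that norm. Used in SN-GAN and other generative models to keep the discriminator approximately Lipschitz (WGAN-GP pursues the same goal with a gradient penalty instead).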
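The largest singular value is usually estimated by power iteration rather than a full SVD. The sketch below runs the iteration to near convergence for clarity; practical implementations typically run only one iteration per training step and carry the vector `u` over between steps. Names and defaults are illustrative assumptions.

```python
import numpy as np

def spectral_normalise(W, n_iters=100, rng=None):
    """Divide W by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimate of the spectral norm of W
    return W / sigma, sigma
```

After normalisation the matrix has spectral norm approximately 1, so the corresponding linear map is approximately 1-Lipschitz.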
Stochastic depth / DropPath: randomly skip residual blocks during training, so that only the identity branch is used for the skipped blocks. Used in EfficientNet and modern ViT variants.
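A per-example version of this idea can be sketched as a function applied to the residual branch only. The inverted rescaling by `1 / survival_prob` during training is one common convention and is an assumption here; the original stochastic depth formulation instead rescales at test time.

```python
import numpy as np

def drop_path(residual, survival_prob, rng, training=True):
    """Zero the whole residual branch for a random subset of examples in the batch."""
    if not training or survival_prob >= 1.0:
        return residual
    # One keep/drop decision per example, broadcast over the remaining dimensions.
    mask_shape = (residual.shape[0],) + (1,) * (residual.ndim - 1)
    keep = rng.random(mask_shape) < survival_prob
    return residual * keep / survival_prob

# Usage inside a residual block (block_fn is a hypothetical layer function):
#   out = x + drop_path(block_fn(x), survival_prob=0.9, rng=rng)
```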
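Weight decay vs L2: in plain SGD (without momentum) they are equivalent up to a rescaling of the coefficient by the learning rate. With adaptive optimisers such as Adam they differ, because the gradient of the L2 penalty is rescaled by the adaptive per-parameter step size; AdamW decouples weight decay from the gradient update, which often gives better empirical results.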
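The difference can be seen in a single update step. The sketch below implements one Adam step with two optional knobs, `l2` (penalty gradient added before Adam's rescaling) and `weight_decay` (decoupled shrinkage in the style of AdamW); the function and its argument names are hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One Adam update with optional coupled L2 or decoupled weight decay."""
    g = grad + l2 * theta                 # coupled: penalty passes through Adam's rescaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    adaptive_update = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay = lr * weight_decay * theta     # decoupled: proportional to the weight itself
    return theta - adaptive_update - decay, m, v

theta = np.array([1.0, 10.0])
zeros = np.zeros(2)

# With a zero data gradient, coupled L2 moves both weights by about lr,
# while decoupled decay shrinks each weight in proportion to its size.
theta_l2, _, _ = adam_step(theta, zeros, zeros, zeros, t=1, lr=0.01, l2=0.1)
theta_wd, _, _ = adam_step(theta, zeros, zeros, zeros, t=1, lr=0.01, weight_decay=0.1)
```

In the coupled case the large weight is decayed no more strongly than the small one, which is the behaviour that decoupling is meant to avoid.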
Implicit regularisation of SGD: in non-convex landscapes, stochastic gradient noise appears to bias solutions toward flat minima, which tend to generalise better. This is largely an empirical observation; the theoretical understanding is incomplete.
The bias-variance trade-off that regularisation manages:
- Too little regularisation → high variance, overfitting.
- Too much → high bias, underfitting.
- $\lambda$ is chosen by cross-validation, a held-out validation set, or Bayesian model selection.
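Choosing $\lambda$ by cross-validation can be sketched as follows for ridge regression. The helper names, the candidate grid, and the use of mean squared error as the validation metric are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for a squared-error data loss."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the lambda with the lowest mean validation error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_errors = []
    for lam in lambdas:
        errs = []
        for k in range(n_folds):
            val_idx = folds[k]
            tr_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            theta = ridge_fit(X[tr_idx], y[tr_idx], lam)
            errs.append(np.mean((X[val_idx] @ theta - y[val_idx]) ** 2))
        mean_errors.append(np.mean(errs))
    best = int(np.argmin(mean_errors))
    return lambdas[best], mean_errors

# Synthetic data with more noise than signal per feature (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
y = X @ (0.3 * rng.normal(size=30)) + rng.normal(size=60)
lambdas = [1e-3, 1e-1, 1.0, 10.0, 100.0, 1e4]
best_lam, errors = cv_select_lambda(X, y, lambdas)
```

Plotting `errors` against `lambdas` typically shows the U-shape implied by the list above: high validation error at both extremes and a minimum in between.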
Modern overparameterised neural networks (more parameters than training examples) have changed the picture: heavy explicit regularisation often hurts, and a model can interpolate the training data while still generalising well. Test error can fall again as capacity grows past the interpolation threshold, a pattern known as the double descent phenomenon.
Related terms: Dropout, Bias-Variance Tradeoff
Discussed in:
- Chapter 6: ML Fundamentals, Machine Learning Fundamentals