Glossary

Regularisation

Regularisation modifies a learning objective by adding a penalty term that discourages model complexity:

$$\mathcal{L}_\mathrm{reg}(\theta) = \mathcal{L}_\mathrm{data}(\theta) + \lambda R(\theta)$$

The data-fitting loss $\mathcal{L}_\mathrm{data}$ measures how well the model fits the training data; the regulariser $R(\theta)$ penalises some notion of complexity; $\lambda > 0$ trades the two off.

L2 regularisation (ridge / weight decay):

$$R(\theta) = \frac{1}{2} \|\theta\|_2^2$$

Equivalent to a Gaussian prior $\theta \sim \mathcal{N}(0, \lambda^{-1} I)$ in the Bayesian view (the regularised loss is the negative log posterior). Shrinks all parameters toward zero. Differentiable, convex, has closed-form solutions for linear models. Standard for nearly all neural networks.
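The closed-form solution mentioned above can be sketched for linear regression. This is a minimal illustration, assuming a squared-error data loss $\frac{1}{2}\|X\theta - y\|_2^2$; the function name `ridge_fit` and the synthetic data are illustrative assumptions, not part of the entry.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise 0.5*||X @ theta - y||^2 + 0.5*lam*||theta||^2 in closed form."""
    n_features = X.shape[1]
    # Normal equations with the penalty added to the diagonal.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

theta_small = ridge_fit(X, y, lam=1e-3)   # almost ordinary least squares
theta_large = ridge_fit(X, y, lam=100.0)  # strongly shrunk toward zero
```

A larger `lam` gives a solution with a smaller norm, which is the shrinkage described above.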

L1 regularisation (lasso):

$$R(\theta) = \|\theta\|_1 = \sum_i |\theta_i|$$

Equivalent to a Laplace prior $\theta \sim \mathrm{Laplace}(0, \lambda^{-1})$. Induces sparsity: many parameters are driven exactly to zero. Useful for feature selection and interpretable models. Non-differentiable at zero; solved by coordinate descent or proximal methods. The lasso in linear regression is the canonical example.
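One proximal method for the lasso is proximal gradient descent (ISTA), whose key step is soft-thresholding. The sketch below assumes a squared-error data loss; `soft_threshold`, `lasso_ista` and the synthetic data are illustrative names and values, not taken from the entry.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimise 0.5*||X @ theta - y||^2 + lam*||theta||_1 by proximal gradient."""
    theta = np.zeros(X.shape[1])
    # Step size 1/L, where L is the largest eigenvalue of X.T @ X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Synthetic data with a sparse ground truth (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)
theta_hat = lasso_ista(X, y, lam=10.0)
```

Because soft-thresholding returns exact zeros whenever the input magnitude falls below the threshold, the iterates can contain exact zeros rather than merely small values.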

Elastic net: combines both penalties as $\lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2} \|\theta\|_2^2$. It tends to select correlated features in groups, which addresses lasso's instability when features are correlated.
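For proximal methods, the combined penalty has a simple proximal operator: soft-threshold by the L1 part, then rescale by the L2 part. A minimal sketch follows; the function name `prox_elastic_net` and the `step` argument are assumptions made for illustration.

```python
import numpy as np

def prox_elastic_net(v, step, lam1, lam2):
    """Proximal operator of step * (lam1*||.||_1 + 0.5*lam2*||.||_2^2)."""
    # L1 part: soft-thresholding produces exact zeros.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)
    # L2 part: uniform multiplicative shrinkage.
    return shrunk / (1.0 + step * lam2)

result = prox_elastic_net(np.array([3.0, 0.5, -2.0]), step=1.0, lam1=1.0, lam2=1.0)
```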

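Early stopping is implicit regularisation: stop training before the model has fully fit the training data, typically when validation loss stops improving. For linear models trained by gradient descent it is approximately equivalent to L2 regularisation, with fewer training steps playing the role of a larger $\lambda$.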
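A common way to implement early stopping is a patience rule on the validation loss. The sketch below is one such rule, not the only one; `early_stopping_epoch`, `patience` and the example curve are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training and restore the weights from best_epoch
    return best_epoch

# Made-up validation curve: it bottoms out at epoch 3, then rises.
val_curve = [1.00, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
best = early_stopping_epoch(val_curve, patience=3)  # best == 3
```

Training halts at epoch 6, so the later value 0.40 is never seen. That is the cost of a finite patience.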

Dropout randomly zeros activations during training: each unit is kept with probability $p$ and dropped otherwise. It can be interpreted as approximate Bayesian inference (Gal & Ghahramani 2016) and discourages co-adaptation of features. Standard in fully-connected layers; less common in modern Transformer architectures.
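The following sketch shows the widely used "inverted" variant of dropout, which rescales the surviving activations by $1/p$ during training so that no rescaling is needed at inference. The entry does not specify this scaling; it is an assumption here, as are the names `dropout` and `keep_prob`.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: keep each unit with probability keep_prob."""
    if not training:
        return x  # inference: activations pass through unchanged
    mask = rng.random(x.shape) < keep_prob
    # Rescale so the expected activation matches the no-dropout value.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, keep_prob=0.8, rng=rng)
```

Each entry of `dropped` is either 0 or 1.25, and the mean stays close to 1.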

Batch normalisation has a regularising effect from the noise of mini-batch statistics.

Data augmentation is a form of regularisation: training on transformed versions of the data (rotations, crops, mixup, cutout, mixmatch) encourages invariances that the model would otherwise have to learn from limited data, which reduces overfitting to incidental details.
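As one concrete example, mixup blends pairs of examples and their labels with a weight drawn from a Beta distribution. This is a minimal sketch; the name `mixup_batch`, the choice of one mixing weight per batch, and the toy data are assumptions for illustration.

```python
import numpy as np

def mixup_batch(x, y_one_hot, alpha, rng):
    """Blend each example with a randomly chosen partner from the same batch."""
    mix = rng.beta(alpha, alpha)        # mixing weight in [0, 1]
    perm = rng.permutation(len(x))      # partner index for each example
    x_mixed = mix * x + (1.0 - mix) * x[perm]
    y_mixed = mix * y_one_hot + (1.0 - mix) * y_one_hot[perm]
    return x_mixed, y_mixed

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # 4 toy examples, 3 features
y = np.eye(2)[np.array([0, 1, 1, 0])]   # one-hot labels for 2 classes
x_mixed, y_mixed = mixup_batch(x, y, alpha=0.2, rng=rng)
```

The mixed labels remain valid probability vectors because they are convex combinations of one-hot vectors.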

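Label smoothing: replace the one-hot target $y$ with $(1 - \epsilon)\, y + \epsilon / K$, where $K$ is the number of classes, so that a fraction $\epsilon$ of the probability mass is spread uniformly. Prevents the model from being over-confident; standard in modern image classification and language modelling.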
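The smoothing formula translates directly into code. A minimal sketch, with `smooth_labels` as an assumed helper name:

```python
import numpy as np

def smooth_labels(labels, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / K for integer class labels."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=4, epsilon=0.1)
# targets[0] is [0.025, 0.025, 0.925, 0.025]
```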

Spectral normalisation: constrain the spectral norm (largest singular value) of weight matrices, typically by dividing each matrix by an estimate of that norm. Used in SN-GAN (Miyato et al. 2018) and other generative models.
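The largest singular value is usually estimated by power iteration rather than a full SVD. The sketch below runs many iterations on a fixed matrix for clarity; practical implementations often run a single iteration per training step and reuse the vector. The name `spectral_normalise` and the example matrix are assumptions.

```python
import numpy as np

def spectral_normalise(W, n_iters=50, rng=None):
    """Return W divided by its estimated largest singular value, and the estimate."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        # Alternate between right and left singular vector estimates.
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v
    return W / sigma, sigma

W = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
W_sn, sigma = spectral_normalise(W)  # sigma is close to 3
```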

Stochastic depth / DropPath: randomly skip residual blocks during training. Used in EfficientNet, modern ViT.
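A residual block with stochastic depth can be sketched as follows. This version rescales the surviving branch by the inverse survival probability during training (the convention common in DropPath implementations); the original stochastic depth formulation instead rescales at test time. The name `drop_path_block` and the toy `block_fn` are assumptions.

```python
import numpy as np

def drop_path_block(x, block_fn, survival_prob, rng, training=True):
    """Residual block x + f(x) whose branch f is randomly skipped in training."""
    if not training:
        return x + block_fn(x)
    if rng.random() < survival_prob:
        # Branch survives: rescale so its expected contribution is f(x).
        return x + block_fn(x) / survival_prob
    return x  # branch skipped: only the identity path remains

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
double = lambda z: 2.0 * z  # stand-in for a real residual branch
out_eval = drop_path_block(x, double, survival_prob=0.8, rng=rng, training=False)
```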

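Weight decay vs L2: in plain SGD the two are equivalent (up to a rescaling of the coefficient by the learning rate). With adaptive optimisers such as Adam they differ, because an L2 penalty added to the loss is rescaled by the adaptive step size; AdamW decouples weight decay from the gradient update, which often gives better empirical results.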
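The difference can be seen in a single update step. The sketch below writes out one step of SGD and of Adam in both forms; the function names are assumptions, and the Adam step follows the standard bias-corrected update with default hyperparameters.

```python
import numpy as np

def sgd_l2_step(theta, grad, lr, lam):
    # L2 penalty: its gradient lam * theta is added to the loss gradient.
    return theta - lr * (grad + lam * theta)

def sgd_weight_decay_step(theta, grad, lr, lam):
    # Weight decay: shrink the weights, then take a plain gradient step.
    return (1.0 - lr * lam) * theta - lr * grad

def adam_step(theta, grad, m, v, t, lr, lam, decoupled,
              beta1=0.9, beta2=0.999, eps=1e-8):
    if not decoupled:
        grad = grad + lam * theta  # Adam + L2: penalty enters the moment estimates
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    new_theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        new_theta = new_theta - lr * lam * theta  # AdamW: decay applied separately
    return new_theta, m, v

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
zeros = np.zeros(2)
theta_l2, _, _ = adam_step(theta, grad, zeros, zeros, 1, 0.1, 0.1, decoupled=False)
theta_w, _, _ = adam_step(theta, grad, zeros, zeros, 1, 0.1, 0.1, decoupled=True)
```

The two SGD functions return the same result for any inputs, while `theta_l2` and `theta_w` differ: with Adam, the L2 term is divided by the adaptive denominator, whereas the decoupled decay is not.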

Implicit regularisation of SGD: stochastic gradient noise, in non-convex landscapes, biases solutions toward flat minima that generalise better. Empirical observation; theoretical understanding incomplete.

Regularisation manages the bias-variance trade-off: increasing $\lambda$ typically raises bias and lowers variance, while decreasing it does the opposite.
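For reference, under the common assumption (not stated in this entry) that $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, the expected squared error of a fitted model $\hat{f}$ at a point $x$, averaged over training sets and noise, decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2$$

The first term is the squared bias, the second the variance, and the third irreducible noise.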

Modern overparameterised neural networks (more parameters than training examples) have changed the picture: heavy regularisation often hurts, and the model can interpolate the training data while still generalising well (the double descent phenomenon).

Interactive

Overfitting and early stopping. Training loss keeps falling. Validation loss bottoms out, then rises. The gap is overfitting.
Dropout zeros out a random subset of activations each forward pass. Half of the neurons are silenced randomly, forcing the network to spread information across many paths.
Lasso vs Ridge: regularisation paths. As the penalty grows, Lasso sets coefficients to zero one by one; Ridge shrinks all together.

Related terms: Dropout, Bias-Variance Tradeoff
