Regularisation encompasses any technique that constrains or penalises the complexity of a model in order to improve its generalisation. Without regularisation, sufficiently flexible models will fit not only genuine patterns but also the noise in the training data. Regularisation injects a preference for simpler solutions, effectively navigating the bias–variance tradeoff toward lower total error.
L2 regularisation (ridge, weight decay) adds a penalty proportional to the squared norm of the parameters: $L_{reg} = L + \lambda \|\mathbf{w}\|_2^2$. It shrinks weights toward zero and has a Bayesian interpretation as a Gaussian prior on the weights. L1 regularisation (lasso) uses the $\ell_1$ norm instead, driving many weights exactly to zero and thus performing feature selection. Elastic net combines both penalties. Dropout randomly deactivates neurons during training, approximately training an ensemble of sub-networks. Batch normalisation has a mild regularising effect through its mini-batch stochasticity. Data augmentation synthesises new training examples via label-preserving transformations. Early stopping halts training when validation loss begins rising.
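A minimal sketch of the L2 and L1 penalties above, on synthetic linear data where only two of ten features matter. The `fit` function and its parameters are illustrative, not from any library: L2 enters the gradient directly, while L1 is applied as a proximal (soft-thresholding) step, which is what lets the lasso drive weights exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear data: only the first 2 of 10 features carry signal.
n, d = 200, 10
true_w = np.zeros(d)
true_w[:2] = [3.0, -2.0]
X = rng.normal(size=(n, d))
y = X @ true_w + 0.5 * rng.normal(size=n)

def fit(lam_l2=0.0, lam_l1=0.0, lr=0.01, steps=2000):
    """Gradient descent on MSE + lam_l2*||w||_2^2 + lam_l1*||w||_1."""
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n + 2 * lam_l2 * w
        w -= lr * grad
        # Proximal step for the L1 term: shrink toward zero,
        # clipping small weights to exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam_l1, 0.0)
    return w

w_ols = fit()                 # no penalty: all 10 weights nonzero
w_ridge = fit(lam_l2=1.0)     # shrunk toward zero, none exactly zero
w_lasso = fit(lam_l1=1.0)     # noise features driven exactly to zero

print("nonzero weights, OLS:  ", int(np.sum(np.abs(w_ols) > 1e-3)))
print("nonzero weights, lasso:", int(np.sum(np.abs(w_lasso) > 1e-3)))
print("ridge norm < OLS norm: ", np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```

Running this shows the qualitative behaviour claimed in the text: ridge shrinks the whole weight vector, while lasso zeroes out the irrelevant features, effectively selecting the two true ones.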
Each technique ideally trades a small increase in bias for a larger reduction in variance. The regularisation strength is typically chosen by cross-validation. Modern deep learning often combines several regularisers: weight decay plus dropout plus data augmentation plus early stopping, each contributing a distinct form of complexity control. Label smoothing and mixup are more recent additions, both of which improve calibration and robustness in classification.
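Label smoothing and mixup can both be expressed as transformations of the training targets. The following sketch (function names are illustrative, not from any library) shows label smoothing mixing a one-hot target with the uniform distribution, and mixup forming a convex combination of two examples with a Beta-distributed coefficient, as in the original mixup formulation.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: blend the one-hot target with the uniform
    distribution over k classes, so the model is never pushed
    toward fully confident (probability-1) predictions."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: train on a convex combination of two examples.
    The mixing coefficient lam is drawn from Beta(alpha, alpha);
    the resulting soft label interpolates the two targets."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

y = np.array([0.0, 0.0, 1.0])
print(smooth_labels(y))  # true class gets 1 - eps + eps/k, others eps/k
```

Both transformations leave the target a valid probability distribution, which is why they regularise the classifier's output confidence rather than its weights.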
Related terms: Overfitting, Dropout, Weight Decay, Batch Normalisation
Discussed in:
- Chapter 6: ML Fundamentals — Regularisation
- Chapter 10: Training & Optimisation — Regularisation in Deep Learning
Also defined in: Textbook of AI, Textbook of Medical AI