6.6 Capacity, complexity, and regularisation
A model's capacity is its expressiveness, the range of functions it can represent. A linear regression has rather little capacity: it can only draw straight lines through the data. A deep neural network with millions of parameters has enormous capacity: it can carve the input space into intricate, almost arbitrary shapes. Capacity is not, by itself, good or bad. It is a dial. Turn the dial too low and the model cannot capture the genuine pattern in the data: it underfits. Turn it too high and the model latches onto the noise as well as the signal: it memorises the training set and performs poorly on anything new. The art of machine learning, more than half of it, is finding the right setting of this dial.
Regularisation is the collective name for the techniques we use to control effective capacity. Sometimes regularisation works by literally shrinking the model: fewer parameters, smaller weights. Sometimes it works by injecting noise during training, or by stopping training early, or by sharing parameters between layers. The unifying idea is that regularisation tilts the optimisation away from solutions that fit the training data perfectly and towards solutions that are simpler, smoother, or more constrained, and therefore more likely to generalise.
§6.4 introduced generalisation as the central problem of supervised learning: we want low error on data we have never seen, not just on the training set. §6.6 is where we meet the levers we actually pull to manage that gap. §9.12 returns to the topic specifically for neural networks, where dropout, batch normalisation, and a host of architecture-specific tricks become essential.
Capacity
Informally, the capacity of a model class is the answer to the question: how many distinct functions can the class represent? If your class can only represent ten functions, then no matter how clever your training algorithm, the best you will ever do is the best of those ten. If your class can represent every function imaginable, including ones that fit any random labelling of any dataset, then training data alone cannot tell you which function to prefer: you can always find one that fits perfectly, noise included.
Formally, capacity is measured in several ways, each capturing a different aspect of "expressiveness".
Number of parameters. The simplest measure. A linear regression on $d$ input features has $d+1$ parameters (one weight per feature plus a bias). A neural network with millions of weights has, well, millions. Parameter count is a crude proxy: it ignores how the parameters are wired together, and it can vastly overstate or understate the true expressive power. But it is easy to compute and often correlates with the more refined measures.
VC dimension. The Vapnik–Chervonenkis dimension is a classical, more careful measure. It is the size of the largest set of points that the model class can label in every possible way. If you can find $h$ points such that, for every one of the $2^h$ possible binary labellings of those points, some model in your class gets all the labels right, then the VC dimension is at least $h$. Linear classifiers in $\mathbb{R}^d$ have VC dimension exactly $d+1$: three points in the plane can be separated in any of the $2^3 = 8$ ways by some line, but four points cannot in general. Polynomials of degree $k$ over the real line have VC dimension $k+1$. A decision tree of depth $m$ partitions the input into at most $2^m$ regions, giving VC dimension at most $2^m$, exponential in depth, which is why deep trees overfit so easily. For neural networks the VC dimension scales roughly with the number of parameters, although the precise bound depends sensitively on the architecture and the activation functions used.
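To make shattering concrete, here is a minimal sketch that brute-forces every labelling of a small point set and checks linear separability. It assumes scikit-learn is available and uses an almost-unregularised logistic regression as the linear classifier; any linear separator would do.

```python
# Sketch: test whether a set of 2-D points is shattered by lines, by trying
# every binary labelling and checking linear separability.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

def shatterable(points):
    """True if some line realises every binary labelling of `points`."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # one-class labellings are trivially realisable
        clf = LogisticRegression(C=1e9, max_iter=10_000)  # ~unregularised
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False  # this labelling is not linearly separable
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # non-collinear
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
print(shatterable(three))  # True: all 8 labellings separable, so VC dim >= 3
print(shatterable(four))   # False: the XOR labelling defeats every line
```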
Rademacher complexity. A more modern, distribution-aware measure. Rademacher complexity asks how well functions in your class can fit random noise: if I label your training points uniformly at random, by flipping a coin for each one, how good a fit can the class still produce? A class that can fit pure noise has very high effective capacity. Crucially, Rademacher complexity depends on the data distribution, not just the model class, which makes it tighter than VC dimension in many real settings.
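For the norm-bounded linear class $\{\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_2 \leq 1\}$ the supremum over the class has a closed form (by Cauchy–Schwarz), which makes the empirical Rademacher complexity easy to estimate by Monte Carlo. A minimal sketch under that assumption:

```python
# Sketch: Monte Carlo estimate of the empirical Rademacher complexity of the
# linear class {x -> <w, x> : ||w||_2 <= 1} on a fixed sample X. For this
# class, sup_w (1/n) sum_i sigma_i <w, x_i> = ||(1/n) sum_i sigma_i x_i||_2.
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(X, trials=2000):
    n = len(X)
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # coin-flip labels
        total += np.linalg.norm(sigma @ X) / n   # best fit the class achieves
    return total / trials

print(empirical_rademacher(rng.normal(size=(200, 10))))  # shrinks ~ 1/sqrt(n)
print(empirical_rademacher(rng.normal(size=(800, 10))))  # smaller with more data
```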
Covering numbers offer yet another lens, counting how many "representative" functions are needed to approximate every function in the class to within a given tolerance. They generalise easily to infinite classes and connect cleanly with information-theoretic generalisation bounds.
The take-away for a beginner is this: capacity is not a single number, but several closely related notions, all of which try to capture how flexible your model class is. More flexible means more expressive, and harder to keep on the rails.
The capacity-generalisation tradeoff
The whole reason we care about capacity is that it controls the gap between training error and test error.
A model with too little capacity cannot represent the true relationship in the data at all. It will have a high training error and a similar (high) test error. We say it has high bias: a built-in mismatch between the model class and reality. Drawing a straight line through points that genuinely lie on a curve is the textbook example. No matter how much data you collect, the line will never bend. The errors are systematic, not random.
A model with too much capacity can fit the training points exactly, including any random fluctuations or measurement noise. It will have very low training error but high test error, because the wiggles it has learned reflect that particular training set, not the underlying truth. We say it has high variance: tiny changes in the training sample produce wildly different fitted models. Pass a million-degree polynomial through twenty data points and watch it explode between them.
The sweet spot lies between these extremes. We want enough capacity to capture the genuine structure but not so much that we model the noise. In practice we find this sweet spot with held-out data: set aside part of the training data as a validation set, fit candidate models of varying capacity on the rest, evaluate each on the held-out portion, and pick the one that does best on data the model did not see during fitting (rotating the held-out portion across several folds gives cross-validation proper). The hyperparameters that control capacity (polynomial degree, tree depth, regularisation strength, number of hidden units) become things we tune rather than guess.
A useful mental picture: as capacity increases, training error falls monotonically (more flexibility means a tighter fit), but test error first falls and then rises. The minimum of the test-error curve is what we are hunting for.
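A minimal sketch of that hunt, fitting polynomials of increasing degree to noisy samples of a sine curve and scoring them on held-out points:

```python
# Sketch: the test-error U-curve. Training error falls with degree;
# validation error falls and then rises again.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
def truth(x): return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30); y_train = truth(x_train) + 0.2 * rng.normal(size=30)
x_val = rng.uniform(0, 1, 200);  y_val = truth(x_val) + 0.2 * rng.normal(size=200)

for degree in [1, 3, 5, 9, 15]:
    model = Polynomial.fit(x_train, y_train, degree)   # the capacity dial
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    val_mse = np.mean((model(x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train {train_mse:.3f}   val {val_mse:.3f}")
# The degree whose validation error is lowest marks the sweet spot.
```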
L2 regularisation (weight decay)
The most common regulariser, by an enormous margin, is L2 regularisation, also called weight decay or ridge regression depending on the context. The idea is simple: add a penalty term to the loss function that grows with the size of the weights. Specifically, replace the loss $L(\mathbf{w})$ with
$$ L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, $$
where $\|\mathbf{w}\|_2^2 = \sum_i w_i^2$ is the squared Euclidean norm of the weight vector and $\lambda \geq 0$ is the regularisation strength. The factor of $\tfrac{1}{2}$ is cosmetic: it makes the gradient cleaner.
Compute the gradient of the penalty: it is simply $\lambda \mathbf{w}$. So at every step of gradient descent we add an extra pull towards zero, proportional to the current weight. Large weights are pulled hard; small weights are pulled gently. The effect is a smooth, continuous shrinkage of all weights towards the origin. None of them are forced exactly to zero, but they are all kept on a leash.
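In code, the entire mechanism is one extra term in the update. A minimal sketch:

```python
# Sketch: one gradient-descent step with L2 regularisation. The penalty's
# gradient is lambda * w, an extra pull towards the origin whose strength
# is proportional to the weight itself.
import numpy as np

def gd_step_with_weight_decay(w, grad_loss, lam, lr):
    """w: current weights; grad_loss: gradient of the unpenalised loss at w."""
    return w - lr * (grad_loss + lam * w)

# Algebraically the same step, written the way many optimisers implement it:
#   w_new = (1 - lr * lam) * w - lr * grad_loss
```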
There is a clean Bayesian interpretation. Adding L2 regularisation is mathematically equivalent to placing a Gaussian prior $\mathcal{N}(0, 1/\lambda)$ on each weight and finding the maximum-a-posteriori estimate. A larger $\lambda$ corresponds to a tighter prior, a stronger belief that weights should be near zero. Through this lens, regularisation is not a hack; it is the rigorous Bayesian answer to "what should I believe before I see the data?".
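One line makes the correspondence concrete. Writing $\mathcal{D}$ for the training data and taking the loss $L$ to be the negative log-likelihood, independent $\mathcal{N}(0, 1/\lambda)$ priors give $\log p(\mathbf{w}) = -\tfrac{\lambda}{2}\|\mathbf{w}\|_2^2 + \text{const}$, so

$$ \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w}) \right] = \arg\min_{\mathbf{w}} \left[ L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \right]. $$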
Geometrically, L2 regularisation favours solutions where the weight vector lies inside a Euclidean ball. The boundary of that ball is smooth and rotationally symmetric, which is why L2 produces smooth, dense solutions: every weight is shrunk a little, none is killed entirely.
In practice, L2 regularisation tames overfitting, keeps gradients well-behaved, and almost always improves generalisation when used at a sensibly chosen strength. It is the default regulariser for linear regression, logistic regression, support vector machines, and the great majority of neural-network optimisers.
L1 regularisation (lasso)
A close cousin replaces the squared norm with the sum of absolute values:
$$ L(\mathbf{w}) + \lambda \|\mathbf{w}\|_1, \quad \|\mathbf{w}\|_1 = \sum_i |w_i|. $$
This is L1 regularisation, also called the lasso in the statistics literature. The penalty looks similar to L2 but the consequences are surprisingly different. L1 drives many weights to exactly zero, producing sparse solutions in which only a few features actually contribute to the prediction.
Why does L1 produce sparsity while L2 only shrinks? The geometry. The unit ball in the L1 norm is a diamond (or a high-dimensional cross-polytope), with sharp corners on each axis. Whenever the unconstrained loss minimum lies outside this diamond, the constrained minimum tends to land on a corner or a low-dimensional face, where all but a few coordinates are exactly zero. The unit ball in the L2 norm, by contrast, is a smooth round sphere with no special points, so the constrained minimum sits somewhere on the surface with all coordinates typically non-zero but small.
The Bayesian interpretation here is a Laplace prior on each weight, which has a sharp peak at zero, encoding a stronger prior belief that most weights really are zero, not merely small.
L1 is the regulariser of choice when you suspect that only a small subset of input features actually matters and you want the model itself to tell you which ones. It is widely used in genomics (which genes predict disease?), text classification (which words matter?), and sparse signal recovery. The downside: optimisation is harder, because the absolute value is not differentiable at zero, and standard gradient descent must be replaced with subgradient methods or proximal algorithms.
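The proximal step at the heart of those algorithms is worth seeing, because it shows exactly where the zeros come from. A minimal sketch of soft thresholding, the proximal map of the L1 penalty:

```python
# Sketch: soft thresholding. One proximal-gradient (ISTA) step is an ordinary
# gradient step on the loss followed by this shrinkage; any coordinate within
# `threshold` of zero is set to exactly zero -- the source of the sparsity.
import numpy as np

def soft_threshold(w, threshold):
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def ista_step(w, grad_loss, lam, lr):
    return soft_threshold(w - lr * grad_loss, lr * lam)

print(soft_threshold(np.array([3.0, 0.4, -1.2, -0.1]), 0.5))
# [ 2.5  0.  -0.7 -0. ]   small coordinates are killed exactly
```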
The elastic net combines both penalties: $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$. It captures the feature-selection behaviour of L1 while retaining the stability of L2 when features are correlated.
Other regularisers
Beyond the L1/L2 pair, the practitioner's toolkit contains several other widely used regularisers, each with its own flavour.
Dropout. During training, randomly set a fraction $p$ of activations to zero on each forward pass. The network is forced to develop redundant pathways, since any unit might disappear at any time. At test time all units are kept and their outputs scaled to compensate. Dropout is roughly equivalent to averaging over an exponentially large ensemble of sub-networks, a powerful regulariser for deep neural networks.
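A minimal sketch of the widely used "inverted" variant, which moves the compensating scale into training so that test time needs no change at all:

```python
# Sketch: inverted dropout on a layer's activations. Dividing by (1 - p)
# during training preserves the expected activation, so the test-time
# forward pass is just the identity.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training):
    if not training or p == 0.0:
        return activations                     # test time: keep every unit
    mask = rng.random(activations.shape) >= p  # each unit survives w.p. 1-p
    return activations * mask / (1.0 - p)      # rescale to preserve the mean
```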
Data augmentation. Synthesise new training examples by applying label-preserving transformations to existing ones. For images: random crops, horizontal flips, rotations, brightness and colour jitter. For audio: time stretching, pitch shifting, additive noise. For text: synonym substitution, back-translation. Augmentation effectively multiplies the size of the training set and forces the model to be invariant to the chosen transformations. It is often the single most effective regulariser in computer vision.
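A minimal sketch of two such transformations for image arrays of shape (height, width, channels):

```python
# Sketch: label-preserving image augmentations. Each call to the pipeline
# yields a slightly different view of the same underlying example.
import numpy as np

rng = np.random.default_rng(0)

def random_flip(image):
    """Mirror the image left-right half the time."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image

def random_crop(image, crop_h, crop_w):
    """Cut out a randomly positioned crop_h x crop_w window."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

augmented = random_flip(random_crop(np.zeros((32, 32, 3)), 28, 28))
```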
Early stopping. Train for fewer epochs. Monitor validation error during training and stop as soon as it begins to rise, even if training error is still falling. Early stopping is essentially free, requires no change to the loss function, and is approximately equivalent (for quadratic losses optimised by gradient descent with small step sizes) to L2 regularisation whose strength is controlled by the iteration count.
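A minimal sketch of the usual patience-based recipe; `train_epoch` and `val_loss_fn` are hypothetical callables standing in for your own training loop:

```python
# Sketch: early stopping with patience. Stop once validation loss has failed
# to improve for `patience` consecutive epochs; report the best epoch seen.
def train_with_early_stopping(train_epoch, val_loss_fn, max_epochs=100, patience=5):
    best_loss, best_epoch, stale_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = val_loss_fn()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            stale_epochs = 0        # in practice, also checkpoint the weights here
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break               # validation error has stopped falling
    return best_epoch, best_loss
```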
Batch normalisation. Normalise the activations of each layer to have zero mean and unit variance within each mini-batch, then apply a learned affine transform. The original motivation was to stabilise training, but it turns out that the noise introduced by tying each example's normalisation to its mini-batch mates acts as a stochastic regulariser.
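A minimal sketch of the training-mode computation for a (batch, features) array; `gamma` and `beta` are the learned affine parameters. The dependence of the statistics on the particular mini-batch is exactly the noise source described above; at test time, running averages replace them.

```python
# Sketch: batch normalisation, training mode.
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # statistics of *this* mini-batch
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned rescale and shift
```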
Label smoothing. Replace one-hot training targets, say, the vector $(0, 0, 1, 0, 0)$ for class 3, with a softened version such as $(0.025, 0.025, 0.9, 0.025, 0.025)$. This stops the model from becoming over-confident on training examples and improves calibration on test data.
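A minimal sketch; with $K = 5$ classes, $\epsilon = 0.125$ reproduces the vector above:

```python
# Sketch: label smoothing. Blend the one-hot target with the uniform
# distribution over the K classes.
import numpy as np

def smooth_labels(one_hot, eps):
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

print(smooth_labels(np.array([0.0, 0.0, 1.0, 0.0, 0.0]), 0.125))
# [0.025 0.025 0.9   0.025 0.025]
```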
Spectral normalisation. Constrain the largest singular value of each weight matrix to be at most one. This bounds how much each layer can amplify inputs and stabilises training of generative adversarial networks and other delicate architectures.
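A minimal sketch using power iteration, the standard cheap estimator of the largest singular value:

```python
# Sketch: estimate sigma_max(w) by power iteration, then rescale the matrix
# so its spectral norm is at most one.
import numpy as np

def spectral_normalise(w, iters=20):
    u = np.random.default_rng(0).normal(size=w.shape[0])
    for _ in range(iters):
        v = w.T @ u; v /= np.linalg.norm(v)  # alternate between w^T and w
        u = w @ v;  u /= np.linalg.norm(u)
    sigma = u @ w @ v                        # estimate of the top singular value
    return w / max(sigma, 1.0)               # shrink only if the norm exceeds 1
```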
Weight tying. Share the same parameters across multiple positions in the network, for example, using the same matrix for input and output embeddings in a language model. This removes a large block of parameters (the embedding matrices often dominate the count in language models) without a matching loss of expressiveness.
Regularisation as constraints
There are two equivalent ways to think about regularisation. The first is the penalty form we have been using: minimise
$$ L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2. $$
The second is the constraint form: minimise $L(\mathbf{w})$ subject to $\|\mathbf{w}\|_2 \leq C$. Here we are saying outright "I will only consider weight vectors within a ball of radius $C$".
For convex losses, Lagrangian duality makes these two formulations equivalent. Every value of the constraint $C$ corresponds to some value of the penalty $\lambda$, and vice versa: as $\lambda$ increases, the effective ball shrinks; as $\lambda \to 0$, the ball expands to cover all of weight-space and the penalty has no effect.
The constraint view is sometimes more intuitive, "I will not let the weights grow beyond this size", and underlies the analysis of methods like projected gradient descent, where we explicitly project back onto the feasible set after every step. The penalty view is usually easier to optimise and is what most software libraries actually implement. Both are mathematically the same idea, written from different angles.
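A minimal sketch of one projected-gradient step under the L2-ball constraint:

```python
# Sketch: projected gradient descent. Take an ordinary gradient step, then
# project back onto the feasible set {w : ||w||_2 <= radius}.
import numpy as np

def project_l2_ball(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd_step(w, grad_loss, lr, radius):
    return project_l2_ball(w - lr * grad_loss, radius)
```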
How much regularisation?
The strength $\lambda$ is a hyperparameter, not something the model learns from the data. We choose it by cross-validation. The standard recipe: start with a coarse logarithmic grid such as $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$, fit the model at each value, evaluate on a validation set, identify the best, and then optionally refine with a finer grid around it. For large models and expensive training, even a coarse grid may be all you can afford; that is normal and usually sufficient.
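A minimal sketch of that recipe for ridge regression, using the closed-form solution $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$ on synthetic data:

```python
# Sketch: choose lambda from a coarse logarithmic grid using a held-out split.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 10)); w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

best = None
for lam in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(10), X_tr.T @ y_tr)
    val_mse = np.mean((X_val @ w - y_val) ** 2)       # score on held-out data
    if best is None or val_mse < best[1]:
        best = (lam, val_mse)
print(f"best lambda: {best[0]}, validation MSE: {best[1]:.3f}")
```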
A practical rule of thumb: if training error is much lower than validation error, you need more regularisation. If training error is high and validation error is similar, you need less regularisation (or a larger model).
What you should take away
- Capacity is the dial that governs flexibility. Too little, and you underfit; too much, and you memorise the training set. Capacity has several formal measures (parameter count, VC dimension, Rademacher complexity, covering numbers), but the intuition is one and the same.
- Regularisation controls effective capacity. It tilts the optimisation away from solutions that fit the training data perfectly and towards simpler, smoother, more constrained ones that generalise better.
- L2 produces smooth shrinkage; L1 produces sparsity. L2 keeps every weight on a leash but rarely kills any. L1 drives many weights to exactly zero, performing implicit feature selection. The geometric reason, the diamond corners of the L1 ball versus the smooth sphere of the L2 ball, is worth holding in your head.
- The toolkit is rich. Dropout, data augmentation, early stopping, batch normalisation, label smoothing, spectral normalisation, and weight tying all act as regularisers, each with its own flavour and natural domain of application. Most modern deep-learning systems combine several at once.
- Choose the strength by cross-validation. $\lambda$ is a hyperparameter, not a learned quantity. A coarse logarithmic grid plus a validation set is usually enough; the gap between training and validation error tells you whether to push $\lambda$ up or down.