Glossary

Implicit Regularisation

Implicit regularisation denotes the inductive bias arising from the optimisation algorithm rather than from an explicit penalty in the loss. In overparameterised settings the loss has infinitely many global minima; the algorithm decides which one is reached, and that choice often determines generalisation.

The canonical result. Consider least-squares regression $\min_\theta \tfrac{1}{2} \|X \theta - y\|^2$ with $X \in \mathbb{R}^{N \times p}$, $p > N$. The set of interpolating $\theta$ is an affine subspace. Gradient descent initialised at $\theta_0 = 0$ converges to the minimum $\ell_2$-norm interpolant

$$\theta^* = X^\top (X X^\top)^{-1} y = \arg\min_\theta \|\theta\|_2 \quad \text{s.t.} \quad X \theta = y.$$

Because each gradient $X^\top (X\theta - y)$ lies in the row space of $X$, the iterates never leave that subspace, and the unique interpolant within it is the minimum-norm one. No explicit $\ell_2$ penalty appears in the loss, yet GD finds the minimum-norm solution: an implicit form of ridge regularisation.
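
A minimal numpy sketch of this result (the dimensions, step size, and iteration count below are arbitrary illustrative choices): gradient descent started at zero on an underdetermined least-squares problem lands on the same interpolant as the pseudoinverse formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 100                              # more parameters than observations
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Gradient descent on 0.5 * ||X theta - y||^2, initialised at zero.
theta = np.zeros(p)
lr = 0.5 / np.linalg.norm(X, ord=2) ** 2    # safely below 2 / largest eigenvalue of X^T X
for _ in range(50_000):
    theta -= lr * X.T @ (X @ theta - y)

# Minimum-l2-norm interpolant: theta* = X^T (X X^T)^{-1} y.
theta_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

print(np.linalg.norm(X @ theta - y))            # ~0: GD interpolates the data
print(np.linalg.norm(theta - theta_min_norm))   # ~0: and it is the minimum-norm interpolant
```

Starting from a non-zero $\theta_0$ instead, GD converges to the interpolant closest to $\theta_0$ in $\ell_2$, which is one reason initialisation matters below.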

Other algorithms, other geometries.

  • Mirror descent with potential $\Psi$ converges to the minimum-$\Psi$ interpolant. Gradient descent corresponds to $\Psi = \tfrac{1}{2}\|\theta\|^2$; using the entropy potential $\Psi = \sum_i \theta_i \log \theta_i$ recovers maximum-entropy solutions (see the sketch after this list).
  • Coordinate descent on logistic regression converges to the maximum-margin classifier in the $\ell_1$ sense, while gradient descent gives the $\ell_2$-margin classifier (Soudry et al., 2018).
  • Sign gradient descent (the Adam-like update) carries a different bias, toward the $\ell_\infty$ geometry.
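
As a rough illustration of the mirror-descent bullet above (problem sizes, step sizes, and iteration counts are ad hoc choices): exponentiated gradient, i.e. mirror descent with the entropy potential, should drive the same underdetermined least-squares loss close to zero yet land on a different interpolant from gradient descent, and GD's interpolant has the smaller $\ell_2$ norm.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 20
X = rng.normal(size=(N, p))
theta_true = rng.uniform(0.1, 1.0, size=p)      # guarantees a strictly positive interpolant exists
y = X @ theta_true

def loss_grad(theta):
    return X.T @ (X @ theta - y)

# Plain gradient descent from zero -> (near-)minimum-l2-norm interpolant.
theta_gd = np.zeros(p)
lr_gd = 0.5 / np.linalg.norm(X, ord=2) ** 2
for _ in range(50_000):
    theta_gd -= lr_gd * loss_grad(theta_gd)

# Exponentiated gradient: mirror descent with potential sum_i theta_i log theta_i.
# The update is multiplicative, so iterates stay strictly positive.
theta_eg = np.full(p, 0.5)
lr_eg = 1e-3
for _ in range(200_000):
    theta_eg *= np.exp(-lr_eg * loss_grad(theta_eg))

print(np.linalg.norm(X @ theta_gd - y), np.linalg.norm(X @ theta_eg - y))  # both should be ~0
print(np.linalg.norm(theta_gd), np.linalg.norm(theta_eg))                  # GD's l2 norm is smaller
print(theta_eg.min())                                                      # EG's solution stays positive
```

Same data, same loss, two optimisers, two different interpolants: the geometry of the update, not the loss, picks the solution.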

Implicit bias of gradient descent on classification. For separable data with logistic loss, gradient descent on a linear model converges in direction to the maximum $\ell_2$-margin solution, even though the loss has no margin term:

$$\frac{\theta_t}{\|\theta_t\|} \to \frac{\theta_\mathrm{SVM}}{\|\theta_\mathrm{SVM}\|}.$$

This explains why unregularised logistic regression asymptotically matches the hard-margin SVM on separable data. The same result extends to homogeneous neural networks (Lyu and Li, 2020), where the direction of convergence is a KKT point of the max-margin problem.
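
A small sanity check of directional convergence (the two-blob dataset, learning rate, and iteration counts are invented for illustration; the code only verifies that the norm keeps growing while the normalised direction stabilises, which is the behaviour the theorem describes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable 2-D data: two Gaussian blobs labelled +1 and -1.
n = 50
X = np.vstack([rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(n, 2)),
               rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

def grad(theta):
    # Gradient of the average logistic loss log(1 + exp(-y * x.theta)).
    margins = y * (X @ theta)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

theta = np.zeros(2)
lr = 0.1
for t in range(1, 300_001):
    theta -= lr * grad(theta)
    if t in (1_000, 10_000, 100_000, 300_000):
        print(t, np.linalg.norm(theta), theta / np.linalg.norm(theta))
```

The printed norm keeps increasing (roughly logarithmically in $t$) while the unit vector settles down; Soudry et al. identify that limiting direction with the hard-margin SVM solution for the same data.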

Sources of implicit regularisation in deep learning.

  1. Architecture and initialisation. Width, depth, residual connections, and initialisation scale all bias the function class explored.
  2. SGD noise. Mandt et al. (2017) modelled SGD as a continuous-time stochastic process whose noise covariance is set by the mini-batch gradients; this noise biases trajectories toward flat minima, i.e. minima where small parameter perturbations cause only small changes in loss. Keskar et al. (2017) showed empirically that flatter minima tend to generalise better.
  3. Learning rate. Large learning rates cannot settle in sharp minima (the updates overshoot and the loss blows up), so they implicitly select flatter regions. Smith et al. (2021) showed that training with a finite learning rate behaves like adding an explicit regulariser to the loss that penalises sharp, high-curvature solutions.
  4. Weight initialisation scale. Small initialisation produces feature-learning regimes; large initialisation gives lazy / NTK regimes. Different scales select different solutions.
  5. Early stopping. Stopping gradient descent before convergence acts like ridge regression, with an effective regularisation strength roughly inversely proportional to the product of the learning rate and the number of steps (a minimal sketch follows this list).
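
A hedged sketch of the early-stopping point (all constants below are arbitrary, and the GD-to-ridge correspondence with effective $\lambda \approx 1/(\eta t)$ is an approximation, not an identity): stopping gradient descent early yields a shrunken estimate comparable to a matched ridge fit, while running it to convergence recovers the unregularised least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 30
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=N)    # noisy linear data

def gd(steps, lr=1e-3):
    theta = np.zeros(p)
    for _ in range(steps):
        theta -= lr * X.T @ (X @ theta - y)
    return theta

theta_early, theta_late = gd(50), gd(50_000)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]         # unregularised least squares
lam = 1.0 / (1e-3 * 50)                                  # heuristic: lambda ~ 1 / (lr * steps)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Early-stopped GD is shrunk toward zero, in the same ballpark as the matched ridge estimate;
# GD run (almost) to convergence recovers the unregularised fit.
print(np.linalg.norm(theta_early), np.linalg.norm(theta_ridge))
print(np.linalg.norm(theta_late), np.linalg.norm(theta_ols))
```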

Why it matters. Implicit regularisation is the leading explanation for why overparameterised networks generalise. Zhang et al. (2017) demonstrated that deep networks can perfectly memorise random labels yet still generalise on real data, which no explanation based purely on the capacity of the architecture can accommodate. The resolution is that on structured data, gradient-based optimisers find low-norm, large-margin, flat solutions; on random labels, no such solution exists, so optimisation settles on high-norm, sharp solutions that fit the training set but generalise poorly.

Open problems. A complete theory of implicit regularisation for deep networks remains elusive. Most rigorous results address linear models, single-hidden-layer networks, or homogeneous networks. The interaction of architecture, initialisation, optimiser, and learning-rate schedule produces a complex inductive bias that the field is still mapping out.

Related terms: Gradient Descent, Regularisation, Neural Tangent Kernel, Lottery Ticket Hypothesis, Double Descent, Convex Optimisation
