6.13 Implicit regularisation of SGD

A long-standing puzzle: why does plain SGD generalise so well? Two networks of the same architecture trained to the same training loss can have very different test losses, depending on which minimum the optimiser found. SGD's choice of minimum matters.

The mainstream explanation, popularised by Keskar et al. (2017) and going back to Hochreiter and Schmidhuber's (1997) work on flat minima, is that SGD has an implicit bias towards flat minima. A "flat" minimum is one where the loss surface is locally close to constant; small perturbations to the weights barely change the loss. A "sharp" minimum is one where the loss rises rapidly in some direction. Empirically, flat minima tend to generalise better, possibly because they correspond to functions that are stable to small perturbations (a Lipschitz-style argument; a related stability account of SGD's generalisation is given by Hardt, Recht, and Singer, 2016).
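One way to make "flat" operational is to perturb the weights and watch the loss. The numpy sketch below is a minimal illustration, not a method from the literature as such: `sharpness_proxy` and the two toy quadratics are invented for this example, and a real measurement would perturb a trained network's weights on real data.

```python
import numpy as np

def sharpness_proxy(loss_fn, weights, sigma=0.01, n_samples=100, seed=0):
    """Mean loss increase under Gaussian weight perturbations of scale
    sigma; larger values indicate a sharper minimum."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    increases = [
        loss_fn(weights + rng.normal(0.0, sigma, size=weights.shape)) - base
        for _ in range(n_samples)
    ]
    return float(np.mean(increases))

# Toy quadratic bowls with the same minimum value but different curvature.
flat_loss = lambda w: 0.5 * w @ w     # Hessian eigenvalues all 1
sharp_loss = lambda w: 50.0 * w @ w   # Hessian eigenvalues all 100
w_star = np.zeros(10)

print(sharpness_proxy(flat_loss, w_star))   # small: ~0.5 * 10 * sigma**2
print(sharpness_proxy(sharp_loss, w_star))  # roughly 100x larger
```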

SGD with mini-batches injects gradient noise whose variance scales inversely with the batch size: smaller batches inject more noise. The noise tends to drive the optimiser out of sharp minima, where small perturbations raise the loss, and towards flat ones. This is one explanation for why smaller batch sizes often generalise better with the number of epochs held fixed, and why large-batch training with linear learning-rate scaling (Goyal et al., 2017) needs a careful warm-up to recover small-batch performance.
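The 1/B scaling of the noise is easy to check numerically. The sketch below (a made-up linear-regression setup; none of it comes from the text) samples mini-batch gradients of a mean-squared-error loss at a fixed point in weight space; the total variance of the gradient roughly halves each time the batch size doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)
w = rng.normal(size=d)  # an arbitrary point in weight space

def minibatch_grad(batch_size):
    """Gradient of the MSE loss at w on a random mini-batch."""
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

for B in (16, 32, 64, 128):
    grads = np.stack([minibatch_grad(B) for _ in range(2_000)])
    # Total variance across gradient components: scales roughly as 1/B.
    print(B, grads.var(axis=0).sum())
```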

A second strand: SGD on overparameterised models converges to the minimum-norm interpolant, which is a good inductive bias for many problems. This is provably true for linear models, where gradient descent from a zero initialisation converges to the solution with the smallest ℓ2 norm among all interpolants, and approximately true for sufficiently wide networks via neural-tangent-kernel (NTK) arguments.
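For the linear case this is easy to verify directly. The sketch below (illustrative numpy, with made-up dimensions) runs full-batch gradient descent from a zero initialisation on an underdetermined least-squares problem; because every gradient lies in the row space of X, the iterates never leave it, and the run converges to the pseudoinverse solution, i.e. the minimum-norm interpolant. SGD from the same initialisation stays in the same row space and converges to the same point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                 # overparameterised: more weights than data
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                # zero initialisation is essential here
lr = 0.01
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n   # full-batch gradient step

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm interpolant
print(np.abs(X @ w - y).max())         # ~0: the fit interpolates the data
print(np.linalg.norm(w - w_min_norm))  # ~0: GD found the min-norm solution
```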

The practical implication: optimiser choice is part of your model. Adam tends to beat SGD on language models, plausibly because gradient magnitudes vary by orders of magnitude across parameters and Adam's per-parameter scaling absorbs that variance; SGD with momentum beats Adam on vision in many settings; AdamW beats Adam when there is meaningful weight decay, because it decouples the decay from the adaptive rescaling (see the sketch below). The choice is not just about speed of convergence; it is a bias towards solutions that generalise.
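To make the Adam-versus-AdamW distinction concrete, here is a minimal sketch of the two update rules, simplified from Kingma and Ba (2015) and Loshchilov and Hutter (2019); the function names and hyperparameter values are illustrative, not from the text. In Adam with L2 regularisation, the decay term is folded into the gradient and then rescaled by the adaptive denominator, so heavily-updated weights are decayed less; AdamW applies the decay directly to the weights.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=0.01):
    """Adam with L2 regularisation: decay enters through the gradient,
    so the adaptive denominator rescales it per parameter."""
    g = grad + wd * w                   # L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)        # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: decoupled weight decay, applied directly to the weights
    and untouched by the adaptive rescaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v
```

The only difference is where `wd * w` enters the update; that single line is what "decoupled" means.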
