Also known as: batch norm, BN
Batch Normalisation (BN), introduced by Ioffe and Szegedy in 2015, normalises the activations of a layer to have zero mean and unit variance within each mini-batch. For each neuron, BN computes the batch mean $\mu_B$ and variance $\sigma_B^2$ over the mini-batch, normalises as $\hat{x} = (x - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$, and then applies a learnable affine transformation $y = \gamma \hat{x} + \beta$. The learnable $\gamma$ and $\beta$ allow the network to recover the identity if that is optimal.
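The forward computation can be sketched in a few lines of NumPy (the function name and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then apply the affine transform.

    x: (batch, features); gamma, beta: (features,)
    """
    mu = x.mean(axis=0)                     # batch mean per feature
    var = x.var(axis=0)                     # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

# With gamma = 1 and beta = 0, the output is simply the normalised activations.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```

Note that the statistics are computed per feature across the batch dimension (axis 0), so every feature of the output has mean zero and unit variance over the mini-batch.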
The practical benefits are substantial. Networks trained with BN are far less sensitive to learning rate and weight initialisation, because normalisation constrains activation magnitudes regardless of how upstream parameters shift during training. This permits higher learning rates and faster convergence. BN also has a mild regularising effect: each example's normalised value depends on which other examples appear in its mini-batch, injecting stochastic noise that discourages overfitting.
At inference, BN switches to running averages of the training-time statistics, because single examples have no batch to normalise over. Failing to switch to eval mode is a common implementation bug. BN's dependence on batch statistics makes it ill-suited to very small batches or non-i.i.d. mini-batches, motivating alternatives: Layer Normalisation (normalises across features within each example, standard in transformers), Instance Normalisation (per-channel, popular in style transfer), and Group Normalisation (compromise between layer and instance norm).
Related terms: Regularisation, Dropout
Discussed in:
- Chapter 10: Training & Optimisation — Batch Normalisation
Also defined in: Textbook of AI