
Batch normalisation centres and rescales activations

Last reviewed 5 May 2026

Subtract the batch mean, divide by the batch standard deviation, scale and shift back.

From the chapter: Chapter 10: Training & Optimisation

Glossary: batch normalisation

Transcript

Inside a deep network, activations drift. Some neurons fire huge values, others tiny. Their distributions shift during training. This was called internal covariate shift.

Batch normalisation. After a layer's linear transform, before the activation, do this for each unit:

Compute the mean of this unit's activations across the current batch. Compute the variance across the current batch. Subtract the mean, then divide by the standard deviation, with a small epsilon added under the square root for numerical stability.

Now the unit's activations have mean zero, variance one, batch-wise.

Optionally scale by a learnable parameter gamma and shift by a learnable parameter beta. The network can recover the original distribution if it needs to.
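The steps above can be sketched as a minimal NumPy forward pass. This is an illustrative implementation, not a framework's exact one; the function name and the `(batch, features)` layout are assumptions for the sketch.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalisation for a (batch, features) activation matrix.

    Per unit: subtract the batch mean, divide by the batch standard
    deviation (eps added under the square root for stability), then
    scale by the learnable gamma and shift by the learnable beta.
    """
    mean = x.mean(axis=0)   # one mean per unit, across the batch
    var = x.var(axis=0)     # one variance per unit, across the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Activations with drifted scale: after normalising, each unit has
# mean ~0 and variance ~1 across the batch.
x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma = sqrt(var)` and `beta = mean`, the output reproduces the input, which is the sense in which the network "can recover the original distribution".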

At test time, use running estimates of the mean and variance accumulated during training. The forward pass is deterministic.
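One common way to accumulate those running estimates is an exponential moving average of the batch statistics. A sketch, assuming a `momentum` of 0.1 (the convention in, for example, PyTorch's BatchNorm; the function names here are illustrative):

```python
import numpy as np

def update_running_stats(running_mean, running_var,
                         batch_mean, batch_var, momentum=0.1):
    # Blend the running estimate toward each batch's statistics.
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

def batch_norm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Test-time forward pass: no batch statistics, fully deterministic.
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```

Because `batch_norm_eval` depends only on the frozen running estimates, the same input always produces the same output at inference.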

Why does it help? Faster training, larger learning rates allowed, less sensitivity to initialisation, and a mild regularisation effect.

Variants. Layer norm. Normalise across features within one example, not across the batch. The default in transformers.
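The only change from batch norm is the axis: normalise along the features of each example rather than down the batch. A minimal sketch (function name and layout assumed for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics per example, across the feature axis, so the result
    # does not depend on which other examples are in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

This also means layer norm behaves identically at train and test time; no running statistics are needed.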

RMSNorm, used in Llama. Drop the mean subtraction and divide only by the root mean square. Faster, with comparable performance.
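Dropping the mean subtraction leaves a one-line normaliser. A sketch (names and the epsilon value are illustrative assumptions):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # No centring: scale each example by the root mean square of its
    # features. Only a gamma scale remains; there is no beta shift.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

After this step each example's features have root mean square close to 1, even though their mean is untouched.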

Group norm and instance norm have specific use cases. The principle is the same: keep activations on a manageable scale.
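Group norm interpolates between the two extremes: a sketch over a flat `(batch, channels)` layout, which is a simplification since group norm is usually applied to image tensors with spatial axes. Setting `num_groups` equal to the channel count recovers an instance-norm-like per-channel case.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    # Split the channels into groups; normalise within each group,
    # per example. num_groups=1 behaves like layer norm here.
    b, c = x.shape
    xg = x.reshape(b, num_groups, c // num_groups)
    mean = xg.mean(axis=-1, keepdims=True)
    var = xg.var(axis=-1, keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return gamma * xg.reshape(b, c) + beta
```

Whatever the grouping, the effect is the one the transcript names: activations stay on a manageable scale.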

This site is currently in Beta. Contact: Chris Paton


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).