Glossary

Layer Normalisation

Layer normalisation, introduced by Jimmy Ba, Jamie Kiros and Geoffrey Hinton in 2016, normalises activations across features within each example, rather than across examples within each batch (as batch normalisation does). For a vector of activations $x \in \mathbb{R}^d$ within a single example,

$$\mu = \frac{1}{d} \sum_{i=1}^d x_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2$$

$$\mathrm{LayerNorm}(x)_i = \gamma_i \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$

where $\gamma, \beta \in \mathbb{R}^d$ are learnable affine parameters and $\epsilon \approx 10^{-5}$ ensures numerical stability.
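
As a concrete illustration of the formulas above, here is a minimal NumPy sketch (function and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each example over its feature dimension, then apply gamma and beta."""
    mu = x.mean(axis=-1, keepdims=True)        # per-example mean
    var = x.var(axis=-1, keepdims=True)        # per-example variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance per example
    return gamma * x_hat + beta                # learnable affine transform

d = 8
x = np.random.randn(4, d)                      # a batch of 4 examples
gamma, beta = np.ones(d), np.zeros(d)          # typical initialisation
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))                         # ≈ 0 for every example
print(y.var(axis=-1))                          # ≈ 1 for every example
```

In a framework such as PyTorch this corresponds to `torch.nn.LayerNorm(d)` applied over the last dimension.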

Layer norm has three advantages over batch norm:

- Independence from batch size: it works correctly even at batch size 1, which makes it useful for online learning and for step-by-step sequence inference (see the sketch below).
- Simpler distributed training: there is no need to synchronise batch statistics across GPUs.
- Identical train and inference behaviour: there are no running averages to maintain.
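
To illustrate the first point, here is a minimal NumPy sketch (the helper functions are illustrative, and the batch-norm version is simplified by omitting the affine parameters and running statistics): the layer-norm output for an example is the same whether it is processed alone or inside a larger batch, whereas the batch-norm output depends on the other examples and degenerates at batch size 1.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature dimension, independently per example.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Statistics over the batch dimension (simplified: no affine, no running stats).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch = np.random.randn(4, 8)
single = batch[:1]                          # the same example as a batch of one

# Layer norm: identical output regardless of what else is in the batch.
print(np.allclose(layer_norm(batch)[:1], layer_norm(single)))   # True

# Batch norm: the same example is normalised differently, and the
# batch-of-one output collapses to zeros because the batch variance is 0.
print(np.allclose(batch_norm(batch)[:1], batch_norm(single)))   # False
```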

These properties make layer norm the natural choice for Transformers, where each block applies a layer norm before or after every attention and feed-forward sub-layer. Pre-norm (norm before the sub-layer, popularised by GPT-2) trains more stably than post-norm (norm after the residual addition, the original Transformer formulation).
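
As a sketch of the two orderings, here is a simplified PyTorch residual block (module names and dimensions are illustrative; attention details and dropout are omitted):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: normalise the sub-layer input (GPT-2 style)."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))    # residual around norm + sub-layer

class PostNormBlock(nn.Module):
    """Post-norm: normalise after the residual addition (original Transformer)."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

# Wrap a feed-forward sub-layer both ways (illustrative sizes).
d_model = 64
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
x = torch.randn(2, 10, d_model)
print(PreNormBlock(d_model, ffn)(x).shape, PostNormBlock(d_model, ffn)(x).shape)
```

In the pre-norm arrangement the residual path carries the raw signal unchanged, which is one common explanation for its more stable training in deep stacks.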

RMSNorm (Zhang and Sennrich, 2019) drops the mean centring and uses only $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d} \sum_i x_i^2}$:

$$\mathrm{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\mathrm{RMS}(x)^2 + \epsilon}}.$$

RMSNorm is cheaper to compute (it skips the mean calculation), has no learnable bias, and matches LayerNorm's quality on most tasks. LLaMA, PaLM and most modern LLMs use RMSNorm.
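
A minimal NumPy sketch of RMSNorm as defined above (again, names are illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """Rescale by the root mean square of the features; no centring, no bias."""
    ms = np.mean(x ** 2, axis=-1, keepdims=True)   # RMS(x)^2 per example
    return gamma * x / np.sqrt(ms + eps)

d = 8
x = np.random.randn(3, d)
y = rms_norm(x, np.ones(d))
print(np.sqrt(np.mean(y ** 2, axis=-1)))           # ≈ 1: unit RMS per example
```

Compared with the layer-norm sketch earlier, the mean subtraction and the $\beta$ parameter disappear, which is where the saving comes from.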

Related terms: Batch Normalisation, Transformer, Geoffrey Hinton

