Glossary

Layer Normalisation

Layer normalisation, introduced by Jimmy Ba, Jamie Kiros and Geoffrey Hinton in 2016, normalises activations across features within each example, rather than across examples within each batch (as batch normalisation does). For a vector of activations $x \in \mathbb{R}^d$ within a single example,

$$\mu = \frac{1}{d} \sum_{i=1}^d x_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2$$

$$\mathrm{LayerNorm}(x)_i = \gamma_i \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$

where $\gamma, \beta \in \mathbb{R}^d$ are learnable affine parameters and $\epsilon \approx 10^{-5}$ ensures numerical stability.
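
As a concrete illustration of the formulas above, here is a minimal NumPy sketch (function and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each example over its feature dimension, then apply gamma and beta."""
    mu = x.mean(axis=-1, keepdims=True)        # per-example mean
    var = x.var(axis=-1, keepdims=True)        # per-example variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance per example
    return gamma * x_hat + beta                # learnable affine transform

d = 8
x = np.random.randn(4, d)                      # a batch of 4 examples
gamma, beta = np.ones(d), np.zeros(d)          # typical initialisation
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))                         # ≈ 0 for every example
print(y.var(axis=-1))                          # ≈ 1 for every example
```

In a framework such as PyTorch this corresponds to `torch.nn.LayerNorm(d)` applied over the last dimension.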

Layer norm has three advantages over batch norm:

- Independence from batch size: it works correctly even at batch size 1, which makes it useful for online learning and for step-by-step sequence inference (see the sketch below).
- Simpler distributed training: there is no need to synchronise batch statistics across GPUs.
- Identical train and inference behaviour: there are no running averages to maintain.
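
To illustrate the first point, here is a minimal NumPy sketch (the helper functions are illustrative, and the batch-norm version is simplified by omitting the affine parameters and running statistics): the layer-norm output for an example is the same whether it is processed alone or inside a larger batch, whereas the batch-norm output depends on the other examples and degenerates at batch size 1.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature dimension, independently per example.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Statistics over the batch dimension (simplified: no affine, no running stats).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch = np.random.randn(4, 8)
single = batch[:1]                          # the same example as a batch of one

# Layer norm: identical output regardless of what else is in the batch.
print(np.allclose(layer_norm(batch)[:1], layer_norm(single)))   # True

# Batch norm: the same example is normalised differently, and the
# batch-of-one output collapses to zeros because the batch variance is 0.
print(np.allclose(batch_norm(batch)[:1], batch_norm(single)))   # False
```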

These properties make layer norm the natural choice for Transformers, where each block applies a layer norm before or after every attention and feed-forward sub-layer. Pre-norm (norm before the sub-layer, popularised by GPT-2) trains more stably than post-norm (norm after the residual addition, the original Transformer formulation).
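
As a sketch of the two orderings, here is a simplified PyTorch residual block (module names and dimensions are illustrative; attention details and dropout are omitted):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: normalise the sub-layer input (GPT-2 style)."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))    # residual around norm + sub-layer

class PostNormBlock(nn.Module):
    """Post-norm: normalise after the residual addition (original Transformer)."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

# Wrap a feed-forward sub-layer both ways (illustrative sizes).
d_model = 64
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
x = torch.randn(2, 10, d_model)
print(PreNormBlock(d_model, ffn)(x).shape, PostNormBlock(d_model, ffn)(x).shape)
```

In the pre-norm arrangement the residual path carries the raw signal unchanged, which is one common explanation for its more stable training in deep stacks.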

RMSNorm (Zhang and Sennrich, 2019) drops the mean centring and uses only $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d} \sum_i x_i^2}$:

$$\mathrm{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\mathrm{RMS}(x)^2 + \epsilon}}.$$

RMSNorm is cheaper to compute (it skips the mean calculation), has no learnable bias, and matches LayerNorm's quality on most tasks. LLaMA, PaLM and most modern LLMs use RMSNorm.
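
A minimal NumPy sketch of RMSNorm as defined above (again, names are illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """Rescale by the root mean square of the features; no centring, no bias."""
    ms = np.mean(x ** 2, axis=-1, keepdims=True)   # RMS(x)^2 per example
    return gamma * x / np.sqrt(ms + eps)

d = 8
x = np.random.randn(3, d)
y = rms_norm(x, np.ones(d))
print(np.sqrt(np.mean(y ** 2, axis=-1)))           # ≈ 1: unit RMS per example
```

Compared with the layer-norm sketch earlier, the mean subtraction and the $\beta$ parameter disappear, which is where the saving comes from.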

Related terms: Batch Normalisation, Transformer, Geoffrey Hinton

