References

Root Mean Square Layer Normalization

Biao Zhang & Rico Sennrich (2019)

Advances in Neural Information Processing Systems 32.

URL: https://arxiv.org/abs/1910.07467

Abstract. Introduces RMSNorm, a simplified variant of LayerNorm that drops mean-centring and the bias parameter. It rescales activations by their root-mean-square magnitude, applies a learned per-feature scale, and stops there. The motivation is that LayerNorm's mean-centring is empirically less important than its rescaling, and removing it saves a reduction operation per layer. RMSNorm matches LayerNorm in quality while being slightly faster, and is used in nearly every frontier LLM, including LLaMA, PaLM, Qwen, Claude, and Gemini.
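The operation described above can be sketched in a few lines of numpy; this is an illustrative sketch, not the paper's reference implementation, and the `eps` stabiliser value is an assumption:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square over the feature axis,
    then apply a learned per-feature gain. Unlike LayerNorm there is
    no mean subtraction and no bias term."""
    # Single reduction: mean of squares over the last (feature) axis.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

# Toy activation vector with a learned gain initialised to ones.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, np.ones(4))
```

After normalisation the root-mean-square of each row is (up to `eps`) exactly 1, regardless of the input's mean, which is the property the rescaling argument relies on.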

Tags: transformers normalisation

Cited in:
