Noam Shazeer (2020). GLU Variants Improve Transformer.
arXiv:2002.05202.
URL: https://arxiv.org/abs/2002.05202
Abstract. A short ablation paper on gated linear units (GLU) in the Transformer feed-forward sub-layer. It replaces the standard two-layer MLP $\mathbf{W}_2\,\sigma(\mathbf{W}_1 \mathbf{x})$ with a gated variant $\mathbf{W}_2 \left(\sigma(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{V} \mathbf{x}\right)$ for several choices of $\sigma$. The Swish-activated variant, SwiGLU, consistently improves perplexity at matched parameter count (the hidden dimension is scaled to $\tfrac{2}{3} d_{\mathrm{ff}}$ to offset the extra matrix $\mathbf{V}$). SwiGLU has since become the default feed-forward layer in most frontier large language models: LLaMA, PaLM, and Qwen use it explicitly, and closed models such as Claude and Gemini are widely assumed to follow suit.
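A minimal PyTorch sketch of the gated feed-forward described above (not the paper's code; the class and argument names `SwiGLU`, `d_model`, `d_ff` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Sketch of a SwiGLU feed-forward block: W_2 (Swish(W_1 x) ⊙ V x)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Scale the hidden width to 2/3 of d_ff so the three matrices
        # (W_1, V, W_2) match the parameter count of the ungated two-matrix MLP.
        d_hidden = int(2 * d_ff / 3)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # activated branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # linear gate branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish_1 is PyTorch's SiLU; elementwise product implements the gate.
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```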
Tags: transformers architecture
Cited in: