Noam Shazeer (2020). GLU Variants Improve Transformer.
arXiv:2002.05202.
URL: https://arxiv.org/abs/2002.05202
Abstract. A short ablation paper on gated linear units (GLU) in the Transformer feed-forward sub-layer. It replaces the standard two-layer MLP $\mathbf{W}_2\,\sigma(\mathbf{W}_1 \mathbf{x})$ with a gated variant $\mathbf{W}_2 \left(\sigma(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{V} \mathbf{x}\right)$ for several choices of $\sigma$. The Swish-activated variant, SwiGLU, consistently improves perplexity at matched parameter count (the hidden dimension is scaled to $\tfrac{2}{3} d_{\mathrm{ff}}$ to offset the extra matrix $\mathbf{V}$). SwiGLU has since become the default feed-forward layer in most frontier large language models: LLaMA, PaLM, and Qwen use it explicitly, and closed models such as Claude and Gemini are widely assumed to follow suit.
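A minimal PyTorch sketch of the gated feed-forward described above (not the paper's code; the class and argument names `SwiGLU`, `d_model`, `d_ff` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Sketch of a SwiGLU feed-forward block: W_2 (Swish(W_1 x) ⊙ V x)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Scale the hidden width to 2/3 of d_ff so the three matrices
        # (W_1, V, W_2) match the parameter count of the ungated two-matrix MLP.
        d_hidden = int(2 * d_ff / 3)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # activated branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # linear gate branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish_1 is PyTorch's SiLU; elementwise product implements the gate.
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```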
Tags: transformers architecture
Cited in: