Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, & Tie-Yan Liu (2020)
On Layer Normalization in the Transformer Architecture.
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2002.04745
Abstract. Identifies the cause of the warm-up requirement in the original Transformer recipe and resolves it. Shows that the post-norm Transformer (LayerNorm applied after the residual addition, as in Vaswani et al. 2017) has large gradients near the output layers at initialisation, requiring careful learning-rate warm-up to avoid divergence. Pre-norm Transformers (LayerNorm applied inside each residual branch, to the sublayer's input, so the residual stream itself is never normalised) have well-behaved gradients at initialisation and train stably without warm-up. The pre-norm Transformer is the architecture used in nearly every modern large language model.
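The two placements differ only in where LayerNorm sits relative to the residual addition. A minimal numpy sketch (my own illustration, not from the paper; `sublayer` stands in for an attention or feed-forward module):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise the last dimension to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm (Vaswani et al. 2017): normalise AFTER the residual add,
    # so every layer re-normalises the residual stream.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm (this paper's recommendation): normalise the sublayer's
    # input; the residual stream passes through unnormalised.
    return x + sublayer(layer_norm(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 8))           # (batch, model_dim)
    W = rng.normal(size=(8, 8)) * 0.1     # toy linear sublayer
    f = lambda h: h @ W
    print(post_norm_block(x, f).shape)    # (2, 8)
    print(pre_norm_block(x, f).shape)     # (2, 8)
```

Because the pre-norm residual stream is an unnormalised sum of branch outputs, gradients reach early layers through an identity path, which is the mechanism behind the stable no-warm-up training the paper reports.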
Tags: transformers optimisation architecture
Cited in: