Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, & Tie-Yan Liu (2020)
On Layer Normalization in the Transformer Architecture.
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2002.04745
Abstract. Identifies the cause of the warm-up requirement in the original Transformer recipe and resolves it. Shows that the post-norm Transformer (LayerNorm applied after the residual addition, as in Vaswani et al. 2017) has large gradients near the output layers at initialisation, requiring careful learning-rate warm-up to avoid divergence. Pre-norm Transformers (LayerNorm applied inside each residual branch, to the sublayer's input, so the residual stream itself is never normalised) have well-behaved gradients at initialisation and train stably without warm-up. The pre-norm Transformer is the architecture used in nearly every modern large language model.
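The two placements differ only in where LayerNorm sits relative to the residual addition. A minimal numpy sketch (my own illustration, not from the paper; `sublayer` stands in for an attention or feed-forward module):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise the last dimension to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm (Vaswani et al. 2017): normalise AFTER the residual add,
    # so every layer re-normalises the residual stream.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm (this paper's recommendation): normalise the sublayer's
    # input; the residual stream passes through unnormalised.
    return x + sublayer(layer_norm(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 8))           # (batch, model_dim)
    W = rng.normal(size=(8, 8)) * 0.1     # toy linear sublayer
    f = lambda h: h @ W
    print(post_norm_block(x, f).shape)    # (2, 8)
    print(pre_norm_block(x, f).shape)     # (2, 8)
```

Because the pre-norm residual stream is an unnormalised sum of branch outputs, gradients reach early layers through an identity path, which is the mechanism behind the stable no-warm-up training the paper reports.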
Tags: transformers optimisation architecture
Cited in: