Ilya Loshchilov & Frank Hutter (2017)
arXiv.
DOI: https://doi.org/10.48550/arXiv.1711.05101
Abstract. Shows that the L2 penalty in Adam is not equivalent to weight decay: the penalty enters the gradient and is then divided by the adaptive second-moment term, so weights with large gradient magnitudes are regularised less than intended. Proposes AdamW, which decouples the decay by subtracting it directly from the parameters alongside the adaptive gradient step; AdamW has since become the standard optimiser for transformer training.
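For reference, a minimal NumPy sketch of the decoupled update (the paper's Algorithm 2 with the schedule multiplier folded into `lr`; the function name, variable names, and hyperparameter defaults here are illustrative, not taken from the paper):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One AdamW step; names and defaults are illustrative."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: weight_decay * theta is added to the *update*, not to
    # the gradient, so it never passes through the 1/sqrt(v_hat) scaling.
    # (Coupled L2 would instead do grad += weight_decay * theta above,
    # letting the adaptive scaling shrink the penalty for large-gradient weights.)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Tiny usage example: a few steps on f(theta) = theta^2 / 2, whose gradient is theta.
theta = np.array([1.0])
m = v = np.zeros_like(theta)
for t in range(1, 6):
    theta, m, v = adamw_step(theta, grad=theta.copy(), m=m, v=v, t=t)
print(theta)
```

In practice one would use a library implementation such as torch.optim.AdamW rather than hand-rolling the update; the sketch only isolates where the decay term sits relative to the adaptive scaling.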
Tags: optimisation adamw weight-decay
Cited in: