Ilya Loshchilov & Frank Hutter (2017)
arXiv.
DOI: https://doi.org/10.48550/arXiv.1711.05101
Abstract. Shows that the L2 penalty in Adam is not equivalent to weight decay: the penalty enters the gradient and is then divided by the adaptive second-moment term, so weights with large gradient magnitudes are regularised less than intended. Proposes AdamW, which decouples the decay by subtracting it directly from the parameters alongside the adaptive gradient step; AdamW has since become the standard optimiser for transformer training.
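For reference, a minimal NumPy sketch of the decoupled update (the paper's Algorithm 2 with the schedule multiplier folded into `lr`; the function name, variable names, and hyperparameter defaults here are illustrative, not taken from the paper):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One AdamW step; names and defaults are illustrative."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: weight_decay * theta is added to the *update*, not to
    # the gradient, so it never passes through the 1/sqrt(v_hat) scaling.
    # (Coupled L2 would instead do grad += weight_decay * theta above,
    # letting the adaptive scaling shrink the penalty for large-gradient weights.)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Tiny usage example: a few steps on f(theta) = theta^2 / 2, whose gradient is theta.
theta = np.array([1.0])
m = v = np.zeros_like(theta)
for t in range(1, 6):
    theta, m, v = adamw_step(theta, grad=theta.copy(), m=m, v=v, t=t)
print(theta)
```

In practice one would use a library implementation such as torch.optim.AdamW rather than hand-rolling the update; the sketch only isolates where the decay term sits relative to the adaptive scaling.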
Tags: optimisation adamw weight-decay
Cited in: