References

Decoupled Weight Decay Regularization

Ilya Loshchilov & Frank Hutter (2017)

arXiv.

DOI: https://doi.org/10.48550/arXiv.1711.05101

Abstract. Shows that the L2 penalty in Adam interacts badly with adaptive gradient scaling, and proposes AdamW, which applies weight decay directly to the parameters after the adaptive update. AdamW has become the standard optimiser for transformer training.
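The decoupling described above can be sketched as a single optimiser step. This is a minimal illustrative sketch, not the paper's reference implementation; all names and hyperparameter defaults here are assumptions chosen for clarity:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step (illustrative sketch).

    The key point from the paper: the decay term is applied directly
    to the parameters, outside the gradient-based adaptive update,
    instead of being folded into the gradient as an L2 penalty.
    """
    # Standard Adam moment updates from the gradient only
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    # Adaptive update (no L2 term mixed into grad)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled weight decay: shrink parameters directly
    theta = theta - lr * weight_decay * theta
    return theta, m, v
```

Because the decay bypasses the adaptive denominator, every parameter is shrunk at the same relative rate, which is what restores the regularising effect that the coupled L2 version loses.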

Tags: optimisation adamw weight-decay

Cited in:
