Weight Decay is a regularisation technique that penalises large weights by adding a term proportional to the squared norm of the weights to the loss function: $L_{reg} = L + (\lambda/2) \sum_i w_i^2$. The gradient of this penalty is simply $\lambda \mathbf{w}$, which at every update step pulls each weight slightly toward zero, hence the name. Under plain (stochastic) gradient descent, weight decay is mathematically equivalent to L2 regularisation; it also has a Bayesian interpretation as a zero-mean Gaussian prior on the weights.
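The equivalence under plain SGD can be checked in a few lines. This is a minimal sketch with made-up gradient values, not a full training loop: one step of SGD on the L2-penalised loss is compared against one plain gradient step followed by a separate decay term of $\eta\lambda\mathbf{w}$.

```python
import numpy as np

# Illustrative values; grad_loss stands in for the gradient of the
# unregularised loss L at the current weights.
w = np.array([1.0, -2.0, 0.5])
grad_loss = np.array([0.1, 0.1, 0.1])
lam, lr = 0.01, 0.1

# SGD on the regularised loss: w <- w - lr * (dL/dw + lam * w),
# since the gradient of (lam/2) * sum(w_i^2) is lam * w.
w_l2 = w - lr * (grad_loss + lam * w)

# Decoupled weight decay: take the plain gradient step, then
# shrink the weights by lr * lam * w in a separate term.
w_wd = (w - lr * grad_loss) - lr * lam * w

print(np.allclose(w_l2, w_wd))  # True: identical under plain SGD
```

For plain SGD the two updates coincide exactly, which is why the terms "weight decay" and "L2 regularisation" are often used interchangeably; the distinction only matters once the gradient is rescaled, as with adaptive optimisers.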
In practice, weight decay is one of the oldest and most widely used regularisers. It discourages the network from relying on any single large weight, promoting smoother, more distributed representations. Typical values of $\lambda$ range from $10^{-4}$ to $10^{-2}$, chosen by cross-validation or by reference to published recipes for similar architectures.
A subtle point concerns the interaction with adaptive optimisers. In the original Adam, L2 regularisation is implemented by adding the penalty to the loss, so the regularisation gradient gets scaled by the adaptive denominator—weakening the regularisation effect. AdamW, proposed by Loshchilov and Hutter (2019), decouples weight decay from the adaptive update: it applies weight decay directly to the parameters after the Adam step. This decoupled formulation produces better generalisation and has become the de facto standard for training transformers and large language models.
Related terms: Regularisation, Adam
Discussed in:
- Chapter 10: Training & Optimisation — Regularisation in Deep Learning
Also defined in: Textbook of AI