Weight Decay is a regularisation technique that penalises large weights by adding a term proportional to the squared norm of the parameters to the loss function:
$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 = \mathcal{L}(\mathbf{w}) + \frac{\lambda}{2} \sum_i w_i^2.$$
The gradient of the penalty is simply $\lambda \mathbf{w}$, so at each gradient-descent step every weight, in addition to following the usual loss gradient, is shrunk toward zero by an amount proportional to its own size, hence the name decay. With learning rate $\eta$, the shrinkage factor per step is $1 - \eta\lambda$ and the update becomes
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla \mathcal{L}(\mathbf{w}) - \eta \lambda \mathbf{w} = (1 - \eta \lambda)\,\mathbf{w} - \eta \nabla \mathcal{L}(\mathbf{w}).$$
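For plain SGD without momentum or adaptive scaling, the multiplicative form on the right (shrink the weights by $1 - \eta\lambda$, then take the ordinary gradient step) is therefore exactly equivalent to $L_2$ regularisation of the loss. There is also a Bayesian interpretation: if $\mathcal{L}$ is a negative log-likelihood, weight decay corresponds to a Gaussian prior $\mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$ on the weights, and the regularised loss is the negative log posterior up to an additive constant, so minimising it yields the MAP estimate.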
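The equivalence for plain SGD can be checked numerically. The following is a minimal sketch in pure Python; the function names, the weights, and the gradient values are made up for illustration and do not come from any library.

```python
def sgd_l2_step(w, grad, lr, lam):
    """One SGD step on the regularised loss: lam * w is added to the loss gradient."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]


def sgd_decay_step(w, grad, lr, lam):
    """The same step written as multiplicative decay plus the plain gradient step."""
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]


w = [0.5, -2.0, 3.0]
grad = [0.1, -0.4, 0.0]  # hypothetical loss gradient evaluated at w
lr, lam = 0.1, 0.01

a = sgd_l2_step(w, grad, lr, lam)
b = sgd_decay_step(w, grad, lr, lam)

# The two forms agree up to floating-point rounding.
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(max_diff)
```

The third weight has zero loss gradient, so its update is pure decay: it is multiplied by $1 - \eta\lambda = 0.999$.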
Why it works
Weight decay is one of the oldest and most widely used regularisers in machine learning. In neural networks it dates to the late 1980s and was analysed by Krogh & Hertz (1992); in statistics the same penalty appeared much earlier as ridge regression (Hoerl & Kennard, 1970). It works because:
- It discourages reliance on any single large weight. Predictions become smoother, more distributed, and less brittle to small input perturbations.
- It improves conditioning of the optimisation problem. The penalty adds $\lambda \mathbf{I}$ to the Hessian, shifting every eigenvalue up by $\lambda$; when the Hessian of the loss is positive semi-definite, this reduces the condition number and stabilises training.
- It tightens generalisation bounds. Norm-based generalisation bounds (Bartlett, 1998; Neyshabur et al., 2015) are sharper for low-norm hypotheses, formalising the intuition.
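The conditioning point in the list above can be illustrated with a toy spectrum. This sketch assumes a positive-definite Hessian described only by its eigenvalues; the eigenvalues and $\lambda$ are hypothetical numbers chosen to make the effect visible, not values from any real model.

```python
def condition_number(eigenvalues):
    """Ratio of largest to smallest eigenvalue (all assumed positive)."""
    return max(eigenvalues) / min(eigenvalues)


hessian_eigs = [100.0, 1.0, 0.01]  # hypothetical, badly conditioned spectrum
lam = 0.1

# Adding lam * I to a symmetric matrix shifts each eigenvalue by lam.
shifted_eigs = [e + lam for e in hessian_eigs]

kappa_before = condition_number(hessian_eigs)
kappa_after = condition_number(shifted_eigs)
print(kappa_before, kappa_after)
```

The smallest eigenvalue gains the most in relative terms, which is why even a small $\lambda$ can cut the condition number by an order of magnitude in this example.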
Typical values of $\lambda$ fall in $[10^{-6}, 10^{-2}]$, chosen by cross-validation or by reference to recipes for similar architectures. The setting interacts with batch size, learning rate, and training duration in non-obvious ways; in modern practice, the hyperparameter that is actually tuned is often the effective decay $\eta\lambda$ rather than $\lambda$ alone.
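A short calculation suggests why the product $\eta\lambda$ is the natural quantity, at least for the plain SGD update given earlier. Assume, purely for illustration, a weight whose loss gradient is zero. Repeating the update for $T$ steps gives
$$w_T = (1 - \eta\lambda)^T\, w_0 \approx e^{-\eta\lambda T}\, w_0 \quad (\eta\lambda \ll 1),$$
so the weight halves after roughly $\ln 2 / (\eta\lambda)$ steps. With the illustrative values $\eta = 10^{-3}$ and $\lambda = 10^{-2}$, $\eta\lambda = 10^{-5}$ and the half-life is about $6.9 \times 10^{4}$ steps. Halving $\eta$ while doubling $\lambda$ leaves this time scale unchanged, which is also why the appropriate $\lambda$ depends on training duration.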
AdamW: decoupled weight decay
A subtle issue arose with the rise of adaptive optimisers such as Adam. With Adam, $L_2$ regularisation is conventionally implemented by adding $\lambda \mathbf{w}$ to the gradient before the adaptive moment estimates are computed. The regularisation term is therefore divided, along with the rest of the gradient, by the per-parameter denominator $\sqrt{\hat v_t} + \varepsilon$. This weakens the regularisation for parameters with large historical gradients, exactly the parameters one might most want to keep small.
AdamW, proposed by Loshchilov and Hutter ("Decoupled Weight Decay Regularization", ICLR 2019), decouples weight decay from the adaptive update by applying it directly to the parameters, outside the adaptive rescaling:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} - \eta \lambda \mathbf{w}_t.$$
Loshchilov and Hutter reported that the decoupled formulation generalises better than Adam with $L_2$ regularisation on image-classification benchmarks and makes the best weight-decay setting less dependent on the learning rate. AdamW has since become a standard default optimiser for training Transformers and large language models. The lesson, that weight decay and $L_2$ regularisation are not equivalent under adaptive optimisers, is one of the cleanest examples of how subtle implementation details matter in deep learning.
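The difference between the two formulations can be seen on a toy problem. The sketch below is pure Python and trains a single scalar weight. The function `run`, the alternating-sign gradient, and all hyperparameter values are assumptions chosen for illustration; this is not the reference implementation from the paper or from any library.

```python
import math


def run(decoupled, grad_scale, steps=200, lr=0.01, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8, w0=1.0):
    """Train one scalar weight with Adam and return its final value.

    decoupled=False: lam * w is added to the gradient (Adam with L2).
    decoupled=True:  lam * w is applied outside the adaptive step (AdamW-style).
    """
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Hypothetical loss gradient: zero mean, alternating sign, fixed magnitude.
        g = grad_scale if t % 2 == 1 else -grad_scale
        if not decoupled:
            g += lam * w  # penalty enters the moment estimates
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            step += lr * lam * w  # decay bypasses the adaptive denominator
        w -= step
    return w


l2_small = run(decoupled=False, grad_scale=0.1)
l2_large = run(decoupled=False, grad_scale=10.0)
adamw_small = run(decoupled=True, grad_scale=0.1)
adamw_large = run(decoupled=True, grad_scale=10.0)

print("Adam + L2:", l2_small, l2_large)
print("AdamW:    ", adamw_small, adamw_large)
```

Under these assumptions, Adam with $L_2$ shrinks the weight with small gradients far more than the weight with large gradients, because the penalty is divided by $\sqrt{\hat v_t}$. The decoupled variant shrinks both weights by essentially the same amount, since the decay term does not pass through the adaptive denominator.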
Related terms: Regularisation, Adam, Stochastic Gradient Descent, Overfitting
Discussed in:
- Chapter 6: ML Fundamentals, Regularisation