10.10 Gradient clipping and gradient noise
Even with a careful learning-rate schedule, individual gradients can be very large. RNNs are notorious for exploding gradients, and Transformers occasionally produce loss spikes from a single bad batch. Gradient clipping caps the gradient magnitude before the optimizer step.
Norm clipping
Compute the global gradient norm $\|g\|_2$. If it exceeds a threshold $c$, rescale:
$$g \leftarrow g \cdot \min\!\left(1, \frac{c}{\|g\|_2}\right).$$
This preserves the gradient's direction and only attenuates its magnitude. Pascanu, Mikolov, and Bengio (2013) introduced this technique for RNNs to control the exploding gradient problem in backpropagation through time (BPTT).
Typical thresholds: $c = 1.0$ for Transformers, $c = 5.0$ for RNNs. The right value depends on architecture; monitor the actual gradient norms during a pilot run and set $c$ slightly above the typical max.
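A minimal sketch in PyTorch, using the built-in `clip_grad_norm_`, which implements exactly the rescaling above and returns the pre-clipping norm (the toy model, data, and threshold here are placeholders):

```python
import torch
import torch.nn as nn

# Toy model and batch; any model is handled the same way.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients so the global L2 norm is at most c = 1.0.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```

Logging `grad_norm` over a pilot run gives exactly the distribution of gradient norms needed to choose $c$.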
Value clipping
Clip each element individually: $g_i \leftarrow \mathrm{clip}(g_i, -c, c)$. This is more aggressive but distorts the gradient direction, so norm clipping is generally preferred.
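In PyTorch this corresponds to the built-in `clip_grad_value_`; a minimal sketch with the same placeholder model as above:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = nn.functional.mse_loss(model(torch.randn(32, 10)), torch.randn(32, 1))
loss.backward()

# Clamp every gradient element into [-1.0, 1.0]. Unlike norm clipping,
# this changes the gradient's direction whenever any element is clipped.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```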
Per-layer or per-parameter clipping
Some variants clip per layer or even per parameter tensor. This is useful when gradient magnitudes vary widely across the network, as is common in very deep models; a sketch follows below.
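There is no single canonical implementation of this variant, so here is one possible per-parameter-tensor version, assuming the same norm-clipping rule applied tensor by tensor (the function name is a placeholder):

```python
import torch

def clip_grad_norm_per_tensor(parameters, c: float) -> None:
    """Apply g <- g * min(1, c / ||g||) to each parameter tensor
    independently, rather than to the concatenated global gradient."""
    for p in parameters:
        if p.grad is None:
            continue
        norm = p.grad.norm(2)
        if norm > c:
            p.grad.mul_(c / norm)
```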
Gradient noise
Neelakantan et al. (2015) suggested adding Gaussian noise to gradients: $g \leftarrow g + \mathcal{N}(0, \sigma_t^2 I)$, with the variance annealed as $\sigma_t^2 = \eta/(1+t)^\gamma$ (they report $\gamma = 0.55$). This is the analogue of simulated annealing: extra noise helps escape sharp local minima. It is rarely used in modern practice, since the implicit noise of SGD plus dropout typically suffices, but it is worth knowing about.
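A minimal sketch of the annealed-noise step, assuming the schedule above; the `eta` and `gamma` defaults follow values reported in the paper, and the function name is a placeholder:

```python
import torch

def add_gradient_noise(parameters, t: int,
                       eta: float = 0.3, gamma: float = 0.55) -> None:
    """Add N(0, sigma_t^2) noise to every gradient, with
    sigma_t^2 = eta / (1 + t)**gamma (Neelakantan et al., 2015)."""
    sigma = (eta / (1 + t) ** gamma) ** 0.5
    for p in parameters:
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad), alpha=sigma)
```

Call it between `loss.backward()` and `optimizer.step()`, incrementing `t` once per update.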