9.11 Vanishing and exploding gradients
Depth amplifies whatever scaling errors propagate from layer to layer, and it does so multiplicatively. Each layer of a deep network passes its gradient back to the layer beneath it through a Jacobian, a matrix of partial derivatives, and the gradient that finally reaches the first layer is essentially the product of all those Jacobians stacked end to end. Products are unforgiving. If the typical Jacobian shrinks the gradient by a factor a little smaller than one, the product over twenty or fifty layers is essentially zero, and the early layers stop receiving any meaningful learning signal at all. They sit frozen at their random initial values while the layers near the output do all the work. If the typical Jacobian instead expands the gradient by a factor a little larger than one, the product explodes: gradients become so large that a single optimisation step blows the weights to infinity, and the loss prints NaN within a handful of iterations. Either way, depth turns small per-layer errors into catastrophic global ones, and the network refuses to learn. This was the central technical obstacle to deep learning before roughly 2012, and almost every architectural innovation of the last decade (ReLU, careful initialisation, batch and layer normalisation, residual connections, gated recurrent cells, gradient clipping) exists in some sense to defeat it.
The chapter has been building towards this point. Section 9.10 (initialisation) chose the variance of the initial weights at each layer so that activations have a sensible scale on the very first forward pass. The current section shows what depth then does to that scale during the backward pass: it compounds, layer after layer, until the gradient at the input is many orders of magnitude smaller (or larger) than the gradient at the output. Sections 9.12 (regularisation) and 9.13 (normalisation) cover the two main fixes that operate on the activations themselves. The remainder of the chapter, and Chapter 11 onwards, show how modern architectures (ResNets, LSTMs, transformers) sidestep the problem at the level of the network structure rather than just the optimiser.
A simple model of the problem
Consider an $L$-layer feed-forward network. The forward pass at layer $\ell$ produces a pre-activation $\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}$ and an activation $\mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})$. To update a weight in the first layer we need $\partial \mathcal{L} / \partial \mathbf{W}^{(1)}$. By the chain rule, that gradient is a long product:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \;=\; \underbrace{\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}}}_{\text{output term}} \cdot \underbrace{\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}} \cdot \frac{\partial \mathbf{a}^{(L-1)}}{\partial \mathbf{a}^{(L-2)}} \cdots \frac{\partial \mathbf{a}^{(2)}}{\partial \mathbf{a}^{(1)}}}_{L-1 \text{ Jacobian factors}} \cdot \underbrace{\frac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{W}^{(1)}}}_{\text{local}}$$
Each Jacobian $\partial \mathbf{a}^{(\ell+1)} / \partial \mathbf{a}^{(\ell)}$ is a matrix whose entries depend on the weights $\mathbf{W}^{(\ell+1)}$ and the activation derivatives $\sigma'(\mathbf{z}^{(\ell+1)})$. To see what depth does, replace each Jacobian with a single scalar magnitude $\alpha$: the typical per-layer gain, the factor by which one layer scales the gradient passing through it. Then the gradient at the first layer scales as $\alpha^{L-1}$ relative to the gradient at the output.
Now plug in numbers. Suppose $\alpha \approx 0.5$, which is what you get from a sigmoid network with sensibly initialised weights. In a 20-layer network the gradient that reaches the first layer is
$$0.5^{20} \;=\; \frac{1}{1\,048\,576} \;\approx\; 9.5 \times 10^{-7}$$
That is roughly one-millionth of the output gradient. The output layer gets a healthy push, the first layer gets nothing, and the first layer effectively never updates. This is vanishing gradient in its purest form.
Now suppose $\alpha \approx 1.5$, which is what you get from sigmoid (or any other activation) when the initial weight variance is too large. In a 20-layer network the gradient at the first layer scales as
$$1.5^{20} \;\approx\; 3325$$
The first layer receives gradients three thousand times bigger than the output layer. One step of gradient descent with a normal learning rate moves the early weights by a colossal amount, the next forward pass produces wildly oversized activations, the next backward pass produces even larger gradients, and within a few iterations the weights overflow to infinity. The loss prints NaN. This is exploding gradient.
The exponential is what makes both failure modes so brutal. With $\alpha = 0.99$, only one per cent below unity, a 1000-layer network still suffers $0.99^{1000} \approx 4 \times 10^{-5}$, well into vanishing territory. With $\alpha = 1.01$, one per cent above unity, the same network has $1.01^{1000} \approx 21\,000$, fully exploded. The window of "stable" $\alpha$ is razor-thin in deep networks, and small biases in the per-layer multiplier translate into enormous biases in the global gradient. Worse, the per-layer multiplier depends on the weights and on the input data and on the activation function, so it is not under direct control: you have to engineer it indirectly via initialisation, activation choice, and architecture.
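The arithmetic is easy to reproduce. The short NumPy sketch below (an illustration, not anything prescribed by this chapter) raises the per-layer gain to the power of the depth for the values quoted above, and then repeats the experiment with actual matrices: random orthogonal Jacobians scaled so that every singular value equals $s$, which makes the product's norm exactly $s^{\text{depth}}$.

```python
import numpy as np

# Scalar model: the gradient reaching layer 1 scales as alpha ** (L - 1).
for alpha, L in [(0.5, 20), (1.5, 20), (0.99, 1000), (1.01, 1000)]:
    print(f"alpha = {alpha:5.2f}, L = {L:4d}: gradient scale ~ {alpha ** (L - 1):.3e}")

# Matrix analogue: multiply 50 random orthogonal Jacobians, each scaled so
# that all of its singular values equal s, and measure the product's norm.
rng = np.random.default_rng(0)

def product_norm(s, depth=50, width=64):
    J = np.eye(width)
    for _ in range(depth):
        Q, _ = np.linalg.qr(rng.standard_normal((width, width)))
        J = (s * Q) @ J
    return np.linalg.norm(J, 2)

for s in (0.95, 1.00, 1.05):
    print(f"per-layer singular values {s}: ||product of 50 Jacobians|| = {product_norm(s):.3e}")
```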
Why sigmoids and tanh are particularly bad
The classical activation functions of the 1980s and 1990s, the logistic sigmoid and the hyperbolic tangent, make the situation much worse than the simple-multiplier analysis suggests, because their derivatives shrink even when the network is otherwise well behaved.
The logistic sigmoid is $\sigma(z) = 1 / (1 + e^{-z})$. Its derivative is
$$\sigma'(z) \;=\; \sigma(z)\,(1 - \sigma(z))$$
This derivative reaches its maximum at $z = 0$, where $\sigma(0) = 0.5$ and so $\sigma'(0) = 0.5 \times 0.5 = 0.25$. Anywhere else the derivative is smaller. For $|z| > 5$ the sigmoid is essentially flat: $\sigma(5) \approx 0.9933$, so $\sigma'(5) \approx 0.9933 \times 0.0067 \approx 0.0066$. A saturated sigmoid passes through almost no gradient at all.
In a network with sigmoid activations, every backward step through a layer multiplies the gradient by a factor of at most $0.25$, and typically much less, because most pre-activations are not at exactly zero. After ten layers the gradient is multiplied by at most
$$0.25^{10} \;=\; 4^{-10} \;\approx\; 9.5 \times 10^{-7}$$
That is a millionfold attenuation, before we have even accounted for the weight matrix itself. This is the activation-function contribution to vanishing gradients, and it is the main reason sigmoid networks could not be made deep before ReLU.
The hyperbolic tangent $\tanh$ is a little better. Its derivative $\tanh'(z) = 1 - \tanh^2(z)$ has maximum value $1$ at $z = 0$, so an idealised tanh layer with pre-activations centred on zero passes the gradient through unattenuated. But tanh saturates just as quickly as sigmoid: by $|z| = 3$ the derivative is below $0.01$. In practice tanh networks vanish almost as badly as sigmoid networks once depth exceeds about ten layers.
ReLU breaks the pattern. Its derivative is
$$\mathrm{ReLU}'(z) \;=\; \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$
About half the units are off for any given input and pass no gradient at all, but the units that are active pass the gradient through unscaled. A 50-layer ReLU stack does not multiply gradients by $0.25^{50}$; instead it routes them through whatever subset of units happens to be active, with each active unit contributing a factor of exactly one. The activation function itself stops attenuating, and the only remaining source of vanishing is the weight matrices, which careful initialisation can address. This is the single most important reason ReLU revolutionised deep learning around 2011–2012: it removes the activation function from the list of things multiplicatively shrinking the gradient.
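A quick numeric check of these derivative formulas, in plain NumPy (an illustrative sketch; the $z$ values are chosen to match the ones quoted above):

```python
import numpy as np

def d_sigmoid(z):                    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):                       # tanh'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):                       # ReLU'(z) = 1 if z > 0 else 0
    return 1.0 if z > 0 else 0.0

print("   z   sigmoid'    tanh'    ReLU'")
for z in (0.0, 0.7, 2.5, 5.0):
    print(f"{z:5.1f}  {d_sigmoid(z):8.4f}  {d_tanh(z):7.4f}  {d_relu(z):6.1f}")

# Ten layers at the sigmoid's best possible operating point, z = 0:
print("sigmoid, ten layers at z = 0:", d_sigmoid(0.0) ** 10)   # ~9.5e-7
```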
Concrete worked example: gradient through a 5-layer sigmoid stack
Consider a tiny network with $L = 5$ fully connected layers, each of width 4, all using sigmoid activations. A single training input $\mathbf{x}$ flows through the stack and produces a scalar loss. We will track how the gradient magnitude shrinks layer by layer under three different initialisation schemes.
Scheme A, naïve initialisation, $\mathcal{N}(0, 1)$. Weights are drawn from a standard normal. Pre-activations $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x}$ are sums of four products of unit-variance terms, so their standard deviation is roughly $2$. Most entries of $\mathbf{z}^{(1)}$ are well outside $[-1, 1]$; a typical entry might be $z = 2.5$, where $\sigma(2.5) \approx 0.924$ and $\sigma'(2.5) \approx 0.070$. The sigmoid is saturating already at layer 1. By layer 2 the activations $\mathbf{a}^{(1)}$ are mostly close to 0 or 1, the next pre-activation has even larger spread, and $\sigma'$ collapses below $0.01$. Continuing through five layers, the per-layer derivative multipliers might look like $0.07,\,0.005,\,0.002,\,0.001,\,0.0008$, and the cumulative product of activation derivatives along the chain is around $5.6 \times 10^{-13}$. The gradient at layer 1 is essentially zero relative to the loss-side gradient. This is the regime where sigmoid networks "look broken": they train layer 5 fine and leave layer 1 at its random initial values forever.
Scheme B, Xavier (Glorot) initialisation, $\mathcal{N}(0, 1/n_{\text{in}})$. Weights are scaled so that the variance of $\mathbf{z}^{(\ell)}$ stays close to one across layers. With width 4, weights are drawn from $\mathcal{N}(0, 0.25)$. Pre-activations now have standard deviation around $1$, so a typical entry is $z \approx 0.7$, where $\sigma'(0.7) \approx 0.225$. The sigmoid is operating in its most linear region, and the derivative multiplier per layer is close to its theoretical maximum of $0.25$. After five layers the cumulative activation-derivative product is approximately
$$0.225^5 \;\approx\; 5.8 \times 10^{-4}$$
That is a roughly 1700-fold attenuation, which is vastly better than $5.6 \times 10^{-13}$ but still a serious loss of signal. Even with Xavier initialisation, sigmoid networks become hard to train past about ten layers.
Scheme C, He initialisation with ReLU activation, $\mathcal{N}(0, 2/n_{\text{in}})$. Now we replace sigmoid with ReLU and use He initialisation, $\mathcal{N}(0, 0.5)$. The ReLU derivative is exactly $1$ for active units and exactly $0$ for inactive ones. About half the units in each layer are active, so the expected gradient multiplier per layer due to the activation function is $0.5$, but the half that pass the gradient pass it unscaled. Combined with He initialisation, which targets unit variance for the active half, the effective multiplier is close to $1$ in expectation. After five layers the cumulative gradient retains most of its magnitude.
The three schemes side by side, showing the cumulative gradient multiplier that reaches each layer (normalised to $1$ at the output side):
| Layer | Sigmoid + $\mathcal{N}(0,1)$ | Sigmoid + Xavier | ReLU + He |
|---|---|---|---|
| 5 (output side) | $1$ | $1$ | $1$ |
| 4 | $0.07$ | $0.22$ | $\sim 1$ |
| 3 | $4 \times 10^{-4}$ | $0.05$ | $\sim 1$ |
| 2 | $1 \times 10^{-6}$ | $0.011$ | $\sim 1$ |
| 1 | $1 \times 10^{-9}$ | $2.5 \times 10^{-3}$ | $\sim 1$ |
In the worst-case sigmoid scheme, the gradient at the input layer is roughly nine orders of magnitude smaller than the gradient at the output, about $10^{-9}$ versus $1$. That is the textbook vanishing gradient. With Xavier the gap closes to about three orders of magnitude, which is bad but trainable for shallow networks. With ReLU plus He the gap is essentially closed. The numbers here are illustrative (actual values depend on the input data and the specific weights), but the pattern is robust across real networks: changing the activation and the initialisation moves the ratio of front-end to back-end gradient magnitudes from "trainable" to "not".
The arithmetic also shows why ReLU plus He initialisation became the default for feed-forward networks almost overnight after the relevant papers (Glorot and Bengio 2010; He et al. 2015): the only remaining source of gradient attenuation is the matrix factor, and that is controllable.
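A few lines of PyTorch reproduce the flavour of this experiment: build the five-layer, width-4 stack under each scheme, push a single random input through, and print the gradient norm that each weight matrix receives. This is a rough single-sample probe under assumed details (the loss is an arbitrary sum, the input is random), so the exact numbers differ from the table above, but the orders-of-magnitude pattern is the point.

```python
import torch

torch.manual_seed(0)
L, width = 5, 4

def layer_gradient_norms(init_std, activation):
    """Gradient norm at each of the L weight matrices for one random input."""
    weights = [(torch.randn(width, width) * init_std).requires_grad_(True)
               for _ in range(L)]
    a = torch.randn(width)
    for W in weights:
        a = activation(W @ a)
    a.sum().backward()                        # arbitrary scalar "loss"
    return [W.grad.norm().item() for W in weights]

schemes = {
    "sigmoid + N(0, 1)": (1.0,                torch.sigmoid),
    "sigmoid + Xavier":  ((1 / width) ** 0.5, torch.sigmoid),
    "ReLU + He":         ((2 / width) ** 0.5, torch.relu),
}
for name, (std, act) in schemes.items():
    print(f"{name:18s}", ["%.1e" % n for n in layer_gradient_norms(std, act)])
```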
Exploding gradients in RNNs
Recurrent networks face an extreme version of the same problem because they apply the same weight matrix at every time step. A vanilla RNN updates its hidden state via
$$\mathbf{h}_t \;=\; \tanh(\mathbf{W} \mathbf{h}_{t-1} + \mathbf{U} \mathbf{x}_t + \mathbf{b})$$
When we back-propagate through time over $T$ steps, the gradient with respect to early time steps involves $T$ copies of the same Jacobian, essentially $\mathbf{W}$ multiplied by itself $T$ times (with activation derivatives sandwiched in). The crucial number is the spectral radius $\rho$ of $\mathbf{W}$: the largest eigenvalue magnitude, which equals the largest singular value when $\mathbf{W}$ is symmetric. To leading order, the gradient through $T$ time steps grows or shrinks as $\rho^T$.
Take $T = 100$ time steps, a modest sentence length.
If $\rho = 1.1$, the recurrent matrix is just slightly expansive,
$$1.1^{100} \;\approx\; 13\,780$$
The gradient at $t = 1$ is about fourteen thousand times the gradient at $t = 100$. One backward pass produces enormous weight updates, the next forward pass overflows, and training crashes. This is exploding gradient in RNNs, and it is responsible for the "loss is NaN at iteration 7" failure mode that anyone who has trained vanilla RNNs has met.
If $\rho = 0.9$, the recurrent matrix is just slightly contractive,
$$0.9^{100} \;\approx\; 2.66 \times 10^{-5}$$
The gradient at $t = 1$ is essentially zero. The network cannot learn to associate the loss at the end of a sentence with anything that happened more than a few words back. This is why vanilla RNNs cannot learn long-range dependencies: the gradient signal physically cannot reach the early time steps.
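The $\rho^T$ behaviour is easy to see numerically. The sketch below (NumPy, and a linear recurrence: it drops the $\tanh$ derivative, which would only make the vanishing case worse) builds a recurrent matrix with spectral radius exactly $\rho$ by scaling a random orthogonal matrix, then multiplies its Jacobian through $T = 100$ steps.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 100

def bptt_gradient_norm(rho):
    """Norm of the Jacobian of h_T with respect to the initial state for the
    linear recurrence h_t = W h_{t-1}, with W a random orthogonal matrix
    scaled to spectral radius rho."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    W = rho * Q
    J = np.eye(d)
    for _ in range(T):
        J = W @ J                  # one Jacobian factor per time step
    return np.linalg.norm(J, 2)

for rho in (0.9, 1.0, 1.1):
    print(f"rho = {rho:3.1f}: gradient scale over {T} steps = {bptt_gradient_norm(rho):.3e}")
```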
The interval between $\rho = 0.9$ and $\rho = 1.1$ is the entire trainable regime, and even within that regime the gradient is biased exponentially with sequence length. Keeping the recurrent matrix's $\rho$ at exactly one is the central problem of recurrent network design. Two answers emerged. The cheap answer is gradient clipping: rescale any gradient whose norm exceeds a threshold $\tau$ by $\tau / \|\mathbf{g}\|$. This stops explosions but does nothing for vanishing. The structural answer is gating: design the recurrent cell so that information can flow through time along a path whose Jacobian is close to the identity. The LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al. 2014) both implement this idea: the cell state in an LSTM is updated additively, $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$, and when the forget gate $\mathbf{f}_t$ is close to one the gradient flows through the cell state with multiplier near one regardless of $T$. This "constant error carousel" is what lets LSTMs span long contexts; the LSTM was the dominant sequence model from 1997 until the transformer era.
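The cheap answer amounts to a couple of lines in most frameworks. Here is a minimal sketch of norm-based clipping written out by hand (in practice you would call `torch.nn.utils.clip_grad_norm_`, which does the same thing); the two toy parameters and the artificially huge gradient are made up purely for illustration.

```python
import torch

def clip_gradients(params, tau=1.0):
    """Rescale the global gradient by min(1, tau / ||g||), as described above."""
    total = torch.sqrt(sum((p.grad ** 2).sum()
                           for p in params if p.grad is not None)).item()
    scale = min(1.0, tau / (total + 1e-12))
    for p in params:
        if p.grad is not None:
            p.grad.mul_(scale)
    return total

# Two toy parameters with a deliberately enormous gradient.
w1 = torch.zeros(3, requires_grad=True)
w2 = torch.zeros(2, requires_grad=True)
(1000.0 * w1.sum() + 500.0 * w2.sum()).backward()

before = clip_gradients([w1, w2], tau=1.0)
after = torch.sqrt(w1.grad.pow(2).sum() + w2.grad.pow(2).sum()).item()
print(f"gradient norm before clipping: {before:.1f}, after: {after:.3f}")
```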
Modern fixes: a survey
A handful of complementary techniques together largely solve the vanishing and exploding gradient problem in modern networks.
- ReLU and friends. Replace saturating activations like sigmoid and tanh with non-saturating ones like ReLU, leaky ReLU, GELU, or SiLU. The activation function itself stops attenuating gradients, leaving only the weight matrices to worry about.
- He / Xavier initialisation. Scale the variance of initial weights by $1/n_{\text{in}}$ (Xavier, for tanh) or $2/n_{\text{in}}$ (He, for ReLU) so that the variance of activations and gradients stays approximately constant across layers. This was section 9.10's topic and makes the per-layer multiplier close to one in expectation.
- Batch normalisation and layer normalisation. Normalise the activations at each layer to have mean zero and variance one (across the mini-batch for batch norm, across the feature dimension for layer norm). This prevents scale drift from compounding across layers and makes optimisation far more robust. Layer norm in particular is used in every transformer. Section 9.13 covers this in detail.
- Residual connections. Add the input back to the output of each block: $\mathbf{a}^{(\ell+1)} = \mathbf{a}^{(\ell)} + F(\mathbf{a}^{(\ell)})$. The Jacobian of this map is $I + \partial F / \partial \mathbf{a}^{(\ell)}$, so the gradient now has an additive identity path that propagates without compounding small factors. ResNet (He et al. 2015) used this trick to make 152-layer networks trainable, and every modern transformer relies on it.
- Gradient clipping. Clip the global gradient norm to a fixed maximum, typically $1.0$ for transformers and $0.25$ for RNNs: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, \tau / \|\mathbf{g}\|)$. Cheap insurance against explosions but does not address vanishing.
- Highway networks and gated residuals. A soft version of residual connections in which a learned gate $\mathbf{T}$ blends the residual with the transform: $\mathbf{a}^{(\ell+1)} = \mathbf{T} \odot F(\mathbf{a}^{(\ell)}) + (1 - \mathbf{T}) \odot \mathbf{a}^{(\ell)}$. Largely subsumed by simple residuals in practice but historically important.
- LSTMs and GRUs. Gated recurrent cells that route gradient information through an additive cell state, giving a constant-error-carousel path that bypasses multiplicative compounding through time.
- Skip connections in U-Nets, transformers, and diffusion models. The same residual principle applied across longer ranges, for example, U-Net skip connections from encoder to decoder, or the residual stream that runs the full depth of a transformer. Modern segmentation, generative, and language models all rely on these long-range identity paths.
These mitigations stack: a modern transformer uses GELU activations, careful initialisation, layer normalisation, residual connections, and gradient clipping all at once, as the sketch below illustrates. The cumulative effect is that 100-layer networks now train without difficulty, where in 2010 thirty layers was considered the practical limit.
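To make the stacking concrete, here is a minimal PyTorch sketch of the standard pre-norm residual pattern: layer normalisation, a GELU feed-forward transform, and an identity skip, repeated 48 times, with defensive gradient clipping before the optimiser step. The module structure and dimensions are illustrative assumptions, not a specific published architecture; initialisation is left at PyTorch's defaults.

```python
import torch
import torch.nn as nn

class ResidualFFNBlock(nn.Module):
    """Pre-norm residual feed-forward block: x + FF(LayerNorm(x))."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))      # identity path + learned residual

model = nn.Sequential(*[ResidualFFNBlock() for _ in range(48)])
x = torch.randn(8, 256, requires_grad=True)
model(x).pow(2).mean().backward()             # arbitrary scalar loss

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # defensive clipping
print("gradient norm reaching the input:", x.grad.norm().item())
```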
Why ResNet was such a turning point
Before ResNet (He et al. 2015), serious efforts to train networks beyond about thirty layers consistently failed: not because of overfitting, but because the optimisation simply could not move the early layers. Researchers had observed the puzzling fact that making a network deeper often made its training error worse, which should be impossible in the limit of perfect optimisation, since the extra layers could simply learn the identity function and leave the shallower solution intact. The diagnosis was that gradient descent could not actually find the identity solution through the multiplicative gradient bottleneck.
ResNet's residual connection is a structural fix. Instead of asking each block to compute the desired transformation $H(\mathbf{a})$, ask it to compute the residual $F(\mathbf{a}) = H(\mathbf{a}) - \mathbf{a}$, and define the block output as $\mathbf{a}_{\text{out}} = \mathbf{a} + F(\mathbf{a})$. The two formulations are mathematically equivalent in expressivity, but they are radically different from an optimisation perspective. The gradient of the block with respect to its input is $I + \partial F / \partial \mathbf{a}$, an identity matrix plus a learned term. When we back-propagate through $L$ such blocks the total Jacobian is
$$\prod_{\ell=1}^{L} \left( I + \frac{\partial F_\ell}{\partial \mathbf{a}^{(\ell-1)}} \right)$$
Expanded, this is a sum of $2^L$ terms, one of which is the bare identity matrix. The gradient computation has effectively been turned from a long product of Jacobians into a long sum of products of Jacobian factors, and as long as one term in the sum is large the gradient does not vanish. The identity path gives the gradient a bypass.
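A small experiment makes the bypass visible. The sketch below (PyTorch; the width, depth, and sigmoid non-linearity are illustrative choices) compares the gradient norm reaching the input of a 50-layer plain stack with that of the same stack wrapped in identity skip connections. With this initialisation the plain stack's input gradient collapses to numerical dust, while the residual stack's stays within a small factor of the upstream gradient because the identity term in every block's Jacobian carries it through.

```python
import torch

torch.manual_seed(0)
depth, width = 50, 64

def input_gradient_norm(residual):
    """Gradient norm at the input of a deep sigmoid stack, with or without
    an identity skip connection around every layer."""
    Ws = [torch.randn(width, width) / width ** 0.5 for _ in range(depth)]
    x = torch.randn(width, requires_grad=True)
    a = x
    for W in Ws:
        h = torch.sigmoid(W @ a)
        a = a + h if residual else h          # residual: Jacobian is I + dF/da
    a.sum().backward()
    return x.grad.norm().item()

print("plain 50-layer stack:    %.3e" % input_gradient_norm(residual=False))
print("residual 50-layer stack: %.3e" % input_gradient_norm(residual=True))
```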
The empirical effect was dramatic. ResNet-152, 152 layers deep, trained end to end with vanilla SGD on ImageNet and won the 2015 ILSVRC. Follow-up work trained 1001-layer ResNets that outperformed shallower ones. The same trick is now ubiquitous: every transformer block has residuals around its attention and feed-forward sub-layers, every U-Net (including the U-Nets inside diffusion models) carries encoder features across to the decoder, and every modern generative model relies on the same identity paths. Residual connections are the single most important architectural idea of the deep-learning era for solving the gradient problem.
Practical diagnostic
How do you tell, on a real run, whether your network is suffering from vanishing or exploding gradients?
- Print gradient norms layer by layer. Most frameworks expose per-parameter `.grad` tensors after a backward pass. Compute the norm of each layer's gradient and log them (a minimal sketch follows this list). If the norm at layer 1 is $10^{-8}$ times the norm at layer $L$, you have classic vanishing. If layer 1's norm is hundreds of times layer $L$'s, you have exploding.
- Watch the loss in the first few iterations. If the loss is constant (neither decreasing nor increasing) within ten or twenty steps, vanishing is the most likely culprit: the early layers are not learning, so the network is essentially a shallow head sitting on a fixed random feature extractor.
- Watch for `NaN`. If the loss becomes `NaN` within a few iterations, you have explosion: activations, gradients, or weights overflowed to infinity.
- Check intermediate accuracy. In tasks where you can probe intermediate layers, if overall training accuracy improves but probes attached to the front half of the network perform no better than chance, front-end vanishing is likely.
- Use gradient clipping defensively. In any RNN or transformer, call `clip_grad_norm_(params, 1.0)` (or $0.25$ for vanilla RNNs) before the optimiser step. It costs almost nothing and prevents the run from crashing on a single rare bad batch.
- Add layer norm and residual connections. If a network is hard to train at depth, adding layer norm before each block and ensuring every block has a residual connection is usually sufficient.
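A minimal version of the first diagnostic, written for PyTorch (the deliberately bad 20-layer sigmoid MLP is just an illustration of what vanishing looks like in the log):

```python
import torch
import torch.nn as nn

def log_gradient_norms(model):
    """Print the gradient norm of every parameter after loss.backward().
    Norms that shrink towards the early layers signal vanishing; a norm far
    larger than the rest signals exploding."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name:12s} |grad| = {p.grad.norm().item():.3e}")

# A deliberately bad network: a deep sigmoid MLP with default initialisation.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x, y = torch.randn(16, 32), torch.randn(16, 1)
nn.functional.mse_loss(model(x), y).backward()
log_gradient_norms(model)        # watch the norms collapse towards layer 0

# Defensive clipping before the optimiser step (threshold is illustrative):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```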
These diagnostics take five minutes to add and save days of guessing.
What you should take away
- Depth multiplies gradients. A small per-layer error in the gradient scaling factor compounds exponentially with depth, producing either vanishing or exploding gradients. Both prevent learning, and the trainable window is narrow.
- Saturating activations make vanishing worse. Sigmoid and tanh have small derivatives outside a narrow active region; their derivatives multiplied across many layers go to zero quickly. ReLU has derivative exactly one for active units and was the breakthrough that allowed deep feed-forward networks to train.
- Initialisation sets the per-layer multiplier. Xavier and He initialisation choose weight variances so that the typical multiplier per layer is close to one, putting the network in the trainable regime from the very first step.
- Residual connections give the gradient a bypass. Adding identity paths turns the gradient from a long product into a long sum, so it cannot vanish below a constant fraction of the upstream norm. This is why ResNet, transformers, U-Nets, and diffusion models all use residuals.
- In RNNs the problem is governed by the spectral radius of the recurrent matrix. Vanilla RNNs cannot hold $\rho$ near $1$ over long sequences. LSTMs and GRUs solve this with additive gated cell states; gradient clipping protects against explosions; transformers sidestep recurrence entirely by paying the $O(T^2)$ cost of attention.