12.7 The vanishing and exploding gradient problem
The product of step-Jacobians $\prod_{j=k+1}^{t} J_j$ governs how a perturbation at step $k$ propagates forward to step $t$, and equivalently how a gradient signal from step $t$ propagates backward to step $k$. Bengio, Simard and Frasconi (1994) and later Pascanu, Mikolov and Bengio (2013) gave a formal analysis.
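The following NumPy sketch makes this concrete for a small $\tanh$ RNN; the sizes, weight scales, and inputs are illustrative assumptions, not values from the text. It accumulates the Jacobian product $\prod_{j=k+1}^{t} J_j$ during the forward pass and checks it against a finite-difference perturbation of $h_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 4, 10                            # hidden size and sequence length (assumed)
W_hh = rng.normal(0, 0.5, (H, H))       # recurrent weights (illustrative scale)
W_xh = rng.normal(0, 0.5, (H, H))       # input-to-hidden weights
xs = rng.normal(size=(T, H))            # an arbitrary input sequence

def step(h, x):
    return np.tanh(W_hh @ h + W_xh @ x)

# Forward pass, storing hidden states h_0, ..., h_T.
hs = [np.zeros(H)]
for x in xs:
    hs.append(step(hs[-1], x))

# Accumulate dh_t/dh_k = J_t ... J_{k+1}, with J_j = diag(1 - h_j^2) W_hh.
k, t = 3, T
J_prod = np.eye(H)
for j in range(k + 1, t + 1):
    J_prod = np.diag(1 - hs[j] ** 2) @ W_hh @ J_prod

# Finite-difference check along a random direction v.
def roll(h, start):                     # apply steps start+1, ..., T to h
    for x in xs[start:]:
        h = step(h, x)
    return h

eps, v = 1e-6, rng.normal(size=H)
fd = (roll(hs[k] + eps * v, k) - roll(hs[k] - eps * v, k)) / (2 * eps)
print(np.allclose(J_prod @ v, fd, atol=1e-6))   # True
```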
12.7.1 Spectral analysis
Suppose for simplicity the step-Jacobian is constant, $J_j = J$ (a reasonable approximation when the hidden states vary slowly). Then
$$\frac{\partial h_t}{\partial h_k} = J^{t - k}.$$
Decompose $J$ into its eigenbasis (assumed diagonalisable for simplicity): $J = P \Lambda P^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_H)$. Then
$$J^{t-k} = P \Lambda^{t-k} P^{-1}, \qquad \|J^{t-k}\| \approx \rho(J)^{t-k},$$
where $\rho(J) = \max_i |\lambda_i|$ is the spectral radius. Three regimes:
- $\rho(J) > 1$: the Jacobian product grows exponentially. Gradients explode.
- $\rho(J) < 1$: the product shrinks exponentially. Gradients vanish.
- $\rho(J) = 1$: the product is bounded, but achieving and maintaining this exactly is a measure-zero condition.
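The three regimes are easy to see numerically. In the sketch below (an illustrative construction, not from the text), $J$ is a scaled orthogonal matrix, so its spectral radius is exactly the scale factor and $\|J^n\|_2 = \rho(J)^n$ holds exactly rather than only asymptotically:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))        # orthogonal: all |lambda_i| = 1

for rho in (1.1, 1.0, 0.9):
    J = rho * Q                                     # spectral radius exactly rho
    for n in (10, 50, 100):
        print(rho, n, np.linalg.norm(np.linalg.matrix_power(J, n), 2))
# rho = 1.1 explodes (1.1^100 ~ 1.4e4), rho = 0.9 vanishes (0.9^100 ~ 2.7e-5),
# and rho = 1.0 stays at exactly 1 because Q is orthogonal.
```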
For a $\tanh$ RNN, the step-Jacobian is $J_j = \mathrm{diag}(1 - \tanh^2 z_j)\, W_{hh}$. The diagonal entries lie in $(0, 1]$, so $\|J_j\|_2$ is bounded above by $\sigma_{\max}(W_{hh})$, the largest singular value of $W_{hh}$. This bound is rarely attained: the saturating non-linearity drives the diagonal entries toward zero whenever units operate away from the origin, shrinking the Jacobian. As a result, vanilla RNNs are dominated by vanishing rather than exploding gradients.
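A small numerical check (random $W_{hh}$ and pre-activations $z$, chosen purely for illustration) shows the gap between $\|J_j\|_2$ and $\sigma_{\max}(W_{hh})$ widening as the units saturate:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64
W_hh = rng.normal(size=(H, H)) / np.sqrt(H)          # illustrative recurrent weights
print("sigma_max(W_hh) =", np.linalg.norm(W_hh, 2))

for scale in (0.1, 1.0, 5.0):                        # larger scale = more saturation
    z = scale * rng.normal(size=H)                   # pre-activations (assumed)
    J = np.diag(1 - np.tanh(z) ** 2) @ W_hh          # step-Jacobian
    print(scale, np.linalg.norm(J, 2))
# As scale grows, 1 - tanh^2(z) -> 0 for most units and ||J||_2 collapses well
# below sigma_max(W_hh), so products of many such Jacobians shrink
# exponentially: vanishing gradients.
```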
12.7.2 Symptoms in practice
Vanishing gradients manifest as an inability to learn long-range dependencies. The network performs adequately on local statistics (bigrams, short phrases) but fails on tasks requiring memory of inputs more than 10 or 20 steps back. The classic synthetic test is the adding problem: read a sequence of 100 numbers, each paired with a marker that is 1 at exactly two positions and 0 elsewhere, and output the sum of the two marked numbers. Vanilla RNNs cannot solve this for sequences longer than about 10 steps; LSTMs solve it for 1000 or more. A data generator for this task is sketched below.
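A minimal sketch of an adding-problem batch generator, following the standard formulation described above; the function name, shapes, and batch size are our own choices:

```python
import numpy as np

def adding_problem(batch_size, T, rng):
    """Inputs: (value, marker) pairs per step; target: sum of the two marked values."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    for b in range(batch_size):
        i, j = rng.choice(T, size=2, replace=False)  # the two marked positions
        markers[b, i] = markers[b, j] = 1.0
    x = np.stack([values, markers], axis=-1)         # shape (batch_size, T, 2)
    y = (values * markers).sum(axis=1)               # shape (batch_size,)
    return x, y

x, y = adding_problem(batch_size=32, T=100, rng=np.random.default_rng(0))
```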
Exploding gradients manifest as numerical instability: loss values become NaN, weights blow up, training diverges. The standard fix is gradient clipping (Pascanu et al., 2013): rescale the gradient if its norm exceeds a threshold $\theta$,
$$g \gets \begin{cases} g & \text{if } \|g\| \le \theta, \\ g \cdot \theta / \|g\| & \text{otherwise}. \end{cases}$$
Clipping is cheap, robust, and almost universally applied in RNN training. Typical $\theta$ values are 1, 5, or 10 depending on the task.
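A minimal sketch of clipping by global norm, directly implementing the rule above; the list of gradient arrays and the threshold are placeholders:

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays in place if their joint L2 norm
    exceeds threshold; otherwise leave them untouched."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > threshold:
        for g in grads:
            g *= threshold / total
    return grads
```

Most frameworks ship this as a built-in; for example, PyTorch provides `torch.nn.utils.clip_grad_norm_`.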
Clipping fixes explosion but does not fix vanishing; there is no analogous trick to "amplify" small gradients without amplifying noise. The architectural fix for vanishing is gating, to which we now turn.