The vanishing-gradient problem, identified by Sepp Hochreiter in his 1991 diploma thesis, is the phenomenon that gradients of the loss with respect to weights in early layers of a deep feed-forward network, or with respect to weights at early timesteps of an unrolled recurrent network, shrink exponentially with depth or temporal extent. Naive training of deep or recurrent networks consequently fails: the early layers receive almost no learning signal.
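To see the effect numerically, here is a minimal sketch (our own toy example, not from the source; the 30-layer width-64 sigmoid MLP and the sum-of-outputs loss are arbitrary choices) that backpropagates by hand and prints the gradient norm every few layers. The norm collapses by many orders of magnitude before reaching the input.

```python
# Toy demonstration of vanishing gradients in a deep sigmoid MLP (our own sketch).
import numpy as np

rng = np.random.default_rng(0)
depth, width = 30, 64
Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping each layer's pre-activation for the backward pass.
h = np.linspace(-1.0, 1.0, width)
pre_acts = []
for W in Ws:
    z = W @ h
    pre_acts.append(z)
    h = sigmoid(z)

# Backward pass for the loss L = sum(h_final): grad holds dL/dh_layer.
grad = np.ones(width)
for layer in reversed(range(depth)):
    s = sigmoid(pre_acts[layer])
    grad = (grad * s * (1 - s)) @ Ws[layer]   # W^T (grad ⊙ σ'(z)), one chain-rule step
    if layer % 5 == 0:
        print(f"layer {layer:2d}: |dL/dh| = {np.linalg.norm(grad):.2e}")
```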
The mathematical root is the chain rule: the gradient reaching an early layer is a product of per-layer Jacobians, and if the singular values of those Jacobians are consistently below 1, the norm of the product decays exponentially with depth. The dual problem of exploding gradients arises when they are consistently above 1, leading to numerical overflow or training instability.
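To state the argument explicitly (the notation is ours, not taken from the source), write h_t for the hidden state at layer or timestep t, J_t for the Jacobian of h_t with respect to h_{t-1}, and L for the loss:

```latex
\frac{\partial L}{\partial h_1}
  = \Bigl(\frac{\partial h_T}{\partial h_1}\Bigr)^{\!\top}\frac{\partial L}{\partial h_T}
  = \bigl(J_T J_{T-1} \cdots J_2\bigr)^{\!\top}\frac{\partial L}{\partial h_T},
\qquad
\Bigl\|\frac{\partial L}{\partial h_1}\Bigr\|
  \le \Bigl(\prod_{t=2}^{T}\|J_t\|\Bigr)\Bigl\|\frac{\partial L}{\partial h_T}\Bigr\|
  \le \sigma^{T-1}\,\Bigl\|\frac{\partial L}{\partial h_T}\Bigr\|
```

where σ bounds the largest singular value of every J_t. If σ < 1, the learning signal reaching h_1 shrinks at least exponentially fast in T; if the singular values consistently exceed 1, the product can instead grow exponentially.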
The problem motivated the LSTM architecture (Hochreiter and Schmidhuber, 1997), whose "constant error carousel" preserves gradient flow through additive updates to the cell state. In feed-forward networks, the problem is mitigated by:
- ReLU activations (Nair and Hinton, 2010), whose gradient is exactly 1 in the positive regime;
- batch normalisation (Ioffe and Szegedy, 2015), which keeps activations in a stable range;
- residual connections (He et al., 2015), which let gradients flow through identity shortcuts (sketched in the example below);
- better initialisation (Glorot and Bengio, 2010; He et al., 2015), which sets the initial scale of the weights so that gradient norms are preserved across layers.

These innovations together made it possible to train networks with hundreds or thousands of layers, and were prerequisites for modern deep-learning success.
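As a minimal illustration of one item from the list above, the following toy comparison (our own sketch; the depth, width, and sigmoid nonlinearity are arbitrary choices, not from the source) measures the gradient that reaches the input of a deep chain with and without identity shortcuts:

```python
# Toy comparison (our own sketch): identity shortcuts keep the input gradient
# from vanishing in a deep sigmoid chain, the mechanism behind residual connections.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def input_gradient_norm(residual: bool, depth: int = 50, width: int = 64, seed: int = 1) -> float:
    """Backpropagate a sum-of-outputs loss through the chain and return ||dL/dh_0||."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]
    h = np.linspace(-1.0, 1.0, width)          # fixed toy input
    zs = []
    for W in Ws:                               # forward pass
        z = W @ h
        zs.append(z)
        h = (h + sigmoid(z)) if residual else sigmoid(z)
    grad = np.ones(width)                      # dL/dh_depth for L = sum(h_depth)
    for layer in reversed(range(depth)):       # backward pass
        s = sigmoid(zs[layer])
        through_block = (grad * s * (1 - s)) @ Ws[layer]   # W^T (grad ⊙ σ'(z))
        grad = (grad + through_block) if residual else through_block  # shortcut passes grad through unchanged
    return float(np.linalg.norm(grad))

print("plain chain    :", input_gradient_norm(residual=False))   # vanishes
print("with shortcuts :", input_gradient_norm(residual=True))    # stays usable
```

In the residual case the backward step is grad + W^T(grad ⊙ σ'(z)) rather than W^T(grad ⊙ σ'(z)) alone, so the gradient reaching the input cannot shrink below the contribution carried by the identity path.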
Related terms: LSTM, Sepp Hochreiter, Residual Connection, Batch Normalisation, ReLU
Discussed in:
- Chapter 9: Neural Networks, Training Neural Networks