The Vanishing Gradient Problem is a fundamental challenge in training deep neural networks. Because backpropagation computes gradients via the chain rule—multiplying many factors together as gradients flow backward through layers—the product can shrink exponentially if the factors are typically smaller than 1 in magnitude, or grow exponentially if they typically exceed 1. Shrinking produces vanishing gradients; growing produces exploding gradients. Both are disastrous for training.
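The exponential effect of repeated multiplication can be seen with two hypothetical per-layer gradient factors (0.9 and 1.1 are illustrative values, not derived from any particular network):

```python
# Sketch: what repeated chain-rule multiplication does over 50 layers,
# assuming a constant (hypothetical) gradient factor per layer.
depth = 50
shrink = 0.9 ** depth  # factor < 1 at every layer: vanishing
grow = 1.1 ** depth    # factor > 1 at every layer: exploding

print(shrink)  # ~0.005: the gradient has all but vanished
print(grow)    # ~117: the gradient is blowing up
```

Even factors quite close to 1 compound dramatically over tens of layers, which is why depth alone makes the problem acute.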
Vanishing gradients are especially severe with saturating activation functions like sigmoid and tanh, whose derivatives approach zero for inputs of large magnitude (the sigmoid's derivative never exceeds 0.25). In a deep network with sigmoid activations, gradients from the output layer can become effectively zero by the time they reach the first few layers, making it impossible for those early layers to learn. This was a major reason deep networks were considered impractical before the mid-2000s.
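A minimal sketch of sigmoid saturation, using only the standard sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))  # 0.25 -- the maximum possible value
print(sigmoid_prime(5.0))  # ~0.0066 -- the saturated regime

# Best case: 30 sigmoid layers each contribute at most a factor 0.25.
print(0.25 ** 30)  # ~8.7e-19: effectively zero for the early layers
```

Because each layer's contribution is at most 0.25, the gradient reaching early layers shrinks at least geometrically in depth, regardless of the weights.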
Several innovations addressed vanishing gradients. ReLU activations have gradient 1 for positive inputs, avoiding saturation. Residual connections provide direct paths for gradients to bypass many layers unchanged—the central innovation of ResNet that enabled training of networks with hundreds of layers. Batch normalisation stabilises the distribution of activations and thus gradients throughout training. Careful weight initialisation (Xavier/Glorot, Kaiming/He) sets initial scales to preserve gradient magnitudes across layers. LSTM and GRU gating mechanisms create "gradient highways" in recurrent networks, enabling learning of long-range dependencies. For exploding gradients, gradient clipping caps the gradient norm to prevent numerical instability. Together, these techniques have made training of very deep networks routine.
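Of the remedies above, gradient clipping is the simplest to sketch. The following is a minimal illustration (the function name and the threshold of 5.0 are arbitrary choices for the example, not a standard API):

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

g = [30.0, 40.0]               # L2 norm 50: an "exploding" gradient
clipped = clip_by_norm(g, 5.0)
print(clipped)                 # [3.0, 4.0] -- norm capped at 5, direction kept
```

Note that clipping preserves the gradient's direction and caps only its magnitude, which is why it prevents numerical instability without changing which way the optimiser steps.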
Related terms: Backpropagation, Residual Connection, ReLU, Batch Normalisation, LSTM
Discussed in:
- Chapter 3: Calculus — The Chain Rule
- Chapter 9: Neural Networks — Backpropagation
Also defined in: Textbook of AI