The Vanishing Gradient Problem is a fundamental challenge in training deep neural networks. Because backpropagation computes gradients via the chain rule—multiplying many factors together as gradients flow backward through layers—the product can shrink exponentially if the factors are typically smaller than 1 in magnitude, or grow exponentially if they typically exceed 1. Shrinking produces vanishing gradients; growing produces exploding gradients. Both are disastrous for training.
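The exponential effect of repeated multiplication can be seen with two hypothetical per-layer gradient factors (0.9 and 1.1 are illustrative values, not derived from any particular network):

```python
# Sketch: what repeated chain-rule multiplication does over 50 layers,
# assuming a constant (hypothetical) gradient factor per layer.
depth = 50
shrink = 0.9 ** depth  # factor < 1 at every layer: vanishing
grow = 1.1 ** depth    # factor > 1 at every layer: exploding

print(shrink)  # ~0.005: the gradient has all but vanished
print(grow)    # ~117: the gradient is blowing up
```

Even factors quite close to 1 compound dramatically over tens of layers, which is why depth alone makes the problem acute.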
Vanishing gradients are especially severe with saturating activation functions like sigmoid and tanh, whose derivatives approach zero for inputs of large magnitude (the sigmoid's derivative never exceeds 0.25). In a deep network with sigmoid activations, gradients from the output layer can become effectively zero by the time they reach the first few layers, making it impossible for those early layers to learn. This was a major reason deep networks were considered impractical before the mid-2000s.
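A minimal sketch of sigmoid saturation, using only the standard sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))  # 0.25 -- the maximum possible value
print(sigmoid_prime(5.0))  # ~0.0066 -- the saturated regime

# Best case: 30 sigmoid layers each contribute at most a factor 0.25.
print(0.25 ** 30)  # ~8.7e-19: effectively zero for the early layers
```

Because each layer's contribution is at most 0.25, the gradient reaching early layers shrinks at least geometrically in depth, regardless of the weights.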
Several innovations addressed vanishing gradients. ReLU activations have gradient 1 for positive inputs, avoiding saturation. Residual connections provide direct paths for gradients to bypass many layers unchanged—the central innovation of ResNet that enabled training of networks with hundreds of layers. Batch normalisation stabilises the distribution of activations and thus gradients throughout training. Careful weight initialisation (Xavier/Glorot, Kaiming/He) sets initial scales to preserve gradient magnitudes across layers. LSTM and GRU gating mechanisms create "gradient highways" in recurrent networks, enabling learning of long-range dependencies. For exploding gradients, gradient clipping caps the gradient norm to prevent numerical instability. Together, these techniques have made training of very deep networks routine.
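Of the remedies above, gradient clipping is the simplest to sketch. The following is a minimal illustration (the function name and the threshold of 5.0 are arbitrary choices for the example, not a standard API):

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

g = [30.0, 40.0]               # L2 norm 50: an "exploding" gradient
clipped = clip_by_norm(g, 5.0)
print(clipped)                 # [3.0, 4.0] -- norm capped at 5, direction kept
```

Note that clipping preserves the gradient's direction and caps only its magnitude, which is why it prevents numerical instability without changing which way the optimiser steps.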
Related terms: Backpropagation, Residual Connection, ReLU, Batch Normalisation, LSTM
Discussed in:
- Chapter 3: Calculus — The Chain Rule
- Chapter 9: Neural Networks — Backpropagation
Also defined in: Textbook of AI