Repeated multiplication of fractions less than one drives the gradient toward zero in deep networks.
From the chapter: Chapter 12: Sequence Models
Glossary: vanishing gradient, vanishing gradient problem
Transcript
A deep network. Twenty layers, each with a sigmoid activation.
The derivative of the sigmoid is at most one quarter. Usually much less.
Backpropagation multiplies these derivatives together, layer by layer.
One quarter. Times one quarter. Times one quarter. After ten layers, the product is around ten to the minus six.
After twenty, ten to the minus twelve. The gradient at the bottom of the network is nothing. It cannot move the early weights.
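To make that arithmetic concrete, here is a small Python sketch that evaluates the sigmoid derivative at its peak and multiplies that bound across ten and twenty layers. The helper names are illustrative, not from any particular library.

```python
# The sigmoid derivative is sigma(x) * (1 - sigma(x)), which peaks at 0.25 when x = 0.
# Multiplying that bound layer by layer shrinks the gradient geometrically.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25, the largest the derivative can ever be
print(0.25 ** 10)                # ~9.5e-07 after ten layers
print(0.25 ** 20)                # ~9.1e-13 after twenty layers
```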
Watch a histogram of gradient magnitudes through the layers. Big at the top. Tiny at the bottom. A waterfall, drying up.
This is the vanishing gradient problem, and for years it made very deep networks essentially untrainable.
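A minimal sketch of that waterfall, assuming PyTorch is available: a twenty-layer sigmoid network, one backward pass, and the mean gradient magnitude printed per layer. The layer width of 64 and the squared-output loss are arbitrary choices made only to produce gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Twenty Linear + Sigmoid pairs, followed by a final output layer.
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

x = torch.randn(32, 64)
loss = model(x).pow(2).mean()   # an arbitrary scalar loss, just to get gradients
loss.backward()

# Gradient magnitudes: large near the output, tiny near the input.
for i, layer in enumerate(model):
    if isinstance(layer, nn.Linear):
        mean_grad = layer.weight.grad.abs().mean().item()
        print(f"layer {i:2d}: mean |grad| = {mean_grad:.2e}")
```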
The opposite happens too. Multiply many numbers larger than one and you get exploding gradients instead. Gradients become inf. The loss becomes NaN.
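The same arithmetic in the other direction, sketched in plain Python: repeated factors larger than one overflow instead of shrinking toward zero.

```python
# Repeatedly multiplying by a factor greater than one eventually exceeds
# float64's ~1.8e308 limit and the result becomes inf.
product = 1.0
for step in range(2000):
    product *= 1.5
print(product)   # inf
```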
Three things were invented to fix this. ReLU, whose derivative is exactly one in the active region, breaks the multiplication of fractions. Residual connections add an identity shortcut, so the gradient has a direct path. Layer normalisation keeps the activations and gradients in a manageable range.
Together they made it possible to train networks one hundred layers deep. Without them, modern deep learning would not work.
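As a rough sketch of how the three fixes combine, again assuming PyTorch: each block applies layer normalisation and a ReLU, then adds its input back through an identity shortcut, so one hundred of them can be stacked and the gradient never has to pass through a long chain of saturating derivatives. The class name ResidualBlock and the width of 64 are illustrative choices.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        self.norm = nn.LayerNorm(width)    # keeps activations in a manageable range
        self.relu = nn.ReLU()              # derivative is exactly one where active
        self.linear = nn.Linear(width, width)

    def forward(self, x):
        # The "+ x" is the identity shortcut: the gradient has a direct path
        # around the normalisation, activation, and linear layer.
        return x + self.linear(self.relu(self.norm(x)))

model = nn.Sequential(*[ResidualBlock(64) for _ in range(100)])
out = model(torch.randn(8, 64))
print(out.shape)   # torch.Size([8, 64]), even at one hundred layers deep
```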