The forget gate keeps gradients alive over long sequences.
From the chapter: Chapter 12: Sequence Models
Glossary: lstm, recurrent neural network, vanishing gradient
People: Sepp Hochreiter, Jürgen Schmidhuber
Transcript
A vanilla recurrent network forgets quickly. The LSTM cell, introduced by Hochreiter and Schmidhuber in 1997, fixes this with a clever pipe through time.
The blue line at the top is the cell state, the memory pipe. Three gates regulate how it changes.
The forget gate f controls how much of the previous memory survives. The input gate i, multiplied by the candidate g, controls how much new information is written.
The output gate o decides how much of the cell state leaks out as the visible hidden state h.
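To make the gate roles concrete, here is a minimal single-step sketch in NumPy. The weight names (W_f, W_i, W_g, W_o) are illustrative assumptions, and biases are omitted for brevity; this is not code from the source, just the standard cell update it describes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: gates f, i, o and candidate g update the cell state c."""
    W_f, W_i, W_g, W_o = params      # hypothetical weights, each acting on [h_prev; x]
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)             # forget gate: how much old memory survives
    i = sigmoid(W_i @ z)             # input gate: how much new content is written
    g = np.tanh(W_g @ z)             # candidate values to write
    o = sigmoid(W_o @ z)             # output gate: how much of c leaks into h
    c = f * c_prev + i * g           # the memory pipe: additive through time
    h = o * np.tanh(c)               # visible hidden state
    return h, c

The key design choice is the additive update c = f * c_prev + i * g: gradients flow back through the cell state via elementwise multiplication by f at each step, rather than through a repeated weight matrix and nonlinearity.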
When the forget gate stays near one, the cell state survives across all fifty time steps. The gradient, in red, barely shrinks. This is the constant error carousel.
When the forget gate drops toward zero, memory leaks out within a few steps. The gradient collapses, just like a vanilla RNN.
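To see this numerically: along the cell-state path, the gradient of c_T with respect to c_0 is, to first order (ignoring the gates' own dependence on the state), just the product of the forget-gate values. A quick sketch with assumed constant gate values over the fifty steps:

import numpy as np

T = 50
for f in (0.99, 0.5):
    # Along the cell-state path, dc_T/dc_0 is roughly the product of forget gates.
    grad = np.prod(np.full(T, f))
    print(f"f = {f}: gradient after {T} steps ~ {grad:.2e}")

# f = 0.99: ~ 6.05e-01  (barely shrinks)
# f = 0.5:  ~ 8.88e-16  (collapses, like a vanilla RNN)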
That is the trick. Multiplicative gates keep gradients alive across hundreds of steps, opening the door to long-range language modelling.