Glossary

LSTM

Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997, is a recurrent neural network architecture designed to address the vanishing gradient problem of simple RNNs. The central innovation is a dedicated cell state $\mathbf{c}_t$ that runs through the sequence with only linear interactions, allowing gradients to flow over long distances. Information is added to or removed from the cell state through three learned gates.

The forget gate $\mathbf{f}_t$ decides what information from the previous cell state to discard. The input gate $\mathbf{i}_t$ decides which new information to write. The output gate $\mathbf{o}_t$ decides what to expose as the hidden state $\mathbf{h}_t$. Each gate is a sigmoid-activated layer producing values in $[0, 1]$. The cell state update is $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$, where $\tilde{\mathbf{c}}_t$ is the candidate update. Because this update uses only element-wise multiplication and addition, rather than repeated matrix multiplications through a squashing nonlinearity, the gradient along the cell state is scaled just by the forget-gate activations; when those stay near one, this "gradient highway" lets error signals travel across many time steps.
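The gate equations above can be sketched as a single LSTM time step in NumPy. This is a minimal illustrative sketch, not a reference implementation: the weight shapes, the `[forget, input, candidate, output]` stacking order, and the function names are assumptions for the example, though the stacking convention is common in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative sketch).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    stacked in [forget, input, candidate, output] order (an assumed convention).
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])            # forget gate: what to discard from c_prev
    i = sigmoid(z[H:2*H])          # input gate: what new information to write
    c_tilde = np.tanh(z[2*H:3*H])  # candidate update c-tilde
    o = sigmoid(z[3*H:4*H])        # output gate: what to expose as h_t
    c = f * c_prev + i * c_tilde   # cell state update: the "gradient highway"
    h = o * np.tanh(c)             # hidden state exposed to the next layer
    return h, c
```

Note that the line computing `c` contains no matrix multiplication, which is exactly the linear interaction the definition describes: the gradient of `c` with respect to `c_prev` along this path is just the forget-gate vector `f`.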

LSTMs dominated sequence modelling from roughly 2014 to 2017, powering machine translation, speech recognition, language modelling, sentiment analysis, and many other tasks. They have since been largely supplanted by transformers, whose parallel processing and attention-based long-range modelling typically outperform recurrent approaches. Nonetheless, LSTMs remain useful in streaming applications and on small datasets, and their core insights about gating have influenced many subsequent architectures.

Related terms: Recurrent Neural Network, GRU, Vanishing Gradient

Discussed in:

Also defined in: Textbook of AI