LSTM, Glossary, Textbook of AI

Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997, is a recurrent neural network architecture designed to address the vanishing gradient problem of simple RNNs. The central innovation is a dedicated cell state $\mathbf{c}_t$ that runs through the sequence with only element-wise, linear interactions, allowing gradients to flow over long distances. In the now-standard formulation, information is added to or removed from the cell state through three learned gates.

The forget gate $\mathbf{f}_t$ decides what information from the previous cell state to discard. The input gate $\mathbf{i}_t$ decides which new information to write. The output gate $\mathbf{o}_t$ decides what to expose as the hidden state. Each gate is a sigmoid-activated layer producing values in $(0, 1)$. The cell state update is $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$, where $\tilde{\mathbf{c}}_t$ is the candidate update. Because this update is element-wise and additive, the gradient along the cell state is scaled only by the forget gate at each step. When $\mathbf{f}_t$ stays close to 1, this creates a "gradient highway" along which error signals can travel hundreds of steps.
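As a small illustration of the cell state update, the sketch below applies it to a three-unit cell. The gate activations and state values are made up for the example, chosen so that each unit shows a different behaviour (keep, overwrite, blend).

```python
import numpy as np

# Hypothetical values for a 3-unit cell, chosen only for illustration.
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep, erase, halve
i_t = np.array([0.0, 1.0, 0.5])       # input gate: block, write, half-write
c_prev = np.array([2.0, -1.0, 4.0])   # previous cell state c_{t-1}
c_tilde = np.array([0.5, 0.5, -1.0])  # candidate update

# Element-wise update: c_t = f_t * c_{t-1} + i_t * c_tilde
c_t = f_t * c_prev + i_t * c_tilde
print(c_t.tolist())  # [2.0, 0.5, 1.5]
```

The first unit carries its old value forward unchanged, the second is replaced by the candidate, and the third mixes the two.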

LSTMs dominated sequence modelling from roughly 2014 to 2017, powering machine translation, speech recognition, language modelling, sentiment analysis, and many other tasks. They have since been largely supplanted by transformers, whose parallel processing and attention-based long-range modelling typically outperform recurrent approaches. Nonetheless, LSTMs remain useful in streaming applications and on small datasets, and their core insights about gating have influenced many subsequent architectures.

Mathematics

An LSTM cell at time $t$ takes input $x_t \in \mathbb{R}^{d_x}$ and previous hidden and cell states $h_{t-1}, c_{t-1} \in \mathbb{R}^{d_h}$. Three gates modulate the flow of information through the cell state:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)}$$

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)}$$

A candidate cell state is computed:

$$\tilde c_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$

The new cell state combines the previous state (gated by forget) with the candidate (gated by input):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$$

The hidden state is the cell state passed through $\tanh$ and the output gate:

$$h_t = o_t \odot \tanh(c_t)$$
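The equations above can be sketched as a single forward step in NumPy. This is a minimal illustration rather than a reference implementation: the function names `lstm_step` and `init_params`, the dictionary layout of the parameters, the small random initialisation and the dimensions are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of an LSTM cell; `p` maps names such as "W_f" to arrays."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t

def init_params(d_x, d_h, seed=0):
    """Hypothetical initialiser: small Gaussian weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "fioc":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_x))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    return p

d_x, d_h = 3, 4
params = init_params(d_x, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
# Run the cell over a random sequence of 5 inputs.
for x_t in np.random.default_rng(1).normal(size=(5, d_x)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Practical implementations usually stack the four weight blocks into one matrix so that each step needs a single matrix multiplication; the separate blocks here are kept only to mirror the equations.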

The constant error carousel is the additive cell-state update: along the direct cell-state path, $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$, so gradients are scaled element-wise by the forget gate rather than repeatedly multiplied by a recurrent weight matrix and an activation derivative, which is what shrinks gradients in vanilla RNNs. With $f_t \approx 1$ (forget gate near open), gradients propagate over many timesteps; this is what makes LSTMs trainable on long sequences.
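To make the effect of the forget gate concrete, the sketch below multiplies the per-step factors $f_t$ for a single cell unit over 100 timesteps. It considers only the direct cell-state path and uses a constant, made-up gate value at every step, so it illustrates the scaling rather than the full backpropagated gradient.

```python
def direct_path_gradient(forget_values):
    """Product of the per-step factors f_t along the direct cell-state path (one unit)."""
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad

T = 100  # number of timesteps (illustrative)
open_gate = direct_path_gradient([0.99] * T)  # forget gate almost fully open
half_gate = direct_path_gradient([0.5] * T)   # forget gate half closed

print(f"{open_gate:.3f}")  # 0.366
print(f"{half_gate:.1e}")  # 7.9e-31
```

With the gate near 1 a usable fraction of the gradient survives 100 steps, while a gate at 0.5 reduces it to effectively zero.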

Total parameters per LSTM layer: $4(d_h^2 + d_x d_h + d_h)$, that is, four weight blocks (the three gates plus the candidate), each with $d_h^2$ recurrent weights, $d_x d_h$ input weights and $d_h$ biases.
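The parameter count can be checked with a one-line helper. The function name `lstm_param_count` and the example dimensions are assumptions chosen for illustration.

```python
def lstm_param_count(d_x, d_h):
    """Parameters in one LSTM layer: four blocks of recurrent weights, input weights and biases."""
    return 4 * (d_h * d_h + d_x * d_h + d_h)

print(lstm_param_count(d_x=128, d_h=256))  # 394240
```

Library implementations may report a slightly different total, for example when they store separate input and recurrent bias vectors.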

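The GRU (Gated Recurrent Unit, Cho et al. 2014) merges the input and forget gates into a single update gate and drops the separate cell state, leaving three weight blocks instead of four. For the same $d_x$ and $d_h$ this cuts the parameter count by a quarter, and GRUs sometimes match LSTM performance.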
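For comparison, a GRU step can be sketched in the same style. The equations are not given in this entry, so the code follows one common convention (update gate $z_t$, reset gate $r_t$, and $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t$); some references swap the roles of $z_t$ and $1 - z_t$. The function name `gru_step`, the parameter names and the dimensions are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step under one common convention; `p` maps names such as "W_z" to arrays."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # (1 - z_t) plays the role of the LSTM forget gate, z_t that of the input gate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

d_x, d_h = 3, 4
rng = np.random.default_rng(0)
p = {}
for g in "zrh":  # three weight blocks instead of the LSTM's four
    p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_x))
    p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)

h = gru_step(rng.normal(size=d_x), np.zeros(d_h), p)
n_params = sum(a.size for a in p.values())
print(n_params)  # 96
```

With the same dimensions an LSTM layer would have $4(16 + 12 + 4) = 128$ parameters, so the GRU's 96 reflects the single missing weight block.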

Interactive

LSTM cell and the constant-error carousel. The forget gate keeps gradients alive over long sequences.

Related terms: Recurrent Neural Network, GRU, Vanishing Gradient

Discussed in:

Chapter 12: Sequence Models, LSTMs & GRUs
