12.8 Long short-term memory

[Video: LSTM cell and the constant-error carousel (1:15). The forget gate keeps gradients alive over long sequences.]
The Long Short-Term Memory network, usually shortened to LSTM, was the dominant sequence model from its publication by Sepp Hochreiter and Jürgen Schmidhuber in 1997 until the rise of the Transformer twenty years later. From about 2014, when stacked LSTMs began to win speech-recognition benchmarks, until 2017, when "Attention Is All You Need" appeared, almost every state-of-the-art system that read a sentence, transcribed an utterance, or modelled a time series had an LSTM somewhere inside it. Google Translate's first neural production system was an LSTM. Amazon's Alexa wake-word detector was an LSTM. The first end-to-end speech recognisers that beat hand-engineered Gaussian-mixture models were LSTMs. The architecture has since been largely retired from frontier research, but it has not vanished: it remains a sensible default for embedded audio models, low-latency time-series forecasters, and any setting where parameter count and inference cost matter more than the last percentage point of accuracy.

The LSTM exists because the vanilla recurrent neural network, as introduced in §12.6, has a fatal weakness: the gradient that links a loss at time $t$ to a parameter that was used at time $t - k$ shrinks (or, more rarely, explodes) like the $k$-th power of the largest eigenvalue of the recurrent weight matrix. §12.7 made this problem precise. In practice it means a vanilla RNN cannot reliably learn dependencies more than ten or twenty steps long, which is hopeless for sentences, let alone paragraphs. The LSTM solves the problem with a structural change rather than a numerical trick: it adds a second internal state, the cell state, whose update is additive rather than multiplicative, and whose flow is controlled by learnt gates. Information, and the gradients that train the network to use it, can travel along the cell state for hundreds of time steps without compounding decay. §12.9 will introduce the GRU, a simpler relative of the LSTM with two gates instead of three; the comparison is easier to follow once the LSTM is fully understood.

Symbols Used Here
  • $\mathbf{x}_t$: input at time $t$
  • $\mathbf{h}_t$: hidden state (short-term, exposed to downstream layers)
  • $\mathbf{c}_t$: cell state (long-term memory, internal)
  • $\mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t$: forget, input, output gates
  • $\mathbf{g}_t$: candidate cell update (sometimes written $\tilde{\mathbf{c}}_t$)
  • $\sigma$: logistic sigmoid, squashing inputs to $(0, 1)$
  • $\odot$: element-wise (Hadamard) product

The LSTM cell

An LSTM cell has two pieces of state that travel from one time step to the next: the hidden state $\mathbf{h}_t$, which is what the rest of the network sees, and the cell state $\mathbf{c}_t$, which is private to the cell and acts as long-term memory. Both are vectors in $\mathbb{R}^H$, where $H$ is the hidden size. At each step the cell receives the current input $\mathbf{x}_t$ together with the previous hidden and cell states, and produces new ones.

Three sigmoid-activated gates do the bookkeeping. Each gate is a vector in $(0, 1)^H$ produced by a small linear layer applied to the concatenation $[\mathbf{h}_{t-1}, \mathbf{x}_t]$. Because the sigmoid saturates at the ends, a trained gate behaves almost like a soft on/off switch for each dimension of the cell state.

  • Forget gate $\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$. For each dimension of the cell state, decides how much of the previous value to keep. A value near $1$ means "remember unchanged"; a value near $0$ means "erase".
  • Input gate $\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$. Decides how much of the candidate update to write. A high value lets new information in; a low value protects the cell from being overwritten.
  • Output gate $\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$. Decides how much of the (squashed) cell state to expose as the hidden state. The cell can hold information without revealing it.

Alongside the gates, the cell computes a candidate update, a tanh-activated proposal for what to add to the cell state, using the same concatenated input.

$$\mathbf{g}_t = \tanh(\mathbf{W}_g [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_g).$$

The tanh keeps the candidate bounded in $(-1, 1)$, so each addition to the cell state is on the same order of magnitude as the existing contents. The cell state is then updated by combining "what to keep" and "what to add":

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t.$$

This is the heart of the LSTM. Note that the only operations along the cell-state path are element-wise multiplications by gate values (bounded between $0$ and $1$) and element-wise additions of bounded candidates. There is no matrix multiplication on the cell state itself, which is why the cell is sometimes described as a "constant-error carousel": error signals can ride this conveyor belt backwards without being repeatedly multiplied by an unconstrained recurrent weight matrix.

Finally the hidden state is read off the cell state through the output gate:

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).$$

The tanh on $\mathbf{c}_t$ keeps the hidden state in $(-1, 1)^H$ regardless of how large the cell state has grown internally; the output gate then chooses which dimensions to expose. Counting parameters: each of the four linear maps ($\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o, \mathbf{W}_g$) has shape $H \times (H + d)$, where $d$ is the input dimension, plus a bias vector of length $H$. The cell therefore uses $4H(H + d) + 4H$ trainable parameters, exactly four times a vanilla RNN of the same width, which is the source of the rule of thumb that an LSTM costs four times the compute and memory per step.
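
To make the bookkeeping concrete, here is a minimal from-scratch sketch of one cell step in PyTorch, following the equations above. The stacked weight matrix, its gate ordering, and the function name are conventions chosen for this example rather than a library API; §12.15 builds a full model.

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + d), b has shape (4H,); the four
    row blocks are the forget, input, output and candidate maps, stacked in
    that order (a convention chosen for this sketch)."""
    H = h_prev.shape[-1]
    z = torch.cat([h_prev, x_t], dim=-1) @ W.T + b   # all four linear maps at once
    f = torch.sigmoid(z[..., 0:H])                   # forget gate
    i = torch.sigmoid(z[..., H:2 * H])               # input gate
    o = torch.sigmoid(z[..., 2 * H:3 * H])           # output gate
    g = torch.tanh(z[..., 3 * H:4 * H])              # candidate update
    c_t = f * c_prev + i * g                         # additive cell-state update
    h_t = o * torch.tanh(c_t)                        # gated read-out
    return h_t, c_t

H, d = 64, 32
W, b = 0.1 * torch.randn(4 * H, H + d), torch.zeros(4 * H)
h, c = lstm_cell_step(torch.randn(d), torch.zeros(H), torch.zeros(H), W, b)
print(W.numel() + b.numel())          # 4*H*(H+d) + 4*H = 24832 parameters
```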

Why this fixes vanishing gradients

The vanilla RNN's failure mode is structural. Backpropagating through time, $\partial \mathbf{h}_t / \partial \mathbf{h}_{t-1} = \mathrm{diag}(\phi') \mathbf{W}_{hh}$, and the gradient over $k$ steps becomes a product of $k$ such matrices. Unless the spectral radius of $\mathbf{W}_{hh}$ sits delicately near $1$, the product either decays geometrically to zero or blows up geometrically. There is no pressure during training to keep that spectral radius in the safe band; it floats wherever the loss surface pushes it.

The LSTM dissolves the problem by changing the path along which gradients travel. Take the derivative of the cell-state update with respect to the previous cell state, treating the gates and the candidate as functions of $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$ that do not depend on $\mathbf{c}_{t-1}$ directly:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathrm{diag}(\mathbf{f}_t).$$

This Jacobian is purely diagonal, with each entry bounded between $0$ and $1$. When the network learns $\mathbf{f}_{t,k} \approx 1$ for some dimension $k$, that dimension's cell-state contents pass through unchanged on the forward pass, and gradients flowing backwards through that dimension are also passed through unchanged. The cell state becomes a gradient highway: long-range error signals travel along it without compounding decay, even over hundreds of steps.

Equivalently, the LSTM has shifted a delicate optimisation problem into a routine learning problem. In a vanilla RNN, preserving long-range information requires the network to balance the eigenvalues of $\mathbf{W}_{hh}$ on a knife edge between vanishing and exploding. In an LSTM, the network simply learns to set the relevant forget-gate dimensions close to $1$, which any gradient-based optimiser can do. The hidden state $\mathbf{h}_t$ may still suffer from short-range vanishing of its own (its update path goes through tanh and the output gate) but the cell state carries the long-range information, and the hidden state only needs to carry whatever the cell decides to expose.
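
The contrast is easy to see numerically. The sketch below uses illustrative values only: it multiplies one hundred vanilla-RNN recurrent Jacobians and compares the result with one hundred forget-gate factors of $0.99$.

```python
import torch

torch.manual_seed(0)
k, H = 100, 16

# Vanilla RNN: the k-step factor is a product of k Jacobians, each roughly
# diag(tanh') @ W_hh. Ignoring tanh' <= 1 (which only shrinks it further),
# multiply an arbitrary, untuned W_hh a hundred times.
W_hh = 0.2 * torch.randn(H, H)
J_rnn = torch.eye(H)
for _ in range(k):
    J_rnn = J_rnn @ W_hh
print(J_rnn.norm())        # vanishes geometrically (or explodes, if the
                           # spectral radius happens to exceed 1)

# LSTM cell-state path: the k-step factor is a product of diag(f_t) terms.
f = torch.full((H,), 0.99)  # forget gate learnt to sit near 1
print(f.pow(k).mean())      # ~0.37 after 100 steps: slow, controlled decay
```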

A small but well-known engineering detail makes this work in practice. If the forget-gate bias $\mathbf{b}_f$ is initialised to zero, then at the start of training $\mathbf{f}_t \approx 0.5$, so the cell state is halved at every step and information from the distant past is aggressively destroyed before the network has had a chance to learn what to keep. Setting $\mathbf{b}_f$ to a small positive constant, $+1$ or $+2$ is typical, pushes initial forget-gate values to about $0.73$ or $0.88$ and lets the cell state persist long enough for gradients to reach back and teach the network what is worth remembering. Jozefowicz, Zaremba and Sutskever (2015) showed that this single hyperparameter substantially accelerates LSTM training on long-dependency tasks. Keras applies it by default (unit_forget_bias=True); PyTorch's nn.LSTM does not, so the forget-gate slice of the bias must be set by hand after construction, as in the sketch below. The original 1997 paper did not include the forget gate at all; it was added by Gers, Schmidhuber and Cummins in 2000 and is now considered part of the canonical form.
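
The sketch below assumes PyTorch's documented parameter layout, in which the four gates are stacked in the order input, forget, cell, output, and two bias vectors (bias_ih and bias_hh) are summed:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
H = lstm.hidden_size

# Gate parameters are stacked as (input, forget, cell, output) blocks of size H,
# and bias_ih + bias_hh are added, so set each forget slice to 0.5 for a total of +1.
with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if "bias" in name:
            bias[H:2 * H].fill_(0.5)
```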

Worked example

It pays to crank through one step by hand. Take a one-dimensional LSTM ($H = 1$) with a single scalar input. Suppose at the current step $\mathbf{x}_t = 1$, the previous hidden state $\mathbf{h}_{t-1} = 0$, and the previous cell state $\mathbf{c}_{t-1} = 0.5$. Imagine that, after applying their respective linear layers and sigmoids, the gates evaluate to

$$\mathbf{f}_t = 0.9, \qquad \mathbf{i}_t = 0.7, \qquad \mathbf{o}_t = 0.5,$$

and the candidate evaluates to $\mathbf{g}_t = 0.4$ (a tanh output, bounded in $(-1, 1)$).

The cell-state update gives

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t = 0.9 \cdot 0.5 + 0.7 \cdot 0.4 = 0.45 + 0.28 = 0.73.$$

Roughly two thirds of the new cell value comes from preserving the previous one and one third from writing in the candidate. The cell state has grown moderately: the previous contents are not erased, but they are augmented.

The hidden-state update gives

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) = 0.5 \cdot \tanh(0.73) = 0.5 \cdot 0.6231 = 0.3116.$$

Two things are worth noticing. First, the cell state is $0.73$ but the hidden state, what the next layer sees, is only $0.3116$. The output gate has held some of the information back. If the cell is storing, say, the parity of how many opening brackets have been seen so far, the network can choose to keep that state private until a closing bracket is encountered, at which point a different output-gate dimension can swing open and reveal it.

Second, consider what a vanilla RNN would have done in the same situation. A one-dimensional vanilla RNN computes $\mathbf{h}_t = \tanh(w_{hh} \mathbf{h}_{t-1} + w_{xh} \mathbf{x}_t)$, in which the previous hidden state and the new input are mixed at every step with no way to selectively preserve one without the other. The LSTM separates the two roles: the cell state holds whatever the network has chosen to remember, while the gates and candidate compute the response to the current input. This separation is the architectural payoff. Run the cell forwards for, say, fifty steps with $\mathbf{f}_t \approx 1$ and $\mathbf{i}_t \approx 0$ on the dimensions you want to preserve, and the cell state on those dimensions barely moves; gradients backpropagated through that interval are preserved in the same way.
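
Checking the arithmetic of the example takes a few lines of plain Python, using the values above:

```python
import math

f_t, i_t, o_t, g_t, c_prev = 0.9, 0.7, 0.5, 0.4, 0.5
c_t = f_t * c_prev + i_t * g_t        # 0.45 + 0.28 = 0.73
h_t = o_t * math.tanh(c_t)            # 0.5 * 0.6231 ≈ 0.3116
print(c_t, round(h_t, 4))
```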

Variants

Three variants of the LSTM are worth knowing because they appear in published papers and production code. The Gated Recurrent Unit (GRU), introduced by Cho and colleagues in 2014, merges the forget and input gates into a single update gate and folds the cell state into the hidden state, leaving just two gates and a candidate. It has roughly three quarters of an LSTM's parameters at the same width and trains slightly faster. Empirical comparisons (Greff et al. 2017; Jozefowicz et al. 2015) found GRU and LSTM essentially indistinguishable on most tasks, with the LSTM winning narrowly on the very longest dependencies and the GRU winning narrowly on small datasets. §12.9 covers the GRU equations in full.

The peephole LSTM (Gers and Schmidhuber 2000) adds connections from the cell state to each of the gates, so that, for example, the forget gate is computed as $\sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t, \mathbf{c}_{t-1}] + \mathbf{b}_f)$. The motivation is that the gates ought to be allowed to look at what is already in the cell when deciding what to do with it. In careful comparisons the peephole variant gives a small but consistent improvement on tasks with precise timing, such as music generation and counting tasks, and a negligible improvement on language modelling. Most production LSTM code does not use peepholes because the extra parameters and code complexity are not worth the marginal gain.

The bidirectional LSTM (BiLSTM) runs two independent LSTMs over the input sequence, one forward, one backward, and concatenates their hidden states at each step. The result is a representation of each token that is informed by the entire sequence on both sides. BiLSTMs were the workhorse encoders of pre-Transformer NLP: ELMo (Peters et al. 2018), the immediate predecessor of BERT, used a stacked BiLSTM, and almost every speech-recognition system between 2013 and 2018 used BiLSTM acoustic models. The cost is that BiLSTMs cannot be used for online, autoregressive generation; you need the whole input before you can compute the backward pass, so they are confined to encoder roles. Coupled-gate variants, in which $\mathbf{i}_t = 1 - \mathbf{f}_t$ is enforced (you can either keep or write but not both), are sometimes seen in compressed embedded models.
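
In PyTorch the bidirectional variant is a single constructor flag. The sketch below (layer sizes are illustrative) shows the concatenated per-step output:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM encoder: forward and backward hidden states are
# concatenated, so each time step's output has size 2 * hidden_size.
encoder = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                  batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 128)           # (batch, sequence length, features)
out, (h_n, c_n) = encoder(x)
print(out.shape)                      # torch.Size([8, 50, 512])
```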

Where LSTMs are used in 2026

Although Transformers dominate frontier research, LSTMs remain in active use in several niches. The first is time-series forecasting in industry: utilities, finance, retail demand, predictive maintenance. For univariate or low-dimensional multivariate signals with hundreds rather than millions of observations, an LSTM regressor often beats a Transformer simply because it has fewer parameters to overfit. The winner of the M4 forecasting competition (2018) was an LSTM hybrid, Smyl's ES-RNN, and recurrent hybrids remained strong contenders in later competitions.

The second niche is embedded and on-device speech. Apple's pre-2023 Siri wake-word detector, Google's "Hey Google" detector, and most low-power keyword-spotting models in earbuds and hearing aids are small LSTMs (often quantised to 8-bit integers) running on dedicated DSPs. They draw a fraction of a milliwatt and respond in tens of milliseconds. A Transformer of equivalent accuracy would need roughly four times the energy budget, which a coin-cell-powered earbud cannot afford. Whisper and other modern open speech models use Transformer encoders, but legacy production systems based on LSTM encoder + connectionist temporal classification (CTC) decoder are still widely deployed and have not been replaced because the replacement cost outweighs the modest accuracy gain.
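
As an illustration of the on-device recipe, the sketch below dynamically quantises a small, hypothetical keyword-spotting LSTM to 8-bit weights with PyTorch; the class name and layer sizes are invented for the example:

```python
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    """Illustrative keyword spotter: a small LSTM over log-mel audio features."""
    def __init__(self, n_mels=40, hidden=128, n_keywords=10):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_keywords)

    def forward(self, x):                 # x: (batch, frames, n_mels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # classify from the final frame

model = KeywordSpotter()
quantised = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)   # 8-bit integer weights
print(quantised(torch.randn(1, 100, 40)).shape)       # torch.Size([1, 10])
```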

The third niche is legacy systems. Many 2018–2021 production deep-learning systems were LSTMs, and a working system that meets its service-level objective is rarely rewritten. Translation engines, recommender ranking models, fraud-detection sequence classifiers and autocomplete systems often retain LSTM cores even when their owners have a Transformer alternative on the shelf. The economic argument is straightforward: a Transformer migration costs engineering hours, model-quality risk, and potentially additional inference compute, while the accuracy gain over a well-tuned LSTM is often a single percentage point or less.

A fourth, more recent niche is the state-space model revival. Architectures such as Mamba (Gu and Dao 2023) and the wider class of structured state-space models share the LSTM's recurrent, linear-in-time inference profile but use a different mathematical backbone: state transitions derived from a discretised continuous-time state-space model (diagonal-plus-low-rank in S4, diagonal and input-dependent in Mamba's selective scan). Mamba-2 (Dao and Gu 2024), Jamba (AI21, March 2024, a 52B-parameter MoE hybrid), Samba (Microsoft), and IBM Granite 4.0 (2025) are the production exemplars of the SSM and SSM-attention hybrid family. They are not LSTMs, but the underlying engineering goal, a linear-time recurrent model that does not pay the quadratic attention cost of a Transformer, is a direct descendant of the LSTM tradition. Reading the original LSTM paper alongside the Mamba paper makes the lineage explicit: both architectures isolate a long-running internal state from the noisy, non-linear computation that consumes inputs and produces outputs, and both rely on multiplicative gating to control what enters that state.

Finally, LSTMs remain a useful pedagogical baseline. Implementing one from scratch (concatenating $[\mathbf{h}_{t-1}, \mathbf{x}_t]$, computing four linear layers, applying three sigmoids and a tanh, taking the elementwise products and sums) is an excellent exercise that reinforces backpropagation through time, gating, and the difference between additive and multiplicative state updates. §12.15 builds a character-level LSTM in PyTorch for exactly this reason, and many introductory deep-learning courses still teach the LSTM before the Transformer because the gating intuitions transfer directly to modern attention-based architectures.

What you should take away

  1. The LSTM solves the vanishing-gradient problem of vanilla RNNs by adding an additive cell-state path, $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t$, along which information and gradients travel without compounding decay.
  2. Three sigmoid-activated gates (forget, input, output) read, write, and gate the visibility of the cell state. The gates are themselves computed from the previous hidden state and the current input, so the cell learns when to remember and when to forget.
  3. Initialising the forget-gate bias to a positive value (typically $+1$) is a small but important detail: it lets cell-state information persist long enough for the network to learn what is worth keeping.
  4. The LSTM dominated sequence modelling from roughly 1997 to 2017. The GRU (§12.9) is a simpler two-gate relative with comparable performance and fewer parameters; bidirectional LSTMs were the standard NLP encoder before BERT.
  5. In 2026 the LSTM is no longer the frontier choice for new research, but it remains a strong default for low-data time-series forecasting, on-device keyword spotting and other low-resource settings where parameter count and inference cost dominate accuracy.
