12.6 The recurrent neural network

A recurrent neural network, or RNN, is the simplest neural architecture designed for data that arrive in a sequence. Where a feed-forward network sees a fixed-size input all at once, an RNN reads its input one element at a time and carries information forward in a small running summary called the hidden state. You can think of the hidden state as the network's working memory: at every step it combines what it remembers with what it has just seen, then passes the updated memory to the next step. The architecture is built around one central recurrence,

$$\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}),$$

which says: take the previous memory $\mathbf{h}_{t-1}$, add the contribution of the current input $\mathbf{x}_t$, squash the result with $\tanh$, and call that the new memory. That is the entire model. Everything else is bookkeeping around this one update.
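
In code, the update really is a one-liner. The sketch below is illustrative NumPy; the weight matrices are assumed to be given and their shapes follow the definitions later in this section.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla-RNN update: combine the old memory with the current input."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)
```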

The compactness of the design (the same small set of weights reused at every step, a single non-linearity, a memory that grows in capacity rather than in parameters) is what made RNNs the dominant sequence model from the late 1990s until the rise of the Transformer in 2017. The weakness, that the same recurrence which makes them compact also makes them difficult to train on long sequences, is what eventually unseated them. §12.7 unpacks backpropagation through time and the vanishing-gradient pathology; §12.8 develops the LSTM.

Symbols Used Here
$\mathbf{x}_t$: input vector at time step $t$
$\mathbf{h}_t$: hidden state at time step $t$
$\mathbf{y}_t$: output (e.g. logits) at time step $t$
$\mathbf{W}_x$: input-to-hidden weight matrix
$\mathbf{W}_h$: hidden-to-hidden (recurrent) weight matrix
$\mathbf{W}_y$: hidden-to-output weight matrix
$\mathbf{b}$: hidden bias; $\mathbf{b}_y$: output bias
$T$: sequence length; $H$: hidden-state dimension; $d$: input dimension

The recurrent equation

Let the input sequence be $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ with each $\mathbf{x}_t \in \mathbb{R}^d$. Choose a hidden size $H$ and an initial state $\mathbf{h}_0 \in \mathbb{R}^H$, conventionally the zero vector or a learnt parameter. The forward pass of a vanilla (Elman) RNN is then nothing more than the recurrence

$$\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}), \qquad t = 1, 2, \ldots, T,$$

with $\mathbf{W}_x \in \mathbb{R}^{H \times d}$, $\mathbf{W}_h \in \mathbb{R}^{H \times H}$ and $\mathbf{b} \in \mathbb{R}^H$. The non-linearity is canonically $\tanh$ because it is bounded in $(-1, 1)$ and so keeps activations from drifting; ReLU works but is more prone to instability because it is unbounded above. From the hidden state we read off an output whenever we need one,

$$\mathbf{y}_t = \mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y,$$

and apply whatever final transformation the task requires: softmax over a vocabulary for language modelling, sigmoid for binary classification, or no transformation at all for regression.

The single most important property of this architecture is weight sharing across time. The matrices $\mathbf{W}_x$, $\mathbf{W}_h$ and $\mathbf{W}_y$ do not depend on $t$. The very same weights are applied at step 1, at step 17, and at step 17,000. This has three consequences. First, the parameter count is independent of sequence length, so an RNN can in principle be trained on sequences of one length and applied to sequences of another. Second, a pattern learnt at one position generalises automatically to other positions: an RNN that has learnt to recognise the word "however" near the start of a sentence will recognise it equally near the end. Third, computation is irreducibly sequential: $\mathbf{h}_{17}$ depends on $\mathbf{h}_{16}$, which depends on $\mathbf{h}_{15}$, and so on, so the forward pass cannot be parallelised across the time axis. This last property is exactly what the Transformer threw out, and explains why it can train so much faster on modern hardware (§12.16).
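
To make the weight sharing and the sequential dependence concrete, here is a minimal NumPy sketch of the forward pass. The shapes follow the definitions above; the random weights at the bottom are purely illustrative.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, b, b_y, h0=None):
    """Unrolled forward pass of a vanilla RNN.

    xs : array of shape (T, d), the input sequence.
    Returns the hidden states (T, H) and outputs (T, output_dim).
    """
    T, d = xs.shape
    H = W_h.shape[0]
    h = np.zeros(H) if h0 is None else h0
    hs, ys = [], []
    for t in range(T):                             # irreducibly sequential: h_t needs h_{t-1}
        h = np.tanh(W_h @ h + W_x @ xs[t] + b)     # the same W_h, W_x, b at every step
        hs.append(h)
        ys.append(W_y @ h + b_y)                   # read off an output whenever one is needed
    return np.array(hs), np.array(ys)

# Illustrative shapes only: d = 8 inputs, H = 16 hidden units, 3 output logits.
rng = np.random.default_rng(0)
d, H, out = 8, 16, 3
W_x = 0.1 * rng.standard_normal((H, d))
W_h = 0.1 * rng.standard_normal((H, H))
W_y = 0.1 * rng.standard_normal((out, H))
b, b_y = np.zeros(H), np.zeros(out)
hs, ys = rnn_forward(rng.standard_normal((5, d)), W_x, W_h, W_y, b, b_y)
```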

Worked numerical example

Numbers make the equations less abstract. Take a tiny RNN with input dimension $d = 4$ and hidden size $H = 2$, and feed it the three-token sentence "the cat sat". Suppose each word has been embedded as a 4-dimensional vector,

$$\mathbf{x}_1 = (1, 0, 0, 0)^\top, \quad \mathbf{x}_2 = (0, 1, 0, 0)^\top, \quad \mathbf{x}_3 = (0, 0, 1, 0)^\top,$$

so that the embeddings are simply one-hot for this illustration. Let

$$\mathbf{W}_x = \begin{pmatrix} 0.5 & -0.3 & 0.2 & 0.1 \\ 0.1 & 0.4 & -0.2 & 0.0 \end{pmatrix}, \quad \mathbf{W}_h = \begin{pmatrix} 0.2 & 0.1 \\ -0.1 & 0.3 \end{pmatrix}, \quad \mathbf{b} = (0, 0)^\top,$$

and initialise $\mathbf{h}_0 = (0, 0)^\top$.

Step 1. Because $\mathbf{h}_0 = \mathbf{0}$, the recurrent term vanishes, so $\mathbf{z}_1 = \mathbf{W}_x \mathbf{x}_1$. The first column of $\mathbf{W}_x$ is $(0.5, 0.1)^\top$, hence $\mathbf{z}_1 = (0.5, 0.1)^\top$. Applying $\tanh$ elementwise,

$$\mathbf{h}_1 = (\tanh 0.5, \tanh 0.1)^\top \approx (0.4621,\ 0.0997)^\top.$$

The network has now compressed everything it knows about the first word into two numbers.

Step 2. The second column of $\mathbf{W}_x$ is $(-0.3, 0.4)^\top$, so $\mathbf{W}_x \mathbf{x}_2 = (-0.3, 0.4)^\top$. The recurrent contribution is

$$\mathbf{W}_h \mathbf{h}_1 \approx \begin{pmatrix} 0.2 \cdot 0.4621 + 0.1 \cdot 0.0997 \\ -0.1 \cdot 0.4621 + 0.3 \cdot 0.0997 \end{pmatrix} \approx \begin{pmatrix} 0.1024 \\ -0.0163 \end{pmatrix}.$$

Adding gives $\mathbf{z}_2 \approx (-0.1976, 0.3837)^\top$ and so $\mathbf{h}_2 \approx (-0.1951, 0.3659)^\top$. Notice that $\mathbf{h}_2$ now reflects both words: the recurrent term carried information about "the" forwards, and the input term injected information about "cat".

Step 3. The third column of $\mathbf{W}_x$ is $(0.2, -0.2)^\top$. The recurrent term is

$$\mathbf{W}_h \mathbf{h}_2 \approx \begin{pmatrix} 0.2 \cdot (-0.1951) + 0.1 \cdot 0.3659 \\ -0.1 \cdot (-0.1951) + 0.3 \cdot 0.3659 \end{pmatrix} \approx \begin{pmatrix} -0.0024 \\ 0.1293 \end{pmatrix}.$$

So $\mathbf{z}_3 \approx (0.1976, -0.0707)^\top$ and $\mathbf{h}_3 \approx (0.1950, -0.0706)^\top$. This vector is the running summary of the entire three-token sentence; in a sentiment classifier we would feed $\mathbf{h}_3$ into a small head to predict positive or negative; in a language model we would have computed a softmax at every step instead, predicting the next word from each $\mathbf{h}_t$ in turn. Notice already that $\mathbf{h}_3$ is small in magnitude; the bounded $\tanh$ keeps the state from running away, but it also means the contribution of $\mathbf{h}_1$ has been shrunk twice. Repeat this 200 times and almost nothing of the original input survives.
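
The hand calculation is easy to check mechanically. The short NumPy sketch below reproduces it, using exactly the weights and one-hot inputs from the example.

```python
import numpy as np

W_x = np.array([[0.5, -0.3,  0.2, 0.1],
                [0.1,  0.4, -0.2, 0.0]])
W_h = np.array([[ 0.2, 0.1],
                [-0.1, 0.3]])
b = np.zeros(2)

xs = np.eye(4)[:3]                  # one-hot embeddings for "the", "cat", "sat"
h = np.zeros(2)                     # h_0
for t, x in enumerate(xs, start=1):
    h = np.tanh(W_h @ h + W_x @ x + b)
    print(f"h_{t} = {np.round(h, 4)}")
# Expected, to four decimal places:
# h_1 = (0.4621, 0.0997), h_2 = (-0.1951, 0.3659), h_3 = (0.1950, -0.0706)
```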

Variants

The same recurrence supports several distinct task shapes, distinguished by where inputs and outputs are placed along the time axis.

Many-to-one problems read in a sequence and produce a single output at the end. Sentiment classification is the standard example: feed a film review token by token and read off a positive/negative label from $\mathbf{h}_T$. Document classification, sequence-level fraud detection and patient-trajectory risk prediction all fit this template.

One-to-many problems take a fixed input and unroll a sequence from it. Image captioning is the canonical case: a CNN encodes the image into a single vector, that vector becomes $\mathbf{h}_0$ of an RNN decoder, and the decoder generates a caption word by word. Music generation conditioned on a style tag has the same shape.

Many-to-many with aligned lengths produces one output per input. Part-of-speech tagging, named-entity recognition and frame-level acoustic modelling all fit here: the network emits $\mathbf{y}_t$ at every step, supervised against a per-token label. Bidirectional RNNs (§12.10), which run a second recurrence backwards in time and concatenate the two hidden states, are the standard upgrade for this shape, because tag $t$ often depends on words that come after it.
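
As a structural sketch only (the weight names are made up here, and a real implementation would batch this and train it end to end), a bidirectional pass runs the same kind of recurrence in both directions and concatenates the two states at each position:

```python
import numpy as np

def bidirectional_states(xs, W_x_f, W_h_f, b_f, W_x_b, W_h_b, b_b):
    """Run one recurrence forwards and one backwards, concatenate per step."""
    def run(seq, W_x, W_h, b):
        h, out = np.zeros(W_h.shape[0]), []
        for x in seq:
            h = np.tanh(W_h @ h + W_x @ x + b)
            out.append(h)
        return out
    fwd = run(xs, W_x_f, W_h_f, b_f)                 # left-to-right states
    bwd = run(xs[::-1], W_x_b, W_h_b, b_b)[::-1]     # right-to-left, re-aligned to position t
    return [np.concatenate([f, bk]) for f, bk in zip(fwd, bwd)]
```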

Many-to-many with unaligned lengths is the seq2seq problem, where input and output sequences have different lengths and word orders. Machine translation is the prototype. The standard solution is the encoder-decoder: an encoder RNN reads the source and compresses it into a context vector (often $\mathbf{h}_T$); a decoder RNN, initialised from that context, generates the target one token at a time, feeding each emitted token back as the next input. This basic encoder-decoder is the architecture in which attention (§12.12) was first introduced, precisely to relieve the bottleneck of squeezing a whole sentence through a single hidden state.
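
A structural sketch of that loop, with greedy decoding and no attention; the token ids, embedding table and weight names are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def encode(src_embeds, W_x, W_h, b):
    """Encoder RNN: compress the source sequence into its final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in src_embeds:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h                                        # the context vector (here simply h_T)

def decode_greedy(context, embed, W_x, W_h, W_y, b, b_y, bos_id, eos_id, max_len=20):
    """Decoder RNN: start from the context, feed each emitted token back as the next input."""
    h, token, out = context, bos_id, []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ embed[token] + b)
        token = int(np.argmax(W_y @ h + b_y))       # greedy choice of the next token
        if token == eos_id:
            break
        out.append(token)
    return out
```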

These four shapes are not exclusive to recurrent networks (Transformers handle all of them too) but they are easiest to introduce here because the RNN's per-step structure makes the input-output alignment explicit.

Limitations

The vanilla RNN looks tidy on paper but is notoriously hard to train in practice, and the cause is the very recurrence that makes it compact. Backpropagating the loss through $T$ time steps multiplies together $T$ Jacobians of the form $\mathrm{diag}(1 - \tanh^2 \mathbf{z}_t)\, \mathbf{W}_h$. The $\tanh$ derivative is at most 1, the diagonal entries fall well below 1 once the units saturate, and $\mathbf{W}_h$ has whatever spectral radius the optimiser has settled on. If the norm of each step's Jacobian sits below 1, the product, and with it the gradient, shrinks geometrically with $T$ and the network cannot learn dependencies more than a handful of steps long; if it sits above 1, the gradient explodes and the optimiser diverges. §9.11 gave the formal analysis; §12.7 specialises it to RNNs and shows why even a moderately deep unrolled network, say 50 steps, typically fails to propagate useful gradient information from the loss back to the early hidden states.
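
A few lines of NumPy make the shrinkage visible. The weights below are random and modestly scaled, and the inputs are stand-ins, so this is an illustration of the typical behaviour rather than a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50
W_h = 0.5 * rng.standard_normal((H, H)) / np.sqrt(H)   # modest recurrent weights
h = rng.standard_normal(H)

grad = np.eye(H)                                  # will accumulate d h_T / d h_0
norms = []
for _ in range(T):
    z = W_h @ h + rng.standard_normal(H)          # stand-in pre-activation with a random input
    h = np.tanh(z)
    J = np.diag(1 - h**2) @ W_h                   # one step's Jacobian, d h_t / d h_{t-1}
    grad = J @ grad                               # left-multiply: J_T ... J_2 J_1
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[T // 2], norms[-1])         # typically decays by many orders of magnitude
```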

Empirically, vanilla RNNs in the 1990s could reliably learn dependencies of length 5 to 10 and not much more. Two architectural fixes changed this. The Long Short-Term Memory unit of Hochreiter and Schmidhuber (1997), introduced in §12.8, replaces the additive $\tanh$ recurrence with a gated additive carry path along which the gradient can flow without repeated multiplication. The Gated Recurrent Unit of Cho et al. (2014), §12.9, simplifies the LSTM to two gates while keeping most of its benefit. Both share the central idea that the gradient should travel along an unobstructed channel, with multiplicative gates deciding what enters and leaves it. Other tricks (gradient clipping, careful identity or orthogonal initialisation of $\mathbf{W}_h$, layer normalisation) help at the margins but do not change the fundamental geometry.
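
Gradient clipping, the first of those tricks, is usually a one-liner in a deep-learning framework; a plain NumPy version of clipping by global norm (the threshold value here is arbitrary) looks like this:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]
```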

Where vanilla RNNs are used in 2026

Almost nowhere. The vanilla Elman cell as described above survives mostly as a teaching device and as a baseline in benchmark papers. LSTMs and GRUs persist in production niches where their sequential, low-latency, low-memory profile is genuinely useful: small on-device wake-word detectors, certain real-time audio enhancement and noise-suppression pipelines, and embedded controllers where a Transformer's quadratic attention is unaffordable. The state-space models (S4, Mamba) that appeared from 2021 onwards are recurrent in spirit and have begun to reclaim some of this territory with much better scaling. For everything large (language modelling, machine translation, speech recognition, vision-language modelling) the Transformer (§12.16, Chapter 13) has replaced the RNN entirely.

What you should take away

  1. An RNN processes a sequence one element at a time, maintaining a hidden state $\mathbf{h}_t$ that summarises everything seen so far through the recurrence $\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})$.
  2. The same weights are reused at every step. Parameter count is independent of sequence length, but the forward pass is irreducibly sequential.
  3. The four task shapes (many-to-one, one-to-many, many-to-many aligned, many-to-many unaligned) cover most sequence problems and map naturally onto the same recurrent core.
  4. Training is by backpropagation through time, and the resulting product of step-Jacobians is the source of the vanishing- and exploding-gradient pathologies that limit vanilla RNNs to short dependencies.
  5. In modern practice, vanilla RNNs are a pedagogical stepping stone. LSTMs and GRUs (§12.8–12.9) fix the gradient geometry with gates; Transformers (§12.16) replace recurrence altogether for almost all large-scale work.
