12.1 Why sequences are different

Many of the problems we most want to solve with machine learning are not about isolated points but about ordered streams. The words in a sentence arrive one after another and only make sense in the order they were spoken or written. The frames in a film, the samples in an audio waveform, the daily prices of a share, the keystrokes in a typed paragraph, the heart-rate readings in an ICU monitor, the mutations along a strand of DNA: all of these are sequences, and the order is part of the data, not a nuisance to be discarded. A model that ignores order will at best be reading the world through a bag-of-tokens haze; at worst it will confuse "the surgeon woke the patient" with "the patient woke the surgeon" and act on the wrong one.

This chapter is about the recurrent family of architectures that carried sequence modelling from the late 1980s up to the transformer revolution of 2017. We will work through the vanilla recurrent neural network (RNN), the long short-term memory (LSTM) and gated recurrent unit (GRU) cells that solved its long-range memory problem, the encoder-decoder seq2seq pattern that powered the first generation of neural machine translation, and the attention mechanism that bolted onto seq2seq and ultimately rendered the recurrence itself optional. Chapter 13 will then show how attention, freed from the recurrent backbone, becomes the transformer.

Before any architecture, though, it is worth being precise about what makes sequence data structurally different from the fixed-length feature vectors of Chapter 5 and the regular grids of Chapter 11.

Bridge from §11

Chapter 11 dealt with convolutional networks: an architecture whose inductive biases (local receptive fields, weight sharing across spatial position, hierarchical pooling) matched the structure of images. Sequences need an analogous bias, but along the time axis rather than the spatial axes, and with two extra wrinkles: the axis can be arbitrarily long, and unlike a photograph the data arrive one element at a time during inference. A CNN sees the whole image at once; a sequence model often does not have that luxury. The recurrent network is the architecture you get if you ask "what does a CNN look like if I am only allowed to see one column of pixels per step?". The transformer of Chapter 13 is what you get if you ask the opposite question: "what if I am allowed to see all columns at once but without the convolutional locality constraint?". Both descend from the same need: model the order, share the weights, scale to any length.

Symbols Used Here
  • $\mathbf{x}_t$: input at time $t$
  • $\mathbf{h}_t$: hidden state at time $t$
  • $T$: sequence length
  • $\mathbf{W}_h$: recurrent weight matrix

Why sequences are different

Three properties set sequence data apart from the tabular and image inputs we have met so far, and each forces a corresponding architectural choice.

Order matters. "Dog bites man" and "man bites dog" are composed of the same three tokens but report opposite events. A model that pools over position (averaging word embeddings, say) cannot tell them apart. The same is true in audio: the syllables /pa/ and /ap/ are time-reverses of one another, and a position-blind model would conflate them. The architecture must therefore encode the index $t$ as well as the value $x_t$. A recurrent network does this implicitly: each token enters a hidden state that has already been shaped by everything before it, so the same word read at position three lands in a different hidden state from the same word read at position seventeen.

Variable length. Tweets have eight tokens; clinical discharge summaries have eight thousand; a Shakespeare play has tens of thousands. We cannot fix a maximum length without either truncating real inputs or padding short ones with millions of useless zeros. A respectable sequence model must have a parameter count that does not grow with $T$. Recurrence delivers this for free: one set of weights, applied $T$ times. (Transformers manage the same trick by sharing one set of attention weights across all positions.)

Long-range dependencies. Linguistic structure routinely puts the items that need to be linked far apart. In the centre-embedded sentence "The cat that the dog from down the road chased yesterday is sleeping", the verb "is" agrees in number with "cat", ten tokens earlier. In a film, the identity of a person revealed in the closing scene depends on a glance from the opening reel. In a heart trace, an arrhythmia might recur every few hundred beats. A model whose effective memory is only a handful of steps long will fail on all of these. This is the property RNNs find hardest to satisfy in practice, and the entire LSTM/GRU/attention story is a response to it.

It helps to keep four canonical sequence problems in mind throughout, since each places different demands on the architecture.

  • Sequence classification (many-to-one) consumes a variable-length input and emits a single label: sentiment of a review, ICD-10 code for a clinical note, speaker identity from an audio clip.
  • Sequence labelling (many-to-many, aligned) emits one label per input element: part-of-speech tags, named-entity tags, phoneme-per-frame in speech.
  • Generation (one-to-many) reads a fixed prompt and rolls out a variable-length output autoregressively, each token conditioned on its predecessors: image captioning, story continuation, music generation.
  • Sequence transduction (many-to-many, unaligned) maps one variable-length sequence to another of unrelated length: machine translation, speech-to-text, code synthesis.

The same primitives (recurrence, gating, attention) handle all four; the loss and decoding strategy are what change.

These three demands (order, variability, long range) are sufficient to rule out every architecture from the previous chapters. An MLP needs a fixed input size. A CNN can be made translation-equivariant along time, but its receptive field grows only linearly with depth, so bridging even a ten-token dependency takes a stack of layers and bridging five-hundred-token dependencies is hopeless at any practical depth. Bag-of-words ignores order by construction. Sequence-specific machinery is therefore not a luxury but a necessity.

A naive attempt to bridge the gap is to flatten a fixed window of $k$ previous elements and treat the result as a single $k \cdot d$-dimensional vector. This is exactly what an n-gram language model (§12.3) and a Bengio-style neural language model (§12.4) do. It works, after a fashion, but it cannot generalise across position (the weight tying that recurrence buys you for free has to be hand-coded), and the window size $k$ becomes a hard ceiling on dependency length.
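A minimal sketch of this window-and-flatten idea is shown below; all sizes and the single hidden layer are illustrative assumptions, not the §12.4 model itself.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; nothing here is prescribed by the text.
vocab_size, d, k, hidden = 10_000, 64, 4, 128

embed = nn.Embedding(vocab_size, d)
mlp = nn.Sequential(
    nn.Linear(k * d, hidden),   # the flattened window is just one long feature vector
    nn.Tanh(),
    nn.Linear(hidden, vocab_size),
)

window = torch.randint(0, vocab_size, (1, k))   # ids of the k previous tokens
flat = embed(window).reshape(1, k * d)          # flatten: position is baked into the feature index
logits = mlp(flat)                              # scores for the next token
```

Each of the $k$ positions owns its own slice of the flattened vector and therefore its own slice of the first weight matrix, so a pattern learned at position one has to be relearned at position three; recurrence is what removes that redundancy.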

The recurrent neural network, developed in §12.6, dissolves both problems in one stroke: process the sequence one element at a time and feed the hidden state back as an input at the next step. The same weights apply at every step, the architecture handles any length, and information from arbitrarily far back can in principle persist in the hidden state.

The RNN

A vanilla recurrent network is described by a single update rule applied repeatedly:

$$\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}),$$

with an output $\mathbf{y}_t = \mathbf{W}_y \mathbf{h}_t + \mathbf{c}$ at each step (or only at the last step, depending on the task). The state $\mathbf{h}_t$ summarises everything seen so far in a fixed-dimensional vector, and the same matrices $\mathbf{W}_h$, $\mathbf{W}_x$ act at every step. Reading proceeds left-to-right: $\mathbf{h}_0$ is initialised to zero, $\mathbf{x}_1$ is consumed, $\mathbf{h}_1$ is computed, then $\mathbf{x}_2$ enters and so on until $\mathbf{x}_T$.

The hidden state plays the role of memory. Anything from earlier in the sequence can in principle still influence the prediction at step $t$, provided it survived the chain of $\tanh$ contractions in between. The output head is identical to the head of an MLP (a linear projection plus a softmax for classification, or a linear projection alone for regression), and the same loss functions and training tricks of Chapter 9 apply.
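A bare forward pass makes the weight sharing explicit. The sketch below is illustrative (the tensor sizes and variable names are assumptions, not anything fixed by the text) and applies the same three matrices at every step:

```python
import torch

# Illustrative sizes; nothing here is prescribed by the text.
d_in, d_hidden, d_out, T = 32, 64, 10, 100

W_x = 0.1 * torch.randn(d_hidden, d_in)      # input-to-hidden
W_h = 0.1 * torch.randn(d_hidden, d_hidden)  # hidden-to-hidden, reused at every step
W_y = 0.1 * torch.randn(d_out, d_hidden)     # hidden-to-output
b, c = torch.zeros(d_hidden), torch.zeros(d_out)

x = torch.randn(T, d_in)          # one sequence, one row per time step
h = torch.zeros(d_hidden)         # h_0
ys = []
for t in range(T):
    h = torch.tanh(W_h @ h + W_x @ x[t] + b)   # the single update rule
    ys.append(W_y @ h + c)                     # y_t; keep only ys[-1] for many-to-one tasks
```

The parameter count is the same whether $T$ is ten or ten thousand; in practice one would reach for torch.nn.RNN, which wraps exactly this loop with batching.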

Why vanilla RNNs struggle

Training a recurrent network means backpropagating through time (§12.7). The gradient of the loss at step $T$ with respect to the hidden state at some earlier step $t$ takes the form of a product of Jacobians, one per time step in between, each of which is governed by $\mathbf{W}_h$ and the derivative of the $\tanh$. If the dominant singular value of those Jacobians is consistently less than one, their product shrinks geometrically and the gradient vanishes before it reaches step $t$; the network simply cannot learn that the loss at $T$ depends on $\mathbf{x}_t$. If the dominant singular value is consistently greater than one, the product explodes and the gradient becomes a wall of NaNs.
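Concretely, writing $\mathcal{L}_T$ for the loss at step $T$ and expanding with the chain rule through the $\tanh$ update above,

$$\frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_t} = \frac{\partial \mathcal{L}_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}, \qquad \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} = \operatorname{diag}\!\bigl(1 - \mathbf{h}_k \odot \mathbf{h}_k\bigr)\,\mathbf{W}_h,$$

and it is the product of the $T - t$ Jacobians, each containing the same $\mathbf{W}_h$, whose singular values drive the shrinking or blowing up just described ($\odot$ is the elementwise product).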

This is the same vanishing/exploding gradient problem we met in deep MLPs (§9.11), but in time rather than depth, and it bites much harder because the same matrix $\mathbf{W}_h$ is multiplied with itself at every step. There is no analogue of skip connections one can drop in. Empirically, vanilla RNNs trained with naive SGD struggle to learn dependencies more than ten or twenty steps long, which rules them out for almost everything interesting in natural language. §12.7 covers both the analysis and the practical mitigations: gradient clipping, careful initialisation, orthogonal recurrent matrices.
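As a taste of the simplest of those mitigations, the sketch below applies global-norm gradient clipping between the backward pass and the optimiser step; the model, data and the ceiling of 1.0 are all illustrative choices, not recommendations from the text.

```python
import torch
import torch.nn as nn

# A tiny illustrative setup; sizes, data and the clipping ceiling are placeholders.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(4, 200, 8)                     # batch of 4 sequences, 200 steps each
target = torch.randn(4, 1)

out, _ = rnn(x)                                # out: (4, 200, 16)
loss = nn.functional.mse_loss(head(out[:, -1]), target)
loss.backward()                                # gradients flow back through all 200 steps

torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # rescale if the global norm exceeds 1.0
opt.step()
opt.zero_grad()
```

Clipping tames exploding gradients; it does nothing for vanishing ones, which is why the structural fixes of the next section are needed.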

LSTM and GRU

The long short-term memory cell (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (Cho et al., 2014) are the engineering response to the vanishing-gradient problem. Both wrap the recurrent step in a set of gates (small sigmoidal sub-networks) that decide what to keep, what to overwrite and what to expose, giving gradients a protected path along which to flow.

The LSTM augments the hidden state $\mathbf{h}_t$ with a separate cell state $\mathbf{c}_t$ and three gates:

  • the forget gate $\mathbf{f}_t$ chooses which components of $\mathbf{c}_{t-1}$ to erase;
  • the input gate $\mathbf{i}_t$ chooses which components of a candidate update $\tilde{\mathbf{c}}_t$ to write into the cell;
  • the output gate $\mathbf{o}_t$ chooses which components of the new cell state to expose as $\mathbf{h}_t$.

The cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ is purely additive, with no $\tanh$ in the recurrence path. This is the constant-error carousel: when the forget gate sits near one and the input gate near zero, information rides along the cell state untouched, and gradients flow back through it without geometric decay. Long-range learning becomes possible.
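For reference, one common parameterisation of the full step is given below, writing $\mathbf{W}_{x\ast}$ and $\mathbf{W}_{h\ast}$ for each block's input and recurrent weights, $\sigma$ for the logistic sigmoid and $\odot$ for the elementwise product; §12.8 develops it properly.

$$
\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{W}_{xf}\mathbf{x}_t + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f), &\qquad
\mathbf{i}_t &= \sigma(\mathbf{W}_{xi}\mathbf{x}_t + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i), \\
\mathbf{o}_t &= \sigma(\mathbf{W}_{xo}\mathbf{x}_t + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o), &\qquad
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_{xc}\mathbf{x}_t + \mathbf{W}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, &\qquad
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t),
\end{aligned}
$$

where the only multiplicative interaction on the path from $\mathbf{c}_{t-1}$ to $\mathbf{c}_t$ is the forget gate itself.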

The GRU (§12.9) compresses the same idea into two gates: a reset gate that decides how much of the previous state to carry into the candidate update, and an update gate that interpolates between the old state and the new candidate. With one state vector instead of two and three weight blocks instead of four, it has roughly three-quarters of the parameters of an LSTM at matched hidden width and often matches its performance, particularly on moderate-length problems. It is a natural default when memory or compute is tight.
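The parameter comparison is easy to verify directly. A throwaway check using the built-in PyTorch cells (the sizes are arbitrary):

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

d_in, d_hidden = 128, 256          # illustrative sizes
lstm = nn.LSTM(d_in, d_hidden)     # 4 weight blocks: forget, input, output gates + candidate
gru = nn.GRU(d_in, d_hidden)       # 3 weight blocks: reset, update gates + candidate

print(n_params(lstm), n_params(gru), n_params(gru) / n_params(lstm))
# the ratio comes out at roughly 0.75 for any matched width
```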

In practice, between roughly 2014 and 2018, LSTM and GRU were the recurrent architectures of choice for almost everything: language models, machine translation, speech recognition, music generation, time-series forecasting. Most of the headline results that preceded the transformer (Google Translate's 2016 production system, DeepSpeech 2, Show-and-Tell captioning, Char-RNN) were stacks of one or the other.

Encoder-decoder seq2seq

Sutskever, Vinyals and Le (2014) and Cho et al. (2014) proposed a clean recipe for transduction, mapping one variable-length sequence to a different variable-length sequence, using two RNNs in series. An encoder RNN reads the source sequence (a French sentence, say) and compresses it into its final hidden state, often called the thought vector or context vector. A decoder RNN is then initialised from that vector and unrolls autoregressively to produce the target sequence (the English translation), with each emitted token fed back as the input at the next step.

This was a notable architectural idea (a single fixed-length vector as the bottleneck between two languages) and it worked surprisingly well at short lengths. But the bottleneck is also the weakness: forcing a fifty-word sentence through a single 1024-dimensional vector loses information, and translation quality fell off sharply as source length grew. The fix arrived almost simultaneously: Bahdanau, Cho and Bengio (2014) introduced attention. Rather than collapse the encoder's history into one vector, the decoder is allowed to attend, at each step, to all of the encoder's hidden states, weighted by a learned alignment score. The decoder no longer carries the burden of remembering everything; it can look back at the source whenever it needs to. Attention is covered in detail in §12.12 and is the conceptual seed of the entire transformer architecture of Chapter 13.
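In the Bahdanau-style additive form, writing $\mathbf{h}_1, \dots, \mathbf{h}_S$ for the encoder's hidden states, $\mathbf{s}_{t-1}$ for the decoder's state and $\mathbf{z}_t$ for the resulting context vector (renamed here to avoid clashing with the LSTM cell state), the decoder computes

$$
e_{t,s} = \mathbf{v}^\top \tanh\!\left(\mathbf{W}_a \mathbf{s}_{t-1} + \mathbf{U}_a \mathbf{h}_s\right), \qquad
\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'=1}^{S} \exp(e_{t,s'})}, \qquad
\mathbf{z}_t = \sum_{s=1}^{S} \alpha_{t,s}\, \mathbf{h}_s,
$$

with $\mathbf{v}$, $\mathbf{W}_a$ and $\mathbf{U}_a$ learned. The alignment weights $\alpha_{t,s}$ are recomputed at every decoder step and $\mathbf{z}_t$ is fed to the decoder alongside the previous output token, so no single vector has to carry the whole source sentence.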

What this chapter covers

The remainder of the chapter develops these ideas in order. §12.2 sets up the language-modelling problem that has been the workhorse benchmark for sequence models since the 1980s. §12.3 covers classical n-gram models, both as baselines and to motivate distributed representations. §12.4 and §12.5 introduce word embeddings and tokenisation. §12.6 derives the vanilla RNN. §12.7 analyses why it struggles to learn long-range dependencies. §12.8 and §12.9 develop the LSTM and GRU. §12.10 covers bidirectional and stacked variants. §12.11 builds the encoder-decoder seq2seq, and §12.12 adds attention. §12.13 introduces CTC for unaligned labelling, §12.14 covers beam search and other decoding strategies, §12.15 walks through a working character-level LSTM in PyTorch, and §12.16 closes by explaining why transformers ultimately displaced recurrence for almost all sequence tasks.

What you should take away

  1. Sequence data have three properties (order matters, length varies, dependencies can be long-range) that no fixed-input architecture handles cleanly.
  2. The recurrent neural network's single trick is weight sharing across time: one matrix applied at every step, hidden state carrying memory forward, parameter count independent of $T$.
  3. Vanilla RNNs train poorly because the gradient through time is a product of Jacobians and either vanishes or explodes; long-range learning needs a structural fix, not a hyperparameter tweak.
  4. LSTM and GRU cells supply that fix by gating an additive cell-state update, the constant-error carousel, so gradients flow back through time without geometric decay.
  5. Encoder-decoder seq2seq with attention was the last great recurrent architecture and the conceptual ancestor of the transformer; the attention mechanism it introduced is what made recurrence itself dispensable.
