12.12 Attention
Sequence-to-sequence models are the natural answer to a problem the recurrent networks of the previous sections cannot, on their own, address: how do we map an input sequence of one length to an output sequence of another length, where neither the source nor the target length is known in advance? Machine translation is the canonical example. The English sentence "the cat sat" has three tokens; its French translation "le chat s'assit" has, depending on tokenisation, three or four. Speech recognition is another: a one-second audio clip might contain forty acoustic frames but only three phonemes. Summarisation collapses a thousand words into fifty. In each case, we need an architecture in which the decoder can produce as many tokens as the target requires, without being chained to the input length.
The encoder-decoder pattern proposed by Sutskever, Vinyals and Le (2014) and by Cho and colleagues that same year answered the question with a clean separation of duties. An encoder RNN reads the source sequence end to end, and its final hidden state is taken as a fixed-length summary of the whole input. A decoder RNN is then initialised from that summary and generates the target sequence one token at a time, with each token fed back in as the input at the next step until a special end-of-sequence symbol is produced. The whole system is differentiable and trains end to end with cross-entropy loss against the gold target.
The Bahdanau, Cho and Bengio paper (2014, published 2015) then made the single most consequential change of the era: rather than collapse the encoder into one vector, let the decoder attend back to all the encoder states, and let it choose, at each output step, which of them to listen to. That single move broke the bottleneck and is the conceptual ancestor of the Transformer covered in Chapter 13. This section motivates the encoder-decoder model, derives Bahdanau's additive attention from first principles, walks through a translation example, sketches why the idea generalised so cleanly to attention-only architectures, and finally contrasts it with the alignment-free CTC approach used widely in speech.
The chain through Chapter 12 is now visible: §12.5 to §12.10 gave us recurrence and the gradient flow that BPTT carries through it; §12.11 stacked two RNNs into an encoder-decoder; §12.12 (this section) frees the decoder from a single bottleneck vector; §12.13 (CTC) handles the special case where source and target are monotonically aligned but unequal in length. Chapter 13 then asks the obvious question: if attention does the heavy lifting, do we need the recurrence at all? It answers no.
Encoder-decoder with RNNs
In the vanilla encoder-decoder, the encoder is an RNN, typically a GRU or LSTM, often bidirectional, that reads the source sequence one token at a time and accumulates a hidden state $\mathbf{h}_t \in \mathbb{R}^H$ at each position. After consuming the final token $\mathbf{x}_T$, the encoder's last hidden state $\mathbf{h}_T$ is treated as a thought vector or context summary $\mathbf{c}$ that, in principle, packages everything the model has decided is worth remembering about the source.
The decoder is a second RNN whose initial hidden state $\mathbf{s}_0$ is set to that summary. At each step $j$ the decoder consumes the previous output token $\mathbf{y}_{j-1}$ as input, updates its hidden state $\mathbf{s}_j$, and projects $\mathbf{s}_j$ through a linear-then-softmax head to a distribution over the target vocabulary, from which $\mathbf{y}_j$ is sampled or argmax-decoded. Generation stops when the special end-of-sequence token is produced, or when a maximum length is reached. Training uses teacher forcing: at training time the decoder receives the gold previous token rather than its own previous prediction, so the gradients propagate cleanly through cross-entropy at every output position.
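To make the data flow concrete, here is a minimal sketch of such an encoder-decoder in PyTorch, trained with teacher forcing. It is illustrative only: the `Seq2Seq` class, the GRU choice, the vocabulary sizes and dimensions are assumptions for the sketch, not a reference implementation of the systems cited above.

```python
# Minimal vanilla encoder-decoder sketch (illustrative hyperparameters and names).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode: the final hidden state is the fixed-length summary c.
        _, c = self.encoder(self.src_emb(src))            # c: (1, batch, hidden)
        # Decode with teacher forcing: gold previous tokens are the inputs.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), c)
        return self.head(dec_out)                          # (batch, tgt_len, tgt_vocab)

# Toy usage: 2 source sentences of length 5, gold targets of length 4.
model = Seq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 5))
tgt_in = torch.randint(0, 120, (2, 4))     # <bos> plus gold tokens, shifted right
tgt_out = torch.randint(0, 120, (2, 4))    # gold tokens plus <eos>
logits = model(src, tgt_in)
loss = nn.functional.cross_entropy(logits.reshape(-1, 120), tgt_out.reshape(-1))
loss.backward()                             # end-to-end gradient through both RNNs
```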
The architecture is clean on paper, and Sutskever and colleagues showed it could match the best statistical machine translation systems of 2014 on certain WMT benchmarks. But an uncomfortable observation followed quickly: performance fell off a cliff on long sentences. A fifty-word German sentence translated through a vanilla encoder-decoder routinely lost names, mistranslated middle clauses, or omitted whole phrases. The reason is structural. Every fact about the source (sixty tokens, perhaps, of subjects, objects, modifiers, tenses, nested clauses) has to be packed through one fixed-size $H$-dimensional vector before any decoding begins. With $H = 1000$ and a sentence of length sixty, the encoder must compress roughly sixty token embeddings of similar dimension into a single vector that the decoder can later interrogate. There is no information-theoretic miracle here; the bottleneck simply throws away whatever does not fit.
A second symptom is that the decoder can only access the source through $\mathbf{s}_{j-1}$, which is itself a bottlenecked summary. To recover, say, the spelling of a proper noun introduced at position 3 of the source while emitting target token 47, the relevant bits of $\mathbf{h}_3$ have to survive forty-four further encoder steps and the projection into $\mathbf{c}$, and forty-seven decoder steps after that. The vanishing gradient problem returns through the back door: even if the LSTM cell handles within-encoder gradient flow, the decoder cannot obtain a clean gradient back to early encoder positions through a single fixed vector. Cho's group quantified the effect by plotting BLEU against sentence length and showing the curve sloping firmly downwards as sentences grew. The ceiling was the vector, not the model.
Bahdanau attention
The Bahdanau, Cho and Bengio fix is, in retrospect, almost obvious, but only in retrospect. Run a bidirectional encoder over the source sentence to obtain hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_T$ (we drop the "enc" superscript for clarity). The decoder maintains its own hidden state $\mathbf{s}_t$. At each decoder step $t$, instead of conditioning on a single fixed context, condition on a dynamic context $\mathbf{c}_t$ that is recomputed for every target position from all of the encoder states.
For each encoder position $j = 1, \ldots, T$, compute an unnormalised alignment score $e_{tj}$ measuring how relevant encoder state $\mathbf{h}_j$ is to decoder step $t$. Bahdanau's "additive" form is
$$e_{tj} = v^\top \tanh(W_a \mathbf{s}_{t-1} + U_a \mathbf{h}_j),$$
with learnable parameters $v \in \mathbb{R}^H$, $W_a \in \mathbb{R}^{H \times H}$ and $U_a \in \mathbb{R}^{H \times 2H}$ (the encoder is bidirectional, so $\mathbf{h}_j \in \mathbb{R}^{2H}$). The form is a one-hidden-layer feedforward network, which is why it is called additive: it adds the projected decoder and encoder vectors before passing them through a non-linearity. Normalise the scores into attention weights with softmax,
$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}, \qquad \sum_j \alpha_{tj} = 1.$$
Form the context as a weighted sum of encoder states,
$$\mathbf{c}_t = \sum_{j=1}^{T} \alpha_{tj} \, \mathbf{h}_j.$$
Concatenate $\mathbf{c}_t$ with $\mathbf{s}_{t-1}$ (and possibly with $\mathbf{y}_{t-1}$) to drive the decoder forward and produce the next-token logits. Every step in this pipeline is differentiable in all parameters, including the alignment scores, since softmax is smooth, so the entire system trains end to end by ordinary backpropagation through time on the cross-entropy loss.
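The whole attention step fits in a few lines. The NumPy sketch below computes one decoder step of the additive attention just derived; the dimensions and the random parameters stand in for learnt weights and are purely illustrative.

```python
# One decoder step of additive (Bahdanau-style) attention, illustrative dimensions.
import numpy as np

rng = np.random.default_rng(0)
T, H = 3, 4                               # source length, hidden size
h = rng.standard_normal((T, 2 * H))       # bidirectional encoder states h_1..h_T
s_prev = rng.standard_normal(H)           # previous decoder state s_{t-1}

W_a = rng.standard_normal((H, H))         # learnt in a real model
U_a = rng.standard_normal((H, 2 * H))
v = rng.standard_normal(H)

# Alignment scores e_{tj} = v^T tanh(W_a s_{t-1} + U_a h_j), one per source position.
e = np.tanh(s_prev @ W_a.T + h @ U_a.T) @ v        # shape (T,)

# Softmax into attention weights, then the context as a weighted sum of encoder states.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                                # sums to 1 over source positions
c_t = alpha @ h                                     # shape (2H,)

print(alpha, alpha.sum(), c_t.shape)
```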
Two things are worth dwelling on. First, the attention weights $\alpha_{t,:}$ form a probability distribution over encoder positions; they are soft alignments between target step $t$ and source positions. They tell us, and the network, which source words the decoder is "looking at" when producing the $t$-th target word. For translation, the attention map typically tracks the diagonal, with reorderings that reflect word-order differences between source and target languages, and resembles the kinds of word alignments that earlier statistical machine translation systems learnt as a separate, hard step. Bahdanau's model learns these alignments jointly with the translation model, end to end, with no alignment supervision at all.
Second, the bottleneck of vanilla seq2seq is gone. Information from any encoder position can reach any decoder position in a single attention hop. The gradient of the loss at decoder step $t$ flows directly to encoder state $\mathbf{h}_j$ via $\alpha_{tj}$, regardless of how far apart $t$ and $j$ are. Translation quality on long sentences improved dramatically; the BLEU-versus-sentence-length curve, which used to slope downwards, flattened out. The fix was, mechanically, the addition of a tiny feedforward scoring network and a softmax, but conceptually it replaced a one-shot compression with a query-driven look-up.
Worked example
Take a tiny example with $T = 3$ encoder states for the source words "the", "cat", "sat", and a current decoder state $\mathbf{s}_{t-1}$ that is about to produce the second French word. Suppose the additive scoring network has produced the alignment scores $e_t = (0.5, 2.0, -1.0)$. The softmax gives
$$\alpha_t = \mathrm{softmax}(0.5, 2.0, -1.0) \approx \frac{1}{9.41}(1.65, 7.39, 0.37) \approx (0.175, 0.786, 0.039).$$
The decoder's context vector is $\mathbf{c}_t = 0.175 \, \mathbf{h}_1 + 0.786 \, \mathbf{h}_2 + 0.039 \, \mathbf{h}_3$, overwhelmingly dominated by the encoder state for "cat". If the decoder is currently producing "chat", this is exactly the alignment we hoped for, and it has been discovered without any explicit alignment supervision: only the cross-entropy loss against the gold French translation "le chat s'assit" was used.
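The arithmetic is easy to check numerically (a short NumPy verification; the array names are illustrative):

```python
# Verifying the worked example's softmax.
import numpy as np

e = np.array([0.5, 2.0, -1.0])
alpha = np.exp(e) / np.exp(e).sum()
print(np.exp(e))   # ~ [1.649 7.389 0.368], summing to ~ 9.41
print(alpha)       # ~ [0.175 0.786 0.039]
```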
Across the full sentence, the attention matrix $A \in \mathbb{R}^{T' \times T}$ for "the cat sat" $\to$ "le chat s'assit" looks roughly like this, with rows indexed by the French target and columns by the English source:
| | the | cat | sat |
|---|---|---|---|
| le | 0.92 | 0.06 | 0.02 |
| chat | 0.04 | 0.93 | 0.03 |
| s'assit | 0.02 | 0.05 | 0.93 |
The matrix is essentially diagonal because English and French share an SVO order with cognate vocabulary, so the alignment is monotonic and one-to-one. For non-cognate language pairs the picture is messier but still interpretable. English-to-Japanese attention maps typically show a strong anti-diagonal component because Japanese is verb-final: the verb at the end of the English source attends to the verb produced near the end of the Japanese target, while subject and object swap positions. Long-range syntactic dependencies (relative clauses, scrambling, object dropping) show up as off-diagonal smudges, and in the early NMT papers these maps were used as qualitative evidence that the network was not merely memorising surface bigrams.
Why this set up the Transformer
The Transformer architecture of Vaswani and colleagues (2017) is, almost literally, attention without the recurrence. The decoder still queries a set of encoder states and forms a weighted sum. What changes is that (i) the recurrent encoder is replaced by stacked self-attention layers, in which every source position queries every other source position, so the encoder no longer has to read left-to-right at all; (ii) the decoder also uses self-attention over its own previous outputs, with a causal mask to prevent peeking; and (iii) the additive scoring network is replaced by scaled dot-product attention, $e_{tj} = (\mathbf{s}_t^\top \mathbf{h}_j) / \sqrt{d_k}$, which is cheaper to compute and parallelises trivially on GPUs. The scaling prevents the dot products from growing large for high-dimensional vectors and pushing the softmax into saturated regions.
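For comparison with the additive form, here is a minimal NumPy sketch of scaled dot-product attention. The function name, shapes and random data are assumptions for the illustration; Chapter 13 develops the full multi-head version.

```python
# Scaled dot-product attention sketch: all query positions in one matrix product.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # e_{tj} = q_t . k_j / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                  # one weighted sum per query

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))     # e.g. 5 decoder positions
K = rng.standard_normal((7, 8))     # e.g. 7 encoder positions
V = rng.standard_normal((7, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                    # (5, 8): every position computed in parallel
```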
The case for getting rid of the RNN was practical as much as theoretical. Recurrence imposes a serial dependency: encoder step $t$ cannot start until step $t - 1$ has finished. On a GPU this is wasted parallelism. Self-attention over a sequence of length $T$ can be computed as a single $T \times T$ matrix multiplication, fully parallel across positions. Training time per token dropped by roughly an order of magnitude on the same hardware, which in turn made it tractable to scale models to hundreds of millions and then billions of parameters. The attention idea remained; the recurrence did not. Chapter 13 develops the Transformer in detail, including multi-head attention, positional encodings, layer normalisation, and the residual stream.
CTC and other alignment-free approaches
Attention is the right tool when the alignment between source and target is genuinely many-to-many, with reorderings, translation being the prototype. Speech recognition is different: the alignment is monotonic (audio frames arrive in time order, transcript characters are written in time order) but the source is much longer than the target, because each phoneme spans tens of audio frames. Connectionist Temporal Classification (Graves et al. 2006) exploits exactly this structure. CTC augments the output alphabet with a special blank symbol $\varnothing$, lets the network emit a label at every audio frame, and then defines the probability of a target sequence as the sum over all frame-level label sequences that collapse to it after merging consecutive duplicates and removing blanks. The sum is computed efficiently with a forward-backward dynamic programme, and the resulting loss is differentiable.
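The collapse map and the marginal are easy to illustrate by brute force on a toy example. The sketch below enumerates all frame-level paths rather than running the forward-backward recursion, which is feasible only because the example is tiny; the alphabet, the frame-level distribution and all names are invented for illustration (a practical implementation would use an efficient CTC loss such as torch.nn.CTCLoss).

```python
# CTC collapse map and a brute-force version of the CTC marginal (toy example).
import itertools
import numpy as np

BLANK = "-"   # stands for the blank symbol written as an empty-set sign in the text

def collapse(path):
    """Merge consecutive duplicates, then remove blanks."""
    merged = [s for s, prev in zip(path, (None,) + path[:-1]) if s != prev]
    return tuple(s for s in merged if s != BLANK)

# Frame-level label distribution: 4 audio frames, alphabet {blank, a, b}, rows sum to 1.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.1, 0.7]])
alphabet = (BLANK, "a", "b")

# P(target) = sum over all frame-level paths that collapse to the target.
target = ("a", "b")
p = 0.0
for path in itertools.product(range(3), repeat=4):
    symbols = tuple(alphabet[i] for i in path)
    if collapse(symbols) == target:
        p += np.prod(probs[np.arange(4), list(path)])
print(p)   # total probability mass assigned to the transcript "ab"
```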
CTC and attention coexist in modern speech systems. OpenAI's Whisper model uses an attention-based encoder-decoder. Many production speech recognisers use CTC, often jointly with attention in a hybrid loss, or use the closely related RNN-Transducer (RNN-T) which conditions emissions on previous outputs as well as audio. The moral is that attention is not always the right answer: when the alignment is constrained, alignment-aware losses like CTC can train faster, decode in a streaming fashion, and avoid a class of failure modes (over-translation, premature termination) that haunt vanilla attention.
What you should take away
- Sequence-to-sequence models pair an encoder RNN that reads the input with a decoder RNN that produces a variable-length output, freeing the architecture from the input-equals-output-length assumption.
- Vanilla seq2seq forces every fact about the source through a single fixed vector; this caused a measurable collapse in quality on long sentences and motivated the next idea.
- Bahdanau additive attention computes alignment scores $e_{tj} = v^\top \tanh(W_a \mathbf{s}_{t-1} + U_a \mathbf{h}_j)$, normalises them with softmax to weights $\alpha_{tj}$, and forms a dynamic context $\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j$ at every decoder step, fully differentiable, trained end to end, with no alignment supervision.
- The attention weights are interpretable as soft word alignments, and they retain a direct gradient path between any encoder and decoder position, which is why long-sentence performance recovered.
- Removing the recurrence and keeping the attention gave the Transformer; replacing attention with a blank-symbol forward-backward marginal gave CTC for monotonic alignment problems like speech. Both descend from the same insight that the right level of abstraction for sequence modelling is alignment, not recurrence.