13.1 From RNN attention to self-attention

By the middle of 2017, the state of the art in machine translation looked settled. If you wanted a system that could read English and produce French, you reached for a sequence-to-sequence model: a recurrent neural network as the encoder, another recurrent neural network as the decoder, and an attention mechanism between them so the decoder could glance back at the source sentence as it generated each target word. This was the architecture that powered Google Translate's late-2016 upgrade. It was the architecture taught in every advanced NLP course. It was, by all appearances, the future of language understanding.

Then a small group at Google Brain and Google Research published a paper with a deliberately provocative title: Attention Is All You Need. The claim was that the recurrent backbone, the part that had felt structurally essential, the part that gave the network its sense of sequence, could be deleted. Not augmented, not improved upon, but removed entirely. What replaced it was a stack of attention layers in which every position in a sentence looked directly at every other position, in parallel, in a single step. The authors called the result a transformer. Within five years it had displaced recurrent networks not just in translation but in language modelling, vision, speech, protein folding, and robotics, and it had become the underlying machinery of every frontier AI system you have heard of.

This section retraces that move. We start with the seq2seq-with-attention world that the transformer was reacting against, follow Vaswani and colleagues to their decisive simplification, look at how self-attention differs from the cross-attention it grew out of, walk through a tiny worked example, and end with a frank account of why this particular architecture scaled when nothing before it had. The chapter that follows fills in the mathematics; this section is a map of the terrain.

A note on chapter geography. Chapter 12 covered recurrent networks and the seq2seq-with-attention pipeline that dominated translation up to 2017. The present chapter introduces the transformer as a clean break from that lineage. Section 13.2 takes the central operation, scaled dot-product attention, and works it out in detail. The rest of the chapter builds outward from there: multi-head attention in 13.3, the various flavours of attention in 13.4, positional encodings in 13.5, the full transformer block in 13.6, and so on through to the modern variants.

Symbols Used Here
  • $\mathbf{h}_t$: RNN hidden state at step $t$
  • $\mathbf{c}_t$: context vector at decoder step $t$
  • $\alpha_{t,i}$: attention weight on encoder state $i$ at decoder step $t$
  • $T$: sequence length

Recap: seq2seq with attention

Imagine you are building a translator in 2016. Your encoder is a long short-term memory network. You feed it the source sentence one token at a time. After reading the first word, the network produces a hidden state $\mathbf{h}_1 \in \mathbb{R}^{1024}$, a 1024-dimensional vector that encodes what the network has seen so far. After the second word, it produces $\mathbf{h}_2$, with $\mathbf{h}_1$ folded into the recurrence. By the time the encoder has read all $T$ tokens you have a sequence of hidden states $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T$, each summarising the sentence up to its position.

In Sutskever and colleagues' original 2014 design, only the final state $\mathbf{h}_T$ was passed to the decoder. That single vector had to carry the whole meaning of the sentence. It is an engineering feat that this worked at all on short sentences, and a clear failure on long ones. If your sentence has thirty tokens, asking a 1024-dimensional vector to remember every clause and modifier is asking too much. Anything that did not survive the compression was simply lost.

Bahdanau, Cho and Bengio fixed the bottleneck the obvious way: keep all the encoder states, and let the decoder choose which to look at. At every output step $t$, the decoder computes an alignment score between its current hidden state and each encoder state, normalises those scores into a probability distribution $\alpha_{t,1}, \ldots, \alpha_{t,T}$ using softmax, and forms a context vector

$$ \mathbf{c}_t = \sum_{i=1}^{T} \alpha_{t,i}\, \mathbf{h}_i, $$

a weighted average over the encoder states. The decoder uses $\mathbf{c}_t$ together with its own state to predict the next target word. Crucially, the weights $\alpha_{t,i}$ are computed afresh for every output step, so the decoder can attend to "Westphalia" while producing one French word and to "1648" while producing the next.
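To make the computation concrete, here is a minimal PyTorch sketch of one decoder step. For brevity it scores alignments with a plain dot product; Bahdanau's original scorer is a small additive network, but the score–softmax–average structure is the same.

```python
import torch

def attention_step(decoder_state, encoder_states):
    """One decoder step of attention over all encoder states.

    decoder_state:  (d,)   the decoder's current hidden state
    encoder_states: (T, d) the encoder states h_1 .. h_T
    returns:        (d,)   the context vector c_t
    """
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    alpha = torch.softmax(scores, dim=0)      # (T,) weights summing to 1
    return alpha @ encoder_states             # (d,) weighted average

T, d = 30, 1024
h = torch.randn(T, d)        # encoder states (random, for illustration)
s = torch.randn(d)           # decoder state at the current output step
c = attention_step(s, h)     # recomputed afresh at every output step
```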

This was a clean idea and it worked. Translation quality jumped. Long-sentence performance recovered. By 2016, attention had become standard across machine translation, image captioning, speech recognition, and reading comprehension. But notice what was being added: attention sat on top of a recurrent backbone. The encoder still walked through the source sentence one token at a time. The decoder still walked through the target sentence one token at a time. Information about the first word still had to ride a chain of $T$ recurrent updates before it influenced the last hidden state.

This sequential processing is the hidden cost of recurrence. To compute $\mathbf{h}_t$ you need $\mathbf{h}_{t-1}$, which in turn needs $\mathbf{h}_{t-2}$, and so on. There is no way to compute $\mathbf{h}_5$ in parallel with $\mathbf{h}_4$. On a graphics card, where everything fast comes from doing thousands of operations simultaneously, the recurrent dimension is a wall. You can throw more silicon at the problem and it will sit idle, waiting for the previous time step to finish. By 2017 the bottleneck was not modelling power but training throughput. Every extra hour spent walking through sequences was an hour the GPUs were not learning.
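The chain is easy to see in a vanilla-RNN sketch (the weights here are random placeholders; only the data dependency matters):

```python
import torch

d = 1024
W_x = torch.randn(d, d) * 0.01   # input-to-hidden weights (illustrative)
W_h = torch.randn(d, d) * 0.01   # hidden-to-hidden weights (illustrative)
x = torch.randn(30, d)           # a 30-token sentence

h = torch.zeros(d)
for x_t in x:                    # T strictly sequential steps: h_5 cannot
    h = torch.tanh(W_x @ x_t + W_h @ h)   # be computed until h_4 exists
```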

Vaswani et al. (2017): "Attention Is All You Need"

The radical move in Vaswani and colleagues' 2017 paper was to ask whether the recurrent backbone was needed at all. Their answer was no. They proposed an architecture made of stacked attention layers, with no recurrence and no convolution anywhere. Every token in the input attends directly to every other token in the input, in parallel, in a single step. The encoder is six such layers; the decoder is six more. Within each layer, the attention sub-layer is followed by a feed-forward network applied identically to each position. That is the whole architecture.

Two consequences fall out of this design, and they are the reasons the transformer took over.

The first is parallelism. Because there is no recurrence, every position in the sequence can be processed simultaneously. The matrix multiplications that implement attention happen in one shot across the whole sentence. On a modern GPU this is a vastly better fit than the step-by-step march of an LSTM. A transformer training on the same hardware as an equivalent recurrent network can chew through several times more data in the same wall-clock hour. When the limit on neural network capability is how fast you can feed the model examples, and by 2017 it clearly was, this throughput advantage compounds into capability.

The second is that any two tokens are exactly one attention hop apart, regardless of how far apart they sit in the sequence. In an LSTM, information from token 1 reaches token 100 only by passing through 99 sequential updates, with each update introducing the possibility of forgetting or distortion. In a self-attention layer, token 100 simply queries token 1 directly: a single dot product, a single softmax-weighted sum. Long-range dependencies (agreement across clauses, anaphora, the relationship between a question and its answer ten paragraphs earlier) no longer have to survive a long chain of recurrences. They are first-class citizens of the architecture.

When the paper landed, the result was almost too clean. The transformer matched or beat the best recurrent translation systems on the standard benchmarks while training in a fraction of the wall-clock time. Within a year, BERT had used the same architecture to redefine the state of the art across most of natural-language understanding. The seq2seq-with-attention era was not slowly superseded; it was abruptly closed.

Self-attention vs other attention

It helps to keep the vocabulary clean. The attention introduced by Bahdanau is what we now call cross-attention: queries come from one sequence (the decoder), and keys and values come from another sequence (the encoder). The decoder is asking "what part of the source should I look at?" and the encoder states are providing the answers. Two separate sequences, two separate roles.

Self-attention is the move that defines the transformer. Queries, keys and values all come from the same sequence. Every position takes a turn as the asker (forming a query), as the advertiser (forming a key), and as the content (forming a value). When the query at position $i$ matches the key at position $j$, the value at position $j$ flows into the new representation at position $i$. Each position simultaneously gathers information from every other position, weighted by how relevant they are to it.

Why does this matter? Because most of what a sentence "means" is relational. The pronoun "it" only acquires a referent by attending to an earlier noun. The verb "agrees" only acquires its number by attending to its subject. The word "bank" only resolves to financial institution or river edge by attending to surrounding context. Self-attention is a mechanism designed to do exactly this kind of relational lookup, and to do it at every layer, so that representations get progressively more contextualised as you stack layers.

A useful mental picture is a soft Python dictionary. A normal dictionary takes a query, finds the unique matching key, and returns its value. A soft dictionary scores the query against every key, normalises the scores into a probability distribution, and returns the expected value under that distribution, a weighted blend. If one key matches strongly, you get something close to the original key's value. If many keys match weakly, you get a mixture. Self-attention is a soft dictionary lookup performed by every position against every other position in the same sequence.
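In code the analogy is almost literal. The hypothetical soft_lookup below replaces a dictionary's hard key match with a softmax over similarity scores:

```python
import torch

def soft_lookup(query, keys, values):
    """A soft dictionary: rather than the value of the one matching key,
    return the expected value under a softmax over query-key scores."""
    scores = keys @ query                   # (n,) one similarity per key
    weights = torch.softmax(scores, dim=0)  # (n,) a probability distribution
    return weights @ values                 # weighted blend of all values

n, d = 10, 4
keys, values = torch.randn(n, d), torch.randn(n, d)
out = soft_lookup(torch.randn(d), keys, values)   # (d,) blended result
```

When one score dominates, the softmax is nearly one-hot and the blend collapses to an ordinary lookup; when scores are diffuse, you get a mixture.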

A few subtler points are worth noting. Self-attention is permutation-equivariant: if you shuffle the input tokens, the outputs shuffle the same way, with no other change. The mechanism has no built-in notion of order. To recover word order you must add positional information separately, which is why §13.5 exists. Self-attention is also fully parallel across positions: the output at every position is computed in one matrix multiply against the same shared keys and values. And the receptive field is uniform: every output position sees every input position, no matter how far apart they are.
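The equivariance claim is easy to check numerically with a bare-bones self-attention function (random projections; the $\sqrt{d_k}$ scaling from §13.2 is included but not essential to the point):

```python
import torch

torch.manual_seed(0)
T, d = 5, 8
Wq, Wk, Wv = torch.randn(3, d, d).unbind(0)   # random projection matrices

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # (T, T) attention weights
    return A @ V

X = torch.randn(T, d)
perm = torch.randperm(T)
# shuffling the inputs just shuffles the outputs the same way
assert torch.allclose(self_attention(X)[perm], self_attention(X[perm]), atol=1e-5)
```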

The transformer's encoder uses pure self-attention. Its decoder uses both self-attention (over the target side) and cross-attention (over the encoder output, in the seq2seq sense). The combination of these two attention types in a stacked architecture is what made the transformer expressive enough to replace the entire recurrent translation pipeline.

Worked example: a tiny self-attention

Take the sequence "the cat sat". Three tokens. Suppose each is embedded as a four-dimensional vector $\mathbf{x}_i \in \mathbb{R}^4$. Stack them into a matrix $\mathbf{X} \in \mathbb{R}^{3 \times 4}$, one row per token.

Self-attention introduces three learned projection matrices $\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$, each shaped $4 \times 4$ in this toy example. We multiply the input by each projection to produce three new matrices:

  • $\mathbf{Q} = \mathbf{X} \mathbf{W}^Q$, the queries (one row per token, asking "what am I looking for?");
  • $\mathbf{K} = \mathbf{X} \mathbf{W}^K$, the keys (one row per token, advertising "here is what I have");
  • $\mathbf{V} = \mathbf{X} \mathbf{W}^V$, the values (one row per token, providing the actual content to be passed on).

Now compute the score matrix. For each pair of positions $i, j$ we form the dot product $\mathbf{q}_i \cdot \mathbf{k}_j$. Stacking these into a matrix is just $\mathbf{Q} \mathbf{K}^\top$, a $3 \times 3$ object. Row $i$ tells us how strongly position $i$ wants to listen to each of the three positions, including itself. Apply softmax row-wise, so each row sums to one. The result is the attention weight matrix.

Finally, multiply this $3 \times 3$ weight matrix by $\mathbf{V}$. The output is a new $3 \times 4$ matrix in which each row is a weighted average of all three value vectors, with the weights determined by how relevant each position was to the query. So "cat" might end up mostly itself plus a bit of "the" (resolving the determiner) and a bit of "sat" (gathering predicate context). "Sat" might end up mostly itself plus a strong contribution from "cat" (its subject).

That is a single self-attention layer. Section 13.2 develops this with explicit numbers, the scaling factor $\sqrt{d_k}$, and the masking step needed for autoregressive use; here the goal is just to see the shape of the operation: project, score, softmax, blend.
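A direct transcription of the toy example, with random weights (so the particular numbers are illustrative; only the shapes and the pipeline matter, and the $\sqrt{d_k}$ scaling is deferred to §13.2, as in the text):

```python
import torch

torch.manual_seed(0)
X = torch.randn(3, 4)            # "the cat sat": one 4-d row per token

Wq, Wk, Wv = (torch.randn(4, 4) for _ in range(3))   # learned in practice

Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project: each (3, 4)
scores = Q @ K.T                          # score:   (3, 3), entry (i, j) = q_i . k_j
weights = torch.softmax(scores, dim=-1)   # softmax: each row sums to 1
out = weights @ V                         # blend:   (3, 4), rows are value mixtures

print(weights)     # row i: how position i distributes its attention
print(out.shape)   # torch.Size([3, 4])
```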

Why this scaled so well

Three forces, working together, account for the transformer's dominance.

The first is hardware fit. Modern accelerators reward dense matrix multiplication and punish sequential dependencies. A transformer training step is, almost end to end, a sequence of large matrix multiplies and a softmax. There is essentially nothing the GPU has to wait for. By contrast, an LSTM step has roughly the same arithmetic but cannot be parallelised across time, so it sees a fraction of peak hardware utilisation. At the same parameter count, transformers train roughly an order of magnitude faster in wall-clock terms. When data is the bottleneck, training faster is the same as training a stronger model.

The second is graceful scaling. Recurrent networks suffered badly from vanishing gradients as depth and length grew; this was the original reason for LSTMs and GRUs, and even those degraded as networks got larger. Transformers, by contrast, scale unusually cleanly. Add more parameters, give them more data, and performance improves on a smooth curve over many orders of magnitude: these are the neural scaling laws discovered by Kaplan and colleagues at OpenAI in 2020. There were no obvious cliffs. Whatever the next-best architecture might have done, the transformer kept getting better with every additional zero in the parameter count.

The third is emergent in-context learning. Once transformers crossed a certain scale, around the GPT-3 mark of 175 billion parameters, they began to do something qualitatively new: solve tasks they had never been explicitly trained on, simply by being shown a few examples in the prompt. No weight updates, no fine-tuning. The model would infer the pattern from the context window and apply it. This was not a feature anyone had asked for; it fell out of training a sufficiently large transformer on enough text. It is also the foundation of every chat-based AI product since.

By 2020, GPT-3 had demonstrated all three properties at once: it trained on hardware that no recurrent model could have used efficiently, it kept improving as the team poured more compute in, and at scale it acquired capabilities that smaller versions did not have. By 2023, frontier systems (GPT-4, Claude, Gemini) were operating at capability surfaces no recurrent network of any size had ever approached. The architecture that started as a translation experiment had become the substrate of modern AI.

What this chapter covers

The rest of this chapter develops the transformer in full. Section 13.2 derives scaled dot-product attention with worked numbers and explains the role of the $\sqrt{d_k}$ factor. Section 13.3 introduces multi-head attention and shows why splitting the projection into many parallel heads improves expressiveness. Section 13.4 distinguishes self-attention from cross- (encoder–decoder) attention. Section 13.5 covers positional encodings (sinusoidal, learned, rotary) that re-inject order information into the otherwise permutation-equivariant attention layer. Section 13.6 assembles the full transformer block from attention, feed-forward, residual connections and layer normalisation. Section 13.7 walks through the encoder-only, decoder-only, and encoder–decoder variants, and which architectures are used for which tasks. Section 13.8 covers causal masking and the autoregressive loss that makes GPT-style language models possible. Section 13.9 totals up parameters and FLOPs. Section 13.10 implements a complete transformer from scratch in PyTorch. Sections 13.11 and 13.12 cover BERT and GPT respectively. Sections 13.13 to 13.15 turn to the efficient attention variants (sparse, FlashAttention, linear and state-space models) that have grown up to address the quadratic cost of standard attention.

What you should take away

  1. The transformer began as a deliberate simplification: take seq2seq with attention, and remove the recurrence.
  2. Self-attention computes, for each position in a sequence, a soft dictionary lookup against every other position in the same sequence: global mixing in a single layer.
  3. The two structural advantages of self-attention over recurrent layers are full parallelism across positions (every token processed simultaneously) and a constant path length between any two positions (no information has to traverse a long chain of recurrent steps).
  4. Self-attention is permutation-equivariant on its own; positional encodings in §13.5 reintroduce order.
  5. The transformer scaled where prior architectures had not, because it fits modern hardware, exhibits clean scaling laws across many orders of magnitude, and at sufficient size develops emergent in-context learning.
