12.16 Why Transformers replaced recurrent networks
By 2017, recurrent architectures had a clear set of structural limitations.
Sequential computation. An RNN must process tokens one at a time; you cannot compute $h_t$ without first having computed $h_{t-1}$. This places a hard ceiling on how much GPU parallelism can be exploited. As datasets grew into the billions of tokens and models grew into hundreds of millions of parameters, the binding constraint on training became wall-clock time per epoch rather than FLOPs per epoch, and recurrent networks were stuck on the wrong side of that trade-off.
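To make the dependency concrete, here is a minimal sketch of the vanilla-RNN forward pass in plain NumPy (the function and weight names are illustrative, not taken from any particular library). The Python loop over time steps is exactly the part that cannot be parallelised across $t$: step $t$ cannot begin until step $t-1$ has produced $h_{t-1}$.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Vanilla RNN forward pass over a sequence x of shape (T, d_in).

    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
    The loop below is inherently sequential: step t cannot start
    until step t-1 has produced h_{t-1}.
    """
    T, _ = x.shape
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    hs = np.zeros((T, d_h))
    for t in range(T):          # serial in t: no parallelism over time
        h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
        hs[t] = h
    return hs

# Example: 6 time steps, 4-dimensional inputs, 8-dimensional hidden state
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W_xh = rng.normal(size=(8, 4)) * 0.1
W_hh = rng.normal(size=(8, 8)) * 0.1
b_h = np.zeros(8)
print(rnn_forward(x, W_xh, W_hh, b_h).shape)  # (6, 8)
```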
Long-range dependencies are still hard. LSTMs and GRUs vastly improved on vanilla RNNs, but they still rely on information being preserved through hundreds of recurrence steps, and the cell state that serves as that highway is a single fixed-size vector. Empirically, attention-augmented seq2seq models worked partly because they let the decoder bypass the recurrence and look back at the encoder states directly.
Attention is enough. Vaswani et al. (2017) asked the natural follow-up question: if attention from decoder to encoder is so useful, what happens if we use attention everywhere, including from each position of a sequence to every other position of the same sequence, and remove the recurrence altogether? The answer was the Transformer. It computes representations of every position in parallel, has a constant path length between any two positions (a direct attention edge, regardless of how far apart they are), and scales well to long sequences and large datasets.
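For contrast with the recurrent loop above, here is a similarly minimal sketch of single-head scaled dot-product self-attention (again plain NumPy, with illustrative projection matrices rather than any library's API). Every position's new representation comes out of the same batched matrix products, so the whole sequence is processed in parallel, and positions $i$ and $j$ are connected through a single attention weight no matter how far apart they are.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over x of shape (T, d).

    All T positions are handled by the same matrix products, so the
    computation parallelises over the sequence, and position i attends
    to position j through one weight A[i, j], whatever |i - j| is.
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # (T, d_k), (T, d_k), (T, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T) pairwise scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # softmax over each row
    return A @ V                                 # (T, d_v) new representations

# Example: 6 positions, 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)  # (6, 8)
```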
The Transformer did not just outperform recurrent models on translation. It enabled the next decade of progress: BERT, GPT, T5, the entire family of modern language models, and the multimodal extensions to vision and audio. The recurrent-network era is by no means over: recurrent structures persist in speech and time-series work, and the recent state-space models (S4, Mamba) explicitly revive the recurrence-with-gating idea for very long sequences. But the centre of gravity of the field shifted decisively in 2017, and Chapter 13 picks up that story.