The Transformer, introduced by Vaswani et al. in the landmark 2017 paper "Attention Is All You Need," is an architecture based entirely on attention mechanisms, dispensing with recurrence and convolution. It consists of an encoder and a decoder, each built from a stack of identical layers. The encoder maps an input sequence to a sequence of continuous representations; the decoder generates output tokens autoregressively, attending to both previously generated tokens and the encoder output.
Each encoder layer contains two sub-layers: multi-head self-attention followed by a position-wise feed-forward network (a two-layer MLP applied independently to each position). Each sub-layer is wrapped in a residual connection and followed by layer normalisation. The decoder adds a third sub-layer, cross-attention, where queries come from the decoder and keys/values come from the encoder output—the direct descendant of the original seq2seq attention. Since self-attention is permutation-equivariant, positional encoding is added to input embeddings to inject order information.
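The mechanics above can be sketched in a few dozen lines of numpy. This is a minimal, single-head illustration, not the full architecture: a real implementation uses multiple attention heads, learned gain and bias parameters in layer normalisation, dropout, and learned projection shapes; the weight names (`Wq`, `Wk`, `Wv`, `W1`, `W2`) are illustrative placeholders.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention (single head, no mask)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq, seq)
    return softmax(scores) @ v

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: position-wise FFN (two-layer MLP with ReLU),
    # applied independently to each position, same residual + norm wrapping.
    h = np.maximum(0.0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + h)
```

Because self-attention itself ignores order, the positional encoding is added to the input embeddings before the first layer, e.g. `x = embeddings + sinusoidal_positions(seq_len, d_model)`.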
The Transformer's impact has been revolutionary. It trains much faster than RNNs because all positions are processed in parallel. It captures long-range dependencies without the information bottleneck of recurrence. And it scales elegantly to billions of parameters. Within a few years of its introduction, Transformer-based models achieved state-of-the-art results in machine translation, language modelling (GPT), language understanding (BERT), image classification (ViT), speech recognition, protein structure prediction (AlphaFold 2), and many more tasks. The Transformer has become the universal building block of modern AI, from chatbots to text-to-image systems to scientific discovery tools.
Related terms: Self-Attention, Multi-Head Attention, Positional Encoding, BERT, GPT
Discussed in:
- Chapter 13: Attention & Transformers — The Transformer
Also defined in: Textbook of AI, Textbook of Medical AI