9.18 Brief survey of architectures beyond MLPs
Everything we have built so far in this chapter is a multilayer perceptron. An MLP takes a vector of numbers, multiplies it by a matrix, applies a non-linearity, and repeats. It is genuinely powerful: the universal approximation theorem in §9.5 told us that a wide enough MLP can approximate any reasonable function. So why bother with anything else?
The answer is that an MLP treats every input dimension as if it were a separate, independent feature. Pixel 137 of an image and pixel 138 are, to an MLP, two unrelated numbers; the network has to learn from scratch that they are next to each other. The first word of a sentence and the second word are two unrelated entries in a vector; the network has to learn the notion of order from the data. A node in a social graph and its friends are unrelated rows in a matrix; the network has to discover the graph from scratch.
But real-world data has structure. Pixels in an image are related to their immediate neighbours. Tokens in a sentence have an order, and words separated by a few positions are usually related. Atoms in a molecule are connected by bonds. Frames in a video follow a temporal sequence. If we know this structure ahead of time, it is wasteful to make the network rediscover it. Worse, an MLP that has to learn structure from a finite training set may simply fail to learn it at all, or learn it in a way that does not generalise.
The architectures in this section all share one idea: they bake the relevant structure of the data directly into the model. This is called an inductive bias. The result is networks that need fewer parameters to reach the same accuracy, train faster, and generalise better to new examples drawn from the same kind of structured world.
This section is a road map. Each architecture below gets one subsection: the core idea in plain English, the kind of data it suits, where the textbook treats it in depth, and one canonical paper to read first. Chapters 11 to 15 then take these in turn and develop them properly.
Convolutional neural networks (CNNs)
A convolutional network is the right tool when the input lives on a regular grid: a 2D grid for images, a 1D grid for audio waveforms, a 3D grid for medical volumes or video. Instead of connecting every input pixel to every neuron in the next layer, a CNN slides a small kernel (typically 3×3 or 5×5) across the input and computes a dot product at each position. The same kernel is used at every location, so a feature detected in the top-left of the image is detected the same way in the bottom-right. This property is called translation equivariance, and it is exactly the assumption that makes sense for natural images: a cat is still a cat ten pixels to the left.
Two inductive biases are at work. The first is local connectivity: each output depends only on a small patch of the input, not on the whole image. The second is weight sharing: the same kernel weights are reused at every spatial location. Together these slash the parameter count by orders of magnitude. A 3×3 convolution mapping a $32 \times 32 \times 3$ image to 64 output channels uses $3 \cdot 3 \cdot 3 \cdot 64 = 1728$ weights. A fully connected layer producing the same $32 \times 32 \times 64$ output would need $32 \cdot 32 \cdot 3 \cdot 32 \cdot 32 \cdot 64 \approx 200$ million weights. The CNN is roughly a hundred thousand times more parameter-efficient on this single layer alone.
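A quick way to confirm these counts is to build both layers and count their weights. A minimal sketch in PyTorch, assuming the shapes from the example above (bias terms are left out to match the hand calculation):

```python
import torch.nn as nn

# 3x3 convolution: 3 input channels -> 64 output channels, weights shared across positions
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, bias=False)

# Fully connected layer producing the same 32x32x64 output from a 32x32x3 input
fc = nn.Linear(in_features=32 * 32 * 3, out_features=32 * 32 * 64, bias=False)

n_conv = sum(p.numel() for p in conv.parameters())  # 3*3*3*64 = 1,728
n_fc = sum(p.numel() for p in fc.parameters())      # 3,072 * 65,536 = 201,326,592
print(n_conv, n_fc, n_fc // n_conv)                 # ratio is roughly 116,000
```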
Stacked convolutional layers build a hierarchy of features. The first layer learns edge and colour detectors; the second composes those into textures and corners; the third assembles object parts; the deepest layers respond to whole objects. This hierarchy emerges automatically from the data, but the architecture is what makes it possible.
The canonical early paper is LeCun et al. (1998), which introduced LeNet-5 for handwritten digit recognition. The architecture went mainstream when AlexNet (Krizhevsky, Sutskever and Hinton, 2012) won the ImageNet competition by a large margin, kicking off the modern deep-learning era. ResNet (He et al., 2015) added skip connections and pushed depth to over a hundred layers. CNNs remain the workhorse for image classification, segmentation, object detection, audio processing and time-series tasks where local structure matters. They are treated in depth in Chapter 11.
Recurrent neural networks (RNNs, LSTMs, GRUs)
A recurrent network processes a sequence one element at a time, carrying a hidden state that summarises everything seen so far. At each step the network reads a new input, combines it with the hidden state, applies a non-linearity, and produces a new hidden state:
$$\mathbf{h}_t = \sigma(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}).$$
The same weights are reused at every time step, just as a CNN reuses the same kernel at every spatial position, only here the sharing is across time rather than space. The inductive bias is that sequences have temporal order and that nearby elements are usually more related than distant ones.
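A single recurrent step is only a few lines of code. The sketch below implements the update above directly, with tanh as the non-linearity and random placeholder weights rather than trained ones:

```python
import torch

d_hidden, d_input = 16, 8
W_h = 0.1 * torch.randn(d_hidden, d_hidden)  # recurrent weights, shared across all time steps
W_x = 0.1 * torch.randn(d_hidden, d_input)   # input weights
b = torch.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    # h_t = sigma(W_h h_{t-1} + W_x x_t + b)
    return torch.tanh(W_h @ h_prev + W_x @ x_t + b)

h = torch.zeros(d_hidden)                    # hidden state summarises everything seen so far
for x_t in torch.randn(20, d_input):         # a sequence of T = 20 inputs
    h = rnn_step(h, x_t)
```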
Vanilla RNNs work well on short sequences but break down on long ones. The repeated multiplication by the same recurrent weight matrix causes gradients to either vanish or explode (§9.11), so the network cannot learn dependencies more than a few dozen steps back. The Long Short-Term Memory (LSTM) network of Hochreiter and Schmidhuber (1997) solves this with gated memory cells. A cell has an explicit memory line that information can flow along almost untouched (the so-called constant error carousel), together with input, forget and output gates that learn what to write, keep and read. The Gated Recurrent Unit (GRU) of Cho et al. (2014) is a simpler variant with two gates instead of three; it usually performs comparably with fewer parameters.
In principle an LSTM can learn dependencies hundreds of steps apart. In practice it is limited by the sequential nature of training: you cannot parallelise across time, because step $t+1$ depends on step $t$. This becomes the decisive bottleneck for long sequences.
The canonical papers are Hochreiter and Schmidhuber (1997) for the LSTM itself, and Sutskever, Vinyals and Le (2014), which introduced the sequence-to-sequence framework that made LSTMs the dominant architecture for machine translation. RNNs and their gated variants are treated in Chapter 12. They have largely been superseded by transformers for new work, but they remain in production speech, audio and small-footprint streaming systems where their constant memory per step is an advantage.
Transformers
The transformer replaces recurrence with self-attention. Instead of reading a sequence one element at a time, every output position attends to every input position simultaneously. Attention is computed as
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right) \mathbf{V},$$
where $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ (queries, keys and values) are linear projections of the input. Each query position computes a similarity score against every key position; these scores are softmax-normalised into weights, and the output is a weighted sum of the values. Because the operation is essentially one large matrix multiplication, modern accelerators run it with extreme efficiency. The cost is $O(T^2)$ in the sequence length $T$, which is the central engineering challenge of long-context modelling.
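The formula translates almost line for line into code. A minimal sketch of single-head self-attention (no masking, no batch dimension; the projection matrices are random placeholders):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)            # each row of weights sums to 1
    return weights @ V                             # weighted sum of the values

T, d = 10, 64
x = torch.randn(T, d)
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)         # self-attention: Q, K, V all come from x
```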
A transformer block stacks self-attention and an MLP, both wrapped in residual connections and layer normalisation. The full architecture is just $L$ such blocks in series. There is no recurrence, no convolution and no fixed window: every token has direct access to every other token. Multi-head attention runs several attention operations in parallel with different projections, allowing the network to attend to different kinds of relationships at once.
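Putting the pieces together, one block might look like the sketch below. This uses the pre-norm arrangement common in recent models rather than the post-norm layout of the original paper, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                 # x: (batch, T, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)  # multi-head self-attention
        x = x + attn_out                  # residual connection around attention
        x = x + self.mlp(self.ln2(x))     # residual connection around the MLP
        return x

block = TransformerBlock()
y = block(torch.randn(2, 10, 512))        # output shape (2, 10, 512)
```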
The decisive paper is "Attention Is All You Need" by Vaswani et al. (2017). BERT (Devlin et al., 2018) showed that bidirectional pre-training produced strong representations for understanding tasks; GPT (Radford et al., 2018, 2019) and especially GPT-3 (Brown et al., 2020) showed that the same architecture, scaled up by orders of magnitude, learned a wide range of behaviours from raw text alone. Transformers now dominate language modelling, machine translation, code generation, multimodal models and most of modern NLP, and they are increasingly used for vision (ViT) and audio. They are the subject of Chapter 13.
Generative models
Everything so far has been discriminative: given an input, predict an output. Generative models try something harder: learn the distribution of the data itself, so you can sample new examples from it. Several distinct architectures share this goal.
The variational autoencoder (VAE) of Kingma and Welling (2014) trains an encoder that maps inputs to a distribution over a low-dimensional latent space, plus a decoder that maps latents back to data. Training maximises a lower bound on the data likelihood, the evidence lower bound or ELBO. The VAE produces smooth latent spaces in which interpolation is meaningful but tends to generate slightly blurry samples.
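In symbols, with encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ and decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$, the ELBO is

$$\log p_\theta(\mathbf{x}) \;\geq\; \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] \;-\; \mathrm{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right),$$

where the first term rewards accurate reconstruction and the KL term keeps the encoder's distribution close to the prior over the latent space.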
The generative adversarial network (GAN) of Goodfellow et al. (2014) sets two networks against each other. A generator produces fake samples from random noise; a discriminator tries to tell fake from real. They are trained together by a minimax game. GANs produce sharp, realistic samples but are notoriously hard to train: they oscillate, collapse onto a few modes, and require careful balancing.
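With generator $G$ and discriminator $D$, the game is

$$\min_G \max_D \;\; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\!\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\!\left[\log\bigl(1 - D(G(\mathbf{z}))\bigr)\right].$$

The discriminator pushes this quantity up; the generator pushes it down.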
Diffusion models (Ho, Jain and Abbeel, 2020) currently dominate image generation. The trick is to add noise to data in many small steps until it becomes pure Gaussian noise, then train a network to reverse one step of that process. To generate a new sample you start with noise and apply the network repeatedly, gradually denoising until a coherent image emerges. Stable Diffusion, DALL-E 3, Sora, Imagen and Midjourney are all diffusion models. Normalising flows are a fourth approach, using invertible transformations of a simple distribution; they are useful when exact likelihoods matter, as in physics applications.
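For the diffusion approach, the simplified objective of Ho et al. (2020) trains a network $\epsilon_\theta$ to predict the noise that was added at a randomly chosen step $t$:

$$\mathcal{L} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\bigl\| \boldsymbol{\epsilon} - \epsilon_\theta\!\bigl(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\bigr) \bigr\|^2\right],$$

where $\mathbf{x}_0$ is a clean training example, $\boldsymbol{\epsilon}$ is Gaussian noise and $\bar{\alpha}_t$ controls how much noise has been added by step $t$.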
These approaches are treated in depth in Chapter 14. The canonical papers are Kingma and Welling (2014) for VAEs, Goodfellow et al. (2014) for GANs and Ho et al. (2020) for denoising diffusion.
Graph neural networks (GNNs)
A graph neural network generalises convolution to data on irregular graphs: molecules, social networks, road networks, knowledge bases, computational meshes. The basic operation is message passing: each node aggregates a function of its neighbours' features, combines that with its own features, and updates its representation. After $L$ rounds of message passing, each node's representation depends on the subgraph within $L$ hops.
Variants differ in the aggregation function. The graph convolutional network (GCN) of Kipf and Welling (2017) uses a normalised sum, which is effectively a weighted average. GraphSAGE (Hamilton, Ying and Leskovec, 2017) supports sampling-based aggregation that scales to large graphs. The graph attention network (GAT) of Veličković et al. (2018) replaces the fixed weights with learned attention weights, so each node decides how much to listen to each neighbour.
GNNs power molecular property prediction (drug discovery, catalyst design), recommender systems, social-network analysis, traffic forecasting (Google Maps' ETA models, DeepMind's GraphCast for weather) and parts of AlphaFold's protein structure pipeline. A general framework for thinking about these models is the message-passing neural network of Gilmer et al. (2017), which subsumes most variants under a common notation:
$$\mathbf{h}_v^{(l+1)} = \phi\!\left(\mathbf{h}_v^{(l)}, \sum_{u \in \mathcal{N}(v)} \psi(\mathbf{h}_u^{(l)}, \mathbf{h}_v^{(l)}, e_{uv})\right).$$
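A minimal sketch of one round of message passing with sum aggregation, assuming the graph is given as a list of directed edges; here $\psi$ and $\phi$ are single linear layers and the edge features $e_{uv}$ are omitted for brevity:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.psi = nn.Linear(2 * d, d)  # message function: (h_u, h_v) -> message
        self.phi = nn.Linear(2 * d, d)  # update function: (h_v, aggregated messages) -> new h_v

    def forward(self, h, edges):
        # h: (num_nodes, d) node features; edges: list of (u, v) pairs, messages flow u -> v
        agg = torch.zeros_like(h)
        for u, v in edges:
            agg[v] = agg[v] + self.psi(torch.cat([h[u], h[v]]))  # sum over v's neighbours
        return torch.relu(self.phi(torch.cat([h, agg], dim=-1)))

h = torch.randn(5, 8)                                  # 5 nodes with 8 features each
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3)]
layer = MessagePassingLayer(8)
h = layer(h, edges)                                    # each node now sees its 1-hop neighbourhood
```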
This textbook gives GNNs only a brief mention; the reader interested in graph-structured data should consult Hamilton's Graph Representation Learning (2020).
Encoder-decoder architectures and U-Nets
Many tasks demand an output the same shape as the input: segment every pixel of a medical image, denoise a photograph, translate one image into another. The natural design is an encoder that compresses the input through successively smaller feature maps, paired with a decoder that expands those features back to the original resolution.
The U-Net of Ronneberger, Fischer and Brox (2015) added the crucial twist: skip connections from each encoder layer to the matching decoder layer. The encoder discards spatial detail to extract abstract features; the decoder reuses the encoder's high-resolution feature maps to put fine detail back. Drawn with the encoder descending and the decoder ascending, the diagram forms a U, which gives the network its name.
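A minimal sketch of the skip-connection idea with a single level of downsampling (a real U-Net stacks four or five such levels, uses more channels per level and two convolutions per stage):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, c_in=1, c_mid=16, c_out=1):
        super().__init__()
        self.enc = nn.Conv2d(c_in, c_mid, 3, padding=1)                  # full-resolution features
        self.down = nn.Conv2d(c_mid, 2 * c_mid, 3, stride=2, padding=1)  # halve the resolution
        self.up = nn.ConvTranspose2d(2 * c_mid, c_mid, 2, stride=2)      # back to full resolution
        self.dec = nn.Conv2d(2 * c_mid, c_out, 3, padding=1)             # sees skip + upsampled

    def forward(self, x):
        skip = torch.relu(self.enc(x))         # kept aside for the skip connection
        bottom = torch.relu(self.down(skip))   # coarse, more abstract features
        up = torch.relu(self.up(bottom))
        return self.dec(torch.cat([up, skip], dim=1))  # concatenate along the channel axis

net = TinyUNet()
y = net(torch.randn(1, 1, 32, 32))             # output (1, 1, 32, 32): same size as the input
```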
U-Nets are the standard backbone for biomedical image segmentation, where they were invented, and for image-to-image translation, denoising and inpainting. Most diffusion models for image generation use a U-Net as the network that predicts the denoising step. Encoder-decoder structure also appears in machine translation (the original transformer is itself an encoder-decoder model) and in autoencoders for representation learning. The textbook returns to U-Nets in Chapter 14 as part of the diffusion model treatment.
Mixture-of-experts (MoE)
A mixture-of-experts layer is a sparsely activated alternative to a single dense MLP. There are $E$ "expert" subnetworks, each itself a small MLP, plus a router, a small network that decides, for each input token, which $k$ experts (typically $k=1$ or $2$) should process it. Only the chosen experts run; the rest are idle.
The trade-off is sharp. Total parameters scale with $E$, which can be in the hundreds, but compute per token scales with $k$ alone. A trillion-parameter MoE model can have the per-token compute of a fifty-billion-parameter dense one. Training requires careful load balancing: without an auxiliary loss, the router learns to send everything to a small handful of experts and the rest die. Even so, the technique now powers many of the largest deployed models.
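A minimal sketch of top-$k$ routing with a handful of toy experts (a production implementation adds a load-balancing loss and groups tokens by expert instead of looping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts)  # one score per expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d)
        scores = self.router(x)                 # (num_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # normalise over the chosen experts only
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for j in range(self.k):             # only k of the n_experts run for this token
                expert = self.experts[int(top_idx[t, j])]
                out[t] = out[t] + weights[t, j] * expert(x[t])
        return out

moe = MoELayer()
y = moe(torch.randn(10, 64))                    # 10 tokens, each routed to 2 of 8 experts
```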
MoE first appeared in Shazeer et al. (2017) ("Outrageously Large Neural Networks"); GShard, Switch Transformer (Fedus, Zoph and Shazeer, 2021), Mixtral, DeepSeek-V3 and reportedly GPT-4 all use it. The motivation is economic as much as scientific: MoE lets you grow model capacity without growing the inference bill at the same rate.
State-space models (Mamba)
Transformers scale brilliantly, but their attention cost grows quadratically with sequence length. For very long sequences (entire books, hour-long audio, genome fragments) this becomes prohibitive. State-space models offer an alternative. They are based on a continuous-time linear dynamical system, $\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}\mathbf{x}(t)$, $\mathbf{y}(t) = \mathbf{C}\mathbf{h}(t)$, discretised in time. With the right parameterisation they can be evaluated either as a recurrence (constant memory per step) or as a long convolution (parallelisable across time), giving the best of both worlds.
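A minimal sketch of the recurrent view, assuming the continuous-time system has already been discretised into matrices $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ (the parameterisation and discretisation are where S4 and Mamba do their real work; the values below are placeholders):

```python
import torch

d_state, d_in = 16, 1
A_bar = 0.9 * torch.eye(d_state)          # placeholder discretised state matrix
B_bar = 0.1 * torch.randn(d_state, d_in)  # placeholder discretised input matrix
C = 0.1 * torch.randn(1, d_state)         # read-out matrix

def ssm_scan(xs):
    # Recurrent evaluation: constant memory per step, like an RNN
    h = torch.zeros(d_state, 1)
    ys = []
    for x in xs:                                    # xs: (T, d_in)
        h = A_bar @ h + B_bar @ x.reshape(d_in, 1)  # h_k = A_bar h_{k-1} + B_bar x_k
        ys.append((C @ h).squeeze())                # y_k = C h_k
    return torch.stack(ys)

y = ssm_scan(torch.randn(100, d_in))
# Because the map from x to y is linear and time-invariant, the same output can also be
# computed as one long convolution, which is what makes training parallelisable.
```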
The S4 model (Gu, Goel and Ré, 2022) showed that this could match transformer quality on long-range benchmarks. Mamba (Gu and Dao, 2023) made the state-space matrices input-dependent (so-called selective state-space models) and matched transformer language-modelling quality at sub-quadratic cost. State-space models are an active research area: Mamba-2, hybrid SSM-attention models such as Jamba, and applications to vision, audio and DNA are appearing rapidly. Their long-term role in the architectural landscape is not yet settled.
Mixture of architectures and what's next
The clean taxonomy above is convenient for teaching but increasingly misleading in practice. Modern frontier models combine ideas freely. A typical large language model in 2026 is a transformer backbone with multi-head attention, but with the dense feed-forward block replaced by a mixture-of-experts layer, with rotary or ALiBi positional embeddings, with FlashAttention for efficient kernels, with grouped-query attention to shrink the KV cache, sometimes with Mamba blocks for ultra-long context, with retrieval modules for external knowledge, with a vision-transformer branch for image input, and with a diffusion-based head for image generation. ResNet-style residual connections are everywhere; layer normalisation is everywhere; encoder-decoder structure recurs at multiple scales.
The lesson is that architecture is not a single choice but a vocabulary of components that can be composed. Knowing the components (what each one is good at, what its inductive bias is, what it costs) is more useful than picking a winner. The chapters that follow develop each component in turn.
What you should take away
- An MLP has no built-in structure; it treats every input dimension as independent. Specialised architectures bake the structure of the data (grids, sequences, graphs) directly into the network. This is called an inductive bias and it pays off in fewer parameters, faster training and better generalisation.
- CNNs (Chapter 11) handle grid data through translation-equivariant convolutions. RNNs and LSTMs (Chapter 12) handle sequences with recurrence. Transformers (Chapter 13) handle sequences with attention and now dominate language and increasingly vision.
- Generative models (Chapter 14), including VAEs, GANs and diffusion models, learn data distributions rather than input-to-output mappings. Diffusion models with U-Net backbones currently lead image and video generation.
- Graph neural networks generalise convolution to irregular graphs. Encoder-decoder networks, mixture-of-experts and state-space models such as Mamba are additional components that mix-and-match with the others.
- Frontier models combine many of these ideas in a single system. Treat architectures as a vocabulary of composable components, not a single competition with one winner.