11.3 CNN architectures
The first convolutional neural network that worked at industrial scale, LeNet-5, was published in 1998. It read postal codes and bank cheques, and for a decade afterwards it sat alone, a neat idea that everyone admired but few people knew how to push further. By 2012 the situation had reversed completely. A network called AlexNet won the ImageNet competition by a huge margin, and within three years researchers were routinely training networks ten times deeper than anything LeNet ever attempted. The architectures that emerged in that whirlwind decade are the ancestors of every vision system you use today, from the face unlock on your phone to the tumour detector in a hospital scanner.
This section walks the major lineage. The headline ideas are easy to state in plain English. Replace slow, saturating activations such as the sigmoid and tanh with ReLU, a piecewise-linear function whose gradient does not die away for positive inputs. Stack many small filters instead of one large one, same view of the world, fewer parameters. Run several filter sizes in parallel and let the network decide which scale to use. Add shortcut connections that bypass blocks of layers, so that when a network gets very deep the gradient can still find its way back to the early weights. Replace expensive standard convolutions with cheap depthwise separable ones for phones and watches. Each generation of architectures encoded the lessons of the previous one, and the lessons accumulated quickly.
This section puts the building blocks of §§11.1–11.2 together into named, historically important full networks. §§11.4–11.7 then cover what we do with trained CNNs: visualising filters, detecting objects, segmenting scenes, and transferring features.
LeNet-5 (1998)
Yann LeCun and his collaborators at Bell Labs, Léon Bottou, Yoshua Bengio and Patrick Haffner, published the canonical reference, Gradient-Based Learning Applied to Document Recognition, in the Proceedings of the IEEE in 1998. The model itself, LeNet-5, had been deployed for several years before the paper appeared: by the late 1990s about one in ten cheques in the United States was being read by a LeNet variant.
The architecture is the template every CNN textbook still draws first. A 32 × 32 grayscale image flows through a convolution layer (six 5 × 5 filters, producing six feature maps), then a 2 × 2 average pooling layer that halves the spatial size, then another convolution with sixteen filters, then another pool, then two fully connected layers, and finally a Gaussian radial-basis output that compares the final feature vector to a prototype for each of the ten digits. The pattern, alternating conv and pool, then a small dense head, is the bedrock of every CNN in this section.
Total parameter count: about 60,000. By modern standards that is tiny; an iPhone 16 fits a model a thousand times larger without breathing hard. But in 1998 a workstation needed several seconds per image, and the MNIST dataset of handwritten digits was about as much data as the field could comfortably handle. LeNet's activation function was the saturating hyperbolic tangent, $\tanh$, which made gradients vanish in deep stacks; its pooling was averaging, not the max we now prefer; it had no batch normalisation and certainly no GPU. Each of those limitations would, in time, find its fix. For now the lesson to take is simply that the LeNet template, conv, pool, conv, pool, dense, output, was the right idea fourteen years before the rest of the field was ready to exploit it.
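The template is small enough to write out in full. Below is a minimal PyTorch sketch of the LeNet-5 pipeline (the framework choice is incidental); it follows the conv-pool-conv-pool-dense shape described above, keeps the saturating tanh of the original, and, as essentially every modern reimplementation does, replaces the Gaussian RBF output with a plain linear layer.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal sketch of the LeNet-5 template: conv, pool, conv, pool, dense.
    The original Gaussian-RBF output layer is replaced by a plain linear layer."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
            nn.Tanh(),                         # LeNet used a saturating tanh
            nn.AvgPool2d(2),                   # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # the two fully connected layers
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # stand-in for the RBF output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # roughly 60,000 parameters
```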
AlexNet (2012)
The modern CNN era begins at one event: the ImageNet Large Scale Visual Recognition Challenge of 2012, where the winning entry was announced at an ECCV workshop in October. Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton had submitted AlexNet to the challenge, an annual benchmark in which entrants had to classify a million photographs into a thousand fine-grained categories such as Norwegian elkhound or toaster. Until that morning the leaderboard was dominated by clever hand-engineered features, SIFT descriptors, Fisher vectors, support vector machines, improving by a fraction of a percentage point each year. AlexNet won with a top-5 error of 15.3%, beating the second-place hand-engineered system by more than ten percentage points. That gap was unprecedented. Within a year every serious entry to ImageNet was a deep CNN, and an entire generation of computer-vision research changed direction.
The architecture itself is recognisably a scaled-up LeNet. Five convolutional layers and three fully connected layers feed into a 1,000-way softmax classifier. The first conv uses 96 filters of size 11 × 11 with stride 4, an aggressive subsampling that immediately reduces a 224 × 224 input to 55 × 55. Subsequent convs use smaller 5 × 5 and 3 × 3 filters. The whole network has roughly 60 million parameters, a thousand times more than LeNet. It was trained for six days on two NVIDIA GTX 580 GPUs (3 GB of memory each), with the layers split across both cards because the model did not fit on one, a pragmatic detail that gave the architecture its distinctive two-stream diagram.
Several novel ingredients made the difference between AlexNet and a merely larger LeNet. Most important was the choice of activation: ReLU, $\max(0, x)$, in place of the saturating $\tanh$. ReLU does not flatten out for large positive inputs, so gradients keep flowing through deep stacks. The original paper measured a sixfold speed-up in convergence from this change alone. Next was dropout at rate 0.5 in the two big fully connected layers, which forced the network not to rely on any single neuron and dramatically reduced overfitting. Then came data augmentation: random crops, horizontal flips, and a clever colour jitter based on principal-component analysis of the training pixels, which effectively multiplied the size of ImageNet several times over. Finally there was local response normalisation, a biologically inspired competitive normalisation across channels that was used after the first two convs; it has since been superseded by batch normalisation and is now of historical interest only.
A useful thing to notice about AlexNet's parameter count is where it lives. The first conv has about 35,000 parameters; the first fully connected layer (FC6), going from a 9,216-element flattened feature vector to 4,096 outputs, has about 37.7 million, roughly a thousand times more. The convolutions do almost all of the computation but hold a small fraction of the parameters. This imbalance is corrected in later architectures by replacing the giant FC head with global average pooling, but in 2012 the giant head was simply how things were done.
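The imbalance is easy to verify with two lines of arithmetic, using the filter shapes quoted above:

```python
# Where AlexNet's parameters live: first conv layer vs. first fully connected layer.
conv1 = 11 * 11 * 3 * 96 + 96      # 96 filters of 11x11x3, plus biases
fc6   = 9216 * 4096 + 4096         # 9,216-dim flattened features -> 4,096 units

print(conv1)           # 34,944       (~35 thousand)
print(fc6)             # 37,752,832   (~37.7 million)
print(fc6 // conv1)    # ~1,080: the dense layer holds roughly a thousand times more
```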
VGG (2014)
Two years after AlexNet, Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at Oxford published Very Deep Convolutional Networks for Large-Scale Image Recognition. Their VGG architecture pushed regularity to its logical conclusion. Every convolution is 3 × 3 with stride 1 and one pixel of zero-padding; every pooling layer is 2 × 2 with stride 2; channel counts double after each pooling stage. The only thing that changes between VGG-11, VGG-13, VGG-16 and VGG-19 is how many of those tiny conv layers are stacked. The 16- and 19-layer variants are the ones still cited today.
VGG's central insight is that a stack of small filters is strictly better than a single large one, given enough data and compute. Two stacked 3 × 3 convolutions see a 5 × 5 patch of the input, the receptive field grows by $k - 1$ at each layer, so after two 3 × 3 layers it has grown by $2 + 2 = 4$, taking the original 1-pixel field to 5. But two 3 × 3 layers carry $2 \cdot 3 \cdot 3 \cdot C^2 = 18 C^2$ parameters versus $5 \cdot 5 \cdot C^2 = 25 C^2$ for a single 5 × 5 layer with the same channel count. Three stacked 3 × 3 convolutions match a 7 × 7 receptive field with $27 C^2$ parameters versus $49 C^2$. And, crucially, the stack contains an extra ReLU non-linearity in the middle, giving the network more expressive power per parameter. The principle, prefer depth and small kernels over width and large kernels, has held up for a decade.
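The arithmetic generalises: $n$ stacked $k \times k$ layers cover the same receptive field as a single layer of size $n(k-1)+1$, at a cost of $n k^2 C^2$ parameters rather than $(n(k-1)+1)^2 C^2$, ignoring biases and assuming $C$ channels in and out. A short check of the two cases above:

```python
def stacked_vs_single(n: int, k: int = 3, C: int = 1):
    """Parameters of n stacked k x k conv layers vs. one conv layer with the same
    receptive field, for C input and C output channels (biases ignored)."""
    stacked = n * k * k * C * C
    single_size = n * (k - 1) + 1            # receptive field of the stack
    single = single_size * single_size * C * C
    return stacked, single

print(stacked_vs_single(2))   # (18, 25): two 3x3 layers vs one 5x5
print(stacked_vs_single(3))   # (27, 49): three 3x3 layers vs one 7x7
```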
VGG-16 has 138 million parameters, almost all of them in three enormous fully connected layers at the end. It reached 7.3% top-5 error on ImageNet, less than half of AlexNet's. Its architecture is so regular and its features so well-behaved that VGG remains a popular feature extractor today: the activations of intermediate layers such as conv3_3 and conv4_3 capture mid-level texture in a way that is useful as a perceptual loss for style-transfer and super-resolution networks, even though VGG itself is rarely the right choice as a classification backbone in 2026.
GoogLeNet / Inception (2014)
In the same 2014 ImageNet competition that VGG entered, Christian Szegedy and colleagues at Google introduced GoogLeNet (a deliberate pun on LeNet). It was 22 layers deep, achieved a top-5 error of 6.67%, beating VGG outright, and did so with roughly twenty times fewer parameters. The paper, Going Deeper with Convolutions, introduced the Inception module, which is the architectural idea that made the difference.
The motivating worry was that an image contains features at many scales, a small filter is right for an eye, a larger one for a face, a pooling operation for a textureless background, and choosing one kernel size for an entire layer forces a compromise. Inception sidesteps the choice by computing several in parallel and concatenating the results channel-wise. A single Inception module takes the input feature map, runs four operations on it simultaneously, a 1 × 1 convolution, a 3 × 3 convolution, a 5 × 5 convolution, and a 3 × 3 max pool, and stacks the four output feature maps along the channel axis to form the block's output. Multi-scale features in one block, with the network free to learn which scales matter where.
The trick that makes this affordable is the 1 × 1 convolution. A 1 × 1 conv with $C_{\text{in}}$ input channels and $C_{\text{mid}}$ output channels has $C_{\text{in}} \cdot C_{\text{mid}}$ parameters and is essentially a learned linear projection across the channel axis at every spatial location. In Inception, a 1 × 1 conv is placed before each expensive 3 × 3 or 5 × 5 conv to reduce the channel count from, say, 256 to 64, so that the expensive filter operates on a much narrower stack. The 1 × 1 conv is the cheapest way to mix information across channels, and it crops up everywhere in modern deep learning: as the pointwise half of depthwise separable convolutions in MobileNet, as the projection shortcut in ResNet, and as the value, key and query projections of self-attention.
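A minimal PyTorch sketch of one Inception module follows, with the 1 × 1 reductions placed in front of the expensive branches; the branch widths are illustrative rather than the paper's exact numbers, and ReLUs are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of an Inception module: four parallel branches concatenated along the
    channel axis. 1x1 convs keep the 3x3 and 5x5 branches cheap. Branch widths are
    illustrative; GoogLeNet tunes them per module. ReLUs omitted for brevity."""
    def __init__(self, c_in: int):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, 64, kernel_size=1),               # bottleneck reduction
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(c_in, 32, kernel_size=1),               # bottleneck reduction
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, 32, kernel_size=1),               # projection after pooling
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)    # stack outputs along the channel axis

x = torch.randn(1, 256, 28, 28)
print(InceptionModule(256)(x).shape)         # torch.Size([1, 288, 28, 28])
```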
GoogLeNet stacks nine Inception modules and finishes not with a giant fully connected head but with global average pooling, averaging each final feature map down to a single number per channel, followed by a single small classifier layer. Together, the 1 × 1 bottlenecks and the GAP head are why the parameter count is only 6.8 million, about twenty times smaller than VGG-16. Inception has gone through several revisions: Inception-v2 added batch normalisation, Inception-v3 (2015) refactored larger filters into stacks of asymmetric 1 × 7 and 7 × 1 convolutions, and Inception-v4 / Inception-ResNet (2016) merged the Inception module with the residual connections we are about to meet.
ResNet (2015)
If you take only one architectural lesson from this chapter, take this one. The residual connection, introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun at Microsoft Research Asia in Deep Residual Learning for Image Recognition (CVPR 2016, posted on arXiv in December 2015), is the single most important architectural innovation in the history of deep learning. ResNet won ImageNet 2015 with a top-5 error of 3.57%, well below the rough human baseline of around 5% on this benchmark, and the same idea has since propagated into Transformers, into diffusion models, into protein folders. When you read about a "skip connection" or a "residual stream" in a 2026 paper, you are reading about the descendants of ResNet.
The motivating problem was a puzzle that had been troubling the field for a year or two. As researchers built ever-deeper plain CNNs, they had expected accuracy to keep climbing. Instead it got worse. He et al. noted that a 56-layer plain CNN had higher training error than a 20-layer plain CNN on CIFAR-10, even though the deeper network was strictly more expressive, it could in principle replicate the shallow one by setting its extra layers to the identity. They called this the degradation problem. It was not overfitting (otherwise test error would worsen but training error would still drop). It was an optimisation failure: stochastic gradient descent simply could not find good weights for very deep plain networks.
The fix is to add a shortcut connection that skips two or three layers and adds the input directly to the output:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}. $$
In words: instead of asking the layers to learn the desired mapping $H(\mathbf{x})$ from scratch, we ask them to learn the residual $\mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$. If the optimal mapping happens to be the identity, the network only needs to push $\mathcal{F}$ towards zero, which is much easier than learning the identity through a stack of nonlinear layers. If the optimal mapping is close to the identity, the network only needs to learn a small correction. Both situations are common, and both are now easy for SGD to discover.
There is an equally important reason the fix works, and it is about gradients. Differentiating $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$ with respect to the input gives
$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} + I. $$
That identity term means a copy of the upstream gradient passes straight through the block: even if $\partial \mathcal{F} / \partial \mathbf{x}$ shrinks towards zero, the block's Jacobian stays close to the identity. In a deep stack of residual blocks, the chain rule ends up with a sum of paths through the shortcuts, not just a single product through the layers, so gradients reach early weights without being multiplicatively attenuated. Vanishing gradients, the bane of deep networks since the 1990s, are tamed.
The basic residual block contains two 3 × 3 convolutions with batch normalisation and ReLU, plus the shortcut. When the spatial size or channel count changes across a block, the shortcut becomes a 1 × 1 convolution that projects the input to match. Deeper ResNets (ResNet-50, ResNet-101, ResNet-152) use a bottleneck variant in which each block has three layers: a 1 × 1 conv to shrink the channel count, a 3 × 3 conv at the reduced count, and a 1 × 1 conv to expand back. The bottleneck is dramatically cheaper than the basic block at high channel counts, which is what makes very deep ResNets affordable.
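A minimal PyTorch sketch of the bottleneck block described above is given below; the projection shortcut covers the case where the spatial size or channel count changes across the block.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of a ResNet bottleneck block: 1x1 reduce, 3x3, 1x1 expand,
    with the input added back through a shortcut, i.e. y = F(x) + x."""
    def __init__(self, c_in: int, c_mid: int, c_out: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # When the shape changes, the shortcut becomes a 1x1 projection; otherwise identity.
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.residual(x) + self.shortcut(x))   # F(x) + x

x = torch.randn(1, 256, 56, 56)
print(BottleneckBlock(256, 64, 256)(x).shape)   # torch.Size([1, 256, 56, 56])
```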
ResNet-50 (a 7 × 7 stem followed by four stages of bottleneck blocks with 3, 4, 6, and 3 blocks per stage, 25.6 million parameters, about 4.1 GFLOPs for one 224 × 224 forward pass) is the workhorse of computer vision. It is the default backbone for most detection and segmentation systems, the default feature extractor for transfer learning, and the de facto baseline against which every new architecture is compared. Ten years after publication it is still in production use across the industry.
Variants and extensions
After ResNet, the field spent several years exploring variations on its theme. Each of the following is a useful idea in its own right, and each is a useful name to recognise.
DenseNet (Gao Huang et al., 2017). Take residual connections to the limit: every layer is connected to every preceding layer within a dense block, not just the previous one. The output of layer $\ell$ is $\mathbf{x}_\ell = H_\ell([\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1}])$, where $[\cdot]$ denotes concatenation along the channel axis. Each layer adds a small number $k$ of new feature maps, the growth rate, and reuses everything that came before. Feature reuse is maximal, gradients reach early layers along very short paths, and DenseNet-121 matches ResNet-50's accuracy with only 8 million parameters. The downside is memory: naive implementations accumulate feature maps and consume a lot of activation memory at training time.
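A minimal sketch of the concatenation pattern follows; the real $H_\ell$ in DenseNet is a BN-ReLU-Conv composite with a 1 × 1 bottleneck, simplified here to make the growth-rate bookkeeping visible.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a DenseNet dense block: each layer sees the concatenation of all
    previous feature maps and contributes k (the growth rate) new ones.
    The real H_l includes a 1x1 bottleneck, omitted here for clarity."""
    def __init__(self, c_in: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(c_in + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_in + i * growth_rate, growth_rate, kernel_size=3, padding=1),
            )
            for i in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}]), concatenated along channels
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 32, 32)
print(DenseBlock(64, growth_rate=32, num_layers=4)(x).shape)  # [1, 192, 32, 32]
```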
MobileNet (Andrew Howard et al., 2017). The first widely deployed mobile-grade CNN. MobileNet replaces every standard convolution with a depthwise separable block, a depthwise conv (one filter per input channel, no cross-channel mixing) followed by a 1 × 1 pointwise conv (cross-channel mixing only), for an 8–9× reduction in compute at almost no loss of accuracy. Two hyperparameters, a width multiplier $\alpha$ and a resolution multiplier $\rho$, let practitioners trade accuracy for cost on a continuum from desktop down to watch. MobileNet-v2 (2018) added inverted residual blocks, expanding the channel count before the depthwise conv rather than after, which surprisingly works better. MobileNet-v3 (2019) layered on squeeze-and-excitation attention and used neural architecture search to choose layer-by-layer hyperparameters.
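A minimal sketch of the substitution, and of the cost ratio it buys for 3 × 3 kernels (the 8–9× figure quoted above), assuming equal input and output channel counts:

```python
import torch.nn as nn

def depthwise_separable(c_in: int, c_out: int, stride: int = 1) -> nn.Sequential:
    """Sketch of MobileNet-v1's basic block: a depthwise 3x3 conv (groups=c_in, so each
    filter sees exactly one channel) followed by a 1x1 pointwise conv that mixes channels,
    with BatchNorm and ReLU after each conv."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride, padding=1,
                  groups=c_in, bias=False),                  # depthwise: spatial only
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),   # pointwise: channels only
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

# Cost ratio vs. a standard 3x3 conv (multiply-adds per output position, biases ignored).
c_in, c_out, k = 256, 256, 3
standard  = k * k * c_in * c_out
separable = k * k * c_in + c_in * c_out
print(standard / separable)   # ~8.7x cheaper
```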
EfficientNet (Mingxing Tan and Quoc Le, 2019). A clean answer to the question, "given a fixed compute budget, what is the optimal way to scale up?" The recipe is compound scaling: scale depth, width, and input resolution together by powers of a single coefficient $\phi$, with constants $\alpha, \beta, \gamma$ chosen so that increasing $\phi$ by one roughly doubles compute. Variants B0 through B7 correspond to progressively larger values of $\phi$, growing from 5.3 million parameters at B0 to 66 million at B7. EfficientNet-B7 reached 84.3% ImageNet top-1 in 2019 with a fraction of the parameters of comparable CNNs.
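The scaling rule itself fits in a few lines. The sketch below uses the constants reported in the EfficientNet paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$, found by a small grid search on the B0 baseline); since compute grows roughly as depth · width² · resolution², each unit step in $\phi$ multiplies FLOPs by about $\alpha \beta^2 \gamma^2 \approx 2$.

```python
# Compound scaling: depth, width and resolution grow together with one knob, phi.
alpha, beta, gamma = 1.2, 1.1, 1.15   # constants reported in the EfficientNet paper

def scale(phi: float, base_depth=1.0, base_width=1.0, base_resolution=224):
    depth = base_depth * alpha ** phi
    width = base_width * beta ** phi
    resolution = base_resolution * gamma ** phi
    flops_factor = (alpha * beta ** 2 * gamma ** 2) ** phi   # roughly 2 ** phi
    return round(depth, 2), round(width, 2), round(resolution), round(flops_factor, 2)

for phi in range(4):
    print(phi, scale(phi))   # each unit step in phi roughly doubles compute
```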
ConvNeXt (Zhuang Liu, Hanzi Mao et al., 2022). For a few years after Vision Transformers arrived in 2020, conventional wisdom held that Transformers had simply superseded CNNs. A ConvNet for the 2020s pushed back. Starting from a ResNet-50 baseline, the authors applied, one at a time, the design choices that distinguish modern Transformers, a patchify stem, larger 7 × 7 depthwise kernels, an inverted bottleneck, GELU activations, LayerNorm in place of BatchNorm, fewer normalisation and activation layers per block, and produced a pure CNN that matches or beats Swin Transformer on ImageNet at equal compute. ConvNeXt is a useful reminder that recent progress is partly architectural folklore and partly genuinely novel, and it is hard to tell which is which until someone does the careful ablation.
When to use each
You will rarely train any of these from scratch in 2026. The practical question is which pretrained backbone to start from for a new task.
For most computer-vision projects, ResNet-50 is still the right first try. It is the most widely supported architecture in every framework, every model zoo and every transfer-learning recipe; it gives a strong baseline; and if it works you are done. EfficientNet-B0 to B3 are good when compute matters: they hit similar accuracy at a fraction of the FLOPs. MobileNet-v3 is the right choice for on-device deployment where latency and memory budgets are tight.
If you have a lot of training data and a lot of compute, ConvNeXt competes with vision transformers and is often a better choice when you want the convolutional inductive bias for sample efficiency. Vision Transformers (ViT), which we meet in Chapter 13, win when pretraining data is abundant, JFT-300M, LAION-5B and beyond, and they integrate naturally with multimodal text-image models such as CLIP. For image generation, the workhorse is no longer a pure classifier at all but the U-Net used inside diffusion models, which we discuss in Chapter 14. And for medical imaging, nnU-Net remains the strong default for segmentation tasks despite a decade of newer ideas.
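In practice, "start from a pretrained backbone" usually means a few lines like the following, shown here with torchvision's ResNet-50 and assuming a recent torchvision (0.13 or later); the number of classes and the decision to freeze the backbone are placeholders for your own task, and the same pattern applies to the other backbones.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 and swap the classifier head for a new task.
weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)

num_classes = 5                                   # placeholder: your task's classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# One common recipe: freeze everything except the new head (cheap, good with little data).
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc")

optimizer = torch.optim.AdamW(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-3)

# Use the preprocessing pipeline the pretrained weights expect.
preprocess = weights.transforms()
```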
In short: ResNet for safety, EfficientNet for budget, ConvNeXt for accuracy at scale, MobileNet for mobile, ViT for very-large-data regimes, U-Net for generation. None of these are dead architectures; all six are in production somewhere in 2026.
What you should take away
The lineage matters because each generation fixes the previous one's bottleneck. LeNet showed the conv-pool-dense template worked. AlexNet replaced $\tanh$ with ReLU and added dropout, augmentation and GPU training. VGG replaced large kernels with stacks of small ones. Inception added multi-scale parallel paths and 1 × 1 bottlenecks. ResNet added the residual shortcut. MobileNet added depthwise separable convolutions. Each step was a single, articulable idea.
The residual connection $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$ is the single most important architectural innovation in deep learning. It tames the degradation problem, prevents vanishing gradients, and now appears in every Transformer, every diffusion model, and every modern CNN. Internalising why it works is more valuable than memorising any specific architecture.
The 1 × 1 convolution is everywhere. It is the cheapest way to mix information across channels. You will meet it as the bottleneck in Inception, the projection shortcut in ResNet, the pointwise half of depthwise separable convolutions in MobileNet, and the value, key and query projections of self-attention. Recognising it across contexts is a useful unifying skill.
Parameter count and compute are not the same thing. AlexNet, VGG and the early CNNs put almost all of their parameters in the fully connected head while the convolutions did almost all of the computation. Modern architectures correct this with global average pooling and stay parameter-light.
In 2026 you almost never train from scratch, you fine-tune from a pretrained backbone. ResNet-50 is the safe default, EfficientNet the budget choice, ConvNeXt the accuracy-at-scale choice, MobileNet the on-device choice, and ViT the very-large-data choice. Picking the right backbone for the right setting is the practical skill that this section equips you with.