11.1 Convolution

[Video (1:10): 2D convolution, kernel slides over input. A 3×3 kernel sweeps a 9×9 input, filling in a feature map cell by cell.]
Convolutional neural networks (CNNs) are the architecture that made image recognition work. Where a fully connected network treats every pixel as an independent feature unrelated to its neighbours, a CNN bakes in two priors that match the structure of natural images. The first is **local connectivity**: each neuron in a convolutional layer looks only at a small spatial patch of its input, on the assumption that the most informative visual features (edges, corners, blobs of colour, fragments of texture) live in small neighbourhoods rather than at opposite ends of the image. The second is **weight sharing**: the same small bank of weights, called a **filter** or **kernel**, is applied identically at every spatial position. A vertical-edge detector that works in the top-left corner is exactly the same vertical-edge detector that works in the bottom-right.

Together these two constraints produce three consequences. They reduce parameter counts by orders of magnitude relative to a fully connected layer of comparable capacity. They build in translation equivariance: shifting the input shifts the output by exactly the same amount, so the network does not have to relearn each visual pattern at every position. And they produce a representation that composes cleanly through depth: stacking convolutional layers builds progressively larger receptive fields and progressively more abstract features, mirroring the hierarchical organisation that Hubel and Wiesel discovered in the mammalian visual cortex. This section motivates the convolution operation, introduces its formal definition, walks through its hyperparameters, and prepares the ground for the architectural lineage that follows.

This chapter specialises the deep-learning machinery of Chapters 9–10 to grid-structured data: images first and foremost, but also one-dimensional signals such as audio waveforms and time series, and three-dimensional volumes such as medical scans and video. Chapters 12 and 13 then turn to sequence modelling and the attention mechanism, which together produce the Transformer architecture that now coexists with, and in some settings displaces, the CNN.

Symbols Used Here
| Symbol | Meaning |
| --- | --- |
| $\mathbf{X}$ | input feature map |
| $\mathbf{K}$ | convolution kernel (filter) |
| $\mathbf{Y}$ | output feature map |
| $H, W, C$ | height, width, channels |
| $k$ | kernel size |
| $s$ | stride |
| $p$ | padding |

The motivating problem: full connectivity is wasteful

Consider an unremarkable colour image at the standard ImageNet resolution of 224×224 pixels with three colour channels (red, green, blue). Flatten this into a vector and you have $224 \times 224 \times 3 = 150{,}528$ features. A single fully connected layer mapping this vector into 1,000 hidden units has $150{,}528 \times 1{,}000 \approx 1.5 \times 10^8$ weight parameters in that one layer alone. A network deep enough to be useful might have ten or twenty such layers, and the parameter count balloons into the billions before we have started to think about training data or generalisation. Most of those parameters are wasted, in three precise senses.

First, the parameters are spatially redundant. A pixel in the top-left corner of the image and a pixel in the bottom-right corner have, at low layers, no direct semantic relationship. Whatever cat-detection circuit the network learns for the top-left corner is, for a fully connected layer, a wholly separate circuit from any cat-detection it learns for the bottom-right corner. Each circuit uses different weights, because each input pixel has its own column of weights. The network has to learn the same operation many times over, once for every position the operation might be needed.

Second, the parameters are not robust to translation. Shift the image by a single pixel and the input vector becomes a completely different point in the 150,528-dimensional input space. A network that recognised the original input has no guarantee of recognising the shifted one: the weights tied to the new pixel locations may simply not have been trained to respond to that pattern. A fully connected classifier therefore has to be trained on every translation of every visual pattern of interest, with the parameter count and training-data appetite that implies.

Third, the parameters do not respect locality. A fully connected layer treats every pair of input pixels as potentially related, including pairs at opposite corners that share no useful information except via the chain of intermediate neighbourhoods between them. Modelling that long-range relationship directly, with its own weight, is wasteful when the relationship is in fact mediated by, and reconstructable from, local interactions.

The convolutional layer addresses all three problems with a single, deceptively simple change of architecture: replace the dense matrix multiplication with a small filter that slides across the input. The filter has the same weights wherever it lands, which removes the spatial redundancy and the fragility to translation. It sees only a small neighbourhood at any one time, which respects locality. The constraint is not arbitrary: it is precisely the symmetry that natural images possess. A useful inductive bias is one that matches reality, and translation symmetry is as close to a universal property of natural-image statistics as we have.

Convolution: a small filter slides across the input

The basic operation is mechanical. Take a small filter, say a $3 \times 3$ matrix of weights, and slide it across the input image. At each position, multiply the filter element-wise with the underlying $3 \times 3$ patch of pixels, sum the nine products, and write the resulting scalar into the corresponding position of the output. Move the filter one step to the right and repeat. When you reach the end of a row, drop down one step and start again from the left. The output is a new image, usually called a feature map, whose value at each position records how strongly the filter pattern matches the input at that position.
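
A direct translation of this description into code makes the mechanics explicit. The sketch below is a minimal NumPy implementation, assuming stride 1, no padding and a single channel; real frameworks use much faster algorithms, but the arithmetic is identical.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]     # the k x k window currently under the kernel
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum to a scalar
    return out
```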

To make this concrete, take a $5 \times 5$ patch of an image and the classical Prewitt horizontal-edge kernel

$$ \mathbf{K} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{pmatrix}. $$

This kernel responds to brightness gradients running from top to bottom: positive weights along the top row, negative along the bottom, zeros across the middle. Where the underlying patch contains a horizontal edge (bright above, dark below) the inner product of the kernel with the patch is large and positive. Where the patch contains the opposite edge (dark above, bright below) the response is large and negative. Where the patch is uniform, the positive contributions from the top row cancel the negative contributions from the bottom row and the response is approximately zero.

The same kernel slid across an entire image therefore produces an edge map: a feature map in which large absolute values flag locations of horizontal edges and zeros flag flat regions. A second kernel, with positive weights down the left column and negative weights down the right, produces a vertical-edge map. A third, organised as a centre-surround pattern, produces a blob detector. Each kernel produces one feature map; if a layer applies $C_{\text{out}}$ kernels in parallel, its output is a stack of $C_{\text{out}}$ feature maps, often called a feature volume.
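
Continuing the `conv2d` sketch above, a toy image with a bright upper half and a dark lower half reproduces the response pattern just described (the pixel values are invented for the example):

```python
# A 6x6 toy image: bright (1.0) in the upper half, dark (0.0) in the lower half.
image = np.zeros((6, 6))
image[:3, :] = 1.0

prewitt_horizontal = np.array([[ 1,  1,  1],
                               [ 0,  0,  0],
                               [-1, -1, -1]], dtype=float)

edge_map = conv2d(image, prewitt_horizontal)
print(edge_map)
# Output rows whose 3x3 windows straddle the bright/dark boundary read 3.0 (a strong
# positive response); rows entirely inside a uniform region read 0.0.
```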

Convolution, defined this way, is in fact what mathematicians call cross-correlation: a true convolution flips the kernel before multiplying. Because the kernel is learned by gradient descent rather than specified by hand, the flip is irrelevant; gradient descent will discover whichever orientation minimises the loss, and the network does not care which name we use for the operation. Deep-learning frameworks therefore call it convolution without apology, and we shall do the same.

The hand-written Prewitt and Sobel filters of classical image processing have an afterlife in modern CNNs. When you train a deep network from scratch on natural images and inspect the kernels learned by the very first convolutional layer, you find, without telling the network this is what you wanted, exactly such oriented edge detectors, plus colour-opponent blob detectors and small textured patches. This is not a coincidence. It is a direct empirical confirmation that oriented edges are the optimal first-layer features for natural images, and that the convolutional architecture is the right inductive shape for finding them.

Two key properties

The two architectural commitments of a convolutional layer can be stated very compactly.

  1. Local connectivity. Each output unit looks at only a $k \times k$ patch of the input, where $k$ is small: 3 in nearly every modern design, occasionally 5, 7 or 11 in older or specialised architectures.
  2. Weight sharing. The same filter is applied at every spatial position of the input, so the same set of weights produces the response at every location.

The consequences are dramatic. The parameter count of a single filter is $k \times k$, and crucially, this is independent of the image size. A filter that processes a $32 \times 32$ image has the same number of weights as a filter that processes a $1024 \times 1024$ image. A layer with $C_{\text{in}} = 3$ input channels, $C_{\text{out}} = 64$ output channels and a $3 \times 3$ kernel has $3 \cdot 3 \cdot 3 \cdot 64 = 1{,}728$ weights plus 64 bias terms, a total of 1,792 parameters, utterly negligible compared with the 150 million of the fully connected layer that processes the same input. Even a deep stack of such layers comes in well under a million parameters. Production CNNs grow much larger because they widen channels and stack many layers, but the per-layer cost stays modest in absolute terms, leaving the bulk of the model's capacity to be allocated across many channels or at the head of the network rather than wasted on per-pixel weights.

The contrast is the central economic argument for convolutional architectures. A fully connected layer mapping a 224×224 RGB image to 1,000 outputs uses $\sim 1.5 \times 10^8$ parameters; a convolutional layer with 64 filters of $3 \times 3$ uses $1{,}792$ parameters. The convolutional layer is roughly eighty thousand times smaller. Stacking ten such convolutions still costs less than 0.1% of the fully connected baseline. This is what makes deep visual models practical at all.
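
The same comparison as a few lines of Python, under the assumptions already used above (3 input channels, 64 filters of $3 \times 3$):

```python
# Fully connected layer: 224*224*3 inputs -> 1,000 outputs (weights only, as above).
fc_params = (224 * 224 * 3) * 1_000               # 150,528,000, roughly 1.5e8

# Convolutional layer: 64 filters of shape (3, 3, 3), plus one bias per filter.
conv_params = 3 * 3 * 3 * 64 + 64                 # 1,792

print(f"{fc_params:,} vs {conv_params:,}")        # 150,528,000 vs 1,792
print(f"ratio: {fc_params / conv_params:,.0f}x")  # 84,000x
```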

The second consequence is translation equivariance. Formally, if $T_v$ denotes a spatial shift by vector $v$ and $f$ denotes the convolution operation, then $f(T_v \mathbf{X}) = T_v f(\mathbf{X})$. Shifting the input by one pixel produces the original output shifted by one pixel. This property is preserved through stacks of convolutions: a CNN built from convolutions and ReLUs, without any global, spatially aware operations, is equivariant from input to output. Pooling layers add a controlled amount of invariance, where small input shifts produce no change at all in the output. The combination of equivariant convolutions and locally invariant pooling is what gives modern vision models their robustness to small translations of objects within the frame.
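
Equivariance is easy to check numerically with the `conv2d` sketch from earlier; the only subtlety is that `np.roll` wraps around circularly, so the comparison skips the wrapped border row:

```python
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))

shifted = np.roll(image, shift=1, axis=0)                    # T_v X: input shifted down one pixel

shift_then_conv = conv2d(shifted, kernel)                    # f(T_v X)
conv_then_shift = np.roll(conv2d(image, kernel), 1, axis=0)  # T_v f(X)

# Away from the wrapped-around first row, the two orders of operation agree exactly.
print(np.allclose(shift_then_conv[1:], conv_then_shift[1:]))  # True
```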

Channels and feature maps

Real images have three input channels (red, green, blue), and the input to a convolutional layer is therefore a three-dimensional tensor of shape $(C_{\text{in}}, H, W)$. A single filter must process all input channels jointly, so the filter itself is a three-dimensional tensor of shape $(C_{\text{in}}, k, k)$, with one $k \times k$ slice per input channel. The filter response at a given spatial position is computed by element-wise multiplication of all three slices with the corresponding three-channel patch of the input, summing all $C_{\text{in}} \cdot k \cdot k$ products into a single scalar.

A layer applies many such filters in parallel, typically 64, 128, 256 or 512 of them, and concatenates their outputs along a new channel axis. The output of the layer is therefore another three-dimensional tensor, of shape $(C_{\text{out}}, H_{\text{out}}, W_{\text{out}})$, where each of the $C_{\text{out}}$ output channels is the feature map produced by one filter. Hidden layers in a CNN consequently have their own channel dimensions, and a typical architecture progresses from 3 channels (input) to 64, then 128, then 256, then 512 (deepest convolutional layer), doubling channel count whenever spatial resolution is halved so that total computation per layer stays roughly constant.
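
A minimal shape check in PyTorch, using the layer sizes from the running example, makes the tensor bookkeeping concrete:

```python
import torch
import torch.nn as nn

# 3 input channels -> 64 output channels, 3x3 kernel, stride 1, "same" padding.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 224, 224)   # (batch, C_in, H, W)
y = conv(x)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3]): one (C_in, k, k) filter per output channel
print(y.shape)            # torch.Size([1, 64, 224, 224]): one feature map per filter
```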

Each output channel learns to detect a specific pattern. Empirically the first convolutional layer learns oriented edges, colour-opponent blobs, and small textured patches. The second and third layers learn corners, junctions, contour fragments, and texture combinations such as fur, mesh, and brick. The middle layers learn object parts: wheels, eyes, fingers, leaves. The deepest layers learn whole-object templates: faces, dogs, cars, text. This progression is emergent; no part of the loss function instructs the network to organise itself this way, and it mirrors the well-documented hierarchy of the primate visual cortex from V1 through inferotemporal cortex.

Stride and padding

Two simple hyperparameters control how the filter sweeps across the input.

Stride $s$ is the distance the filter moves between samples. With $s = 1$ the filter visits every position, and the output spatial resolution roughly matches the input. With $s = 2$ the filter visits every other position and the output is downsampled by a factor of two in each dimension. Stride is the principal mechanism by which convolutional layers reduce spatial resolution while increasing channel depth.

Padding $p$ adds $p$ rows of zeros around the top and bottom and $p$ columns of zeros around the left and right of the input before convolution. Without padding (the so-called "valid" mode, $p = 0$) the output shrinks by $k - 1$ along each axis, since the filter cannot place its centre on the border pixels. With $p = (k-1)/2$ for odd $k$, known as "same" padding, the output retains exactly the input spatial size when $s = 1$. Padding lets us stack many convolutions without the feature map shrinking to nothing, and it avoids the boundary artefacts that arise when interior pixels are sampled by the filter many more times than border pixels.

The general formula for output spatial size, given input size $N$, kernel $k$, padding $p$ and stride $s$, is

$$ N_{\text{out}} = \left\lfloor \frac{N + 2p - k}{s} \right\rfloor + 1. $$

A small library of cases is worth committing to memory: $k = 3, s = 1, p = 1$ preserves spatial size (the workhorse of every modern architecture); $k = 3, s = 2, p = 1$ halves spatial size; $k = 1, s = 1, p = 0$ leaves spatial size unchanged but mixes channels (the "pointwise convolution"). A more general formula, allowing for dilation $d$ (implicit zeros inserted between kernel taps), replaces $k$ with the effective kernel extent $1 + d(k-1)$.
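
The output-size formula, including the dilation generalisation, fits in a small helper (the function name is ours); the memorised cases fall out directly:

```python
def conv_output_size(n: int, k: int, s: int = 1, p: int = 0, d: int = 1) -> int:
    """Spatial output size along one axis for input n, kernel k, stride s, padding p, dilation d."""
    k_eff = 1 + d * (k - 1)              # effective kernel extent with dilation
    return (n + 2 * p - k_eff) // s + 1

print(conv_output_size(224, k=3, s=1, p=1))  # 224: "same" convolution, size preserved
print(conv_output_size(224, k=3, s=2, p=1))  # 112: spatial size halved
print(conv_output_size(224, k=1, s=1, p=0))  # 224: pointwise convolution
```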

Pooling

Pooling layers are dedicated downsamplers. Where convolutions have learnable weights and can mix channels, pooling has no parameters and acts on each channel independently.

Max pooling takes a small window, typically $2 \times 2$, and replaces each non-overlapping window with the maximum activation it contains. The effect is to halve the spatial resolution while keeping the strongest evidence for whatever feature each channel detects. Max pooling adds a small amount of translation invariance, since shifting the input by up to one pixel within the pooling window leaves the maximum unchanged. Average pooling replaces each window with its arithmetic mean and is smoother but generally slightly less effective for classification, where strong-response retention matters more than averaging.

Global average pooling is a special case in which the pooling window covers the entire spatial extent. Applied after the deepest convolutional layer, it reduces a feature map of shape $(C, H, W)$ to a vector of length $C$, one number per channel. A linear classifier on this vector has only $C \cdot \text{(number of classes)}$ parameters, dramatically fewer than the flatten-then-fully-connected head of older designs. Global average pooling is now standard at the head of nearly every modern CNN.
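
Both pooling operations take only a few lines of NumPy; this sketch assumes non-overlapping $2 \times 2$ windows and even spatial dimensions:

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Non-overlapping 2x2 max pooling on one (H, W) channel; H and W must be even."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)  # group pixels into 2x2 blocks
    return blocks.max(axis=(1, 3))            # keep the strongest activation in each block

def global_average_pool(x: np.ndarray) -> np.ndarray:
    """Reduce a (C, H, W) feature volume to a length-C vector: one mean per channel."""
    return x.mean(axis=(1, 2))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(feature_map))                              # [[ 5.  7.] [13. 15.]]
print(global_average_pool(np.random.rand(512, 7, 7)).shape)   # (512,)
```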

From AlexNet to ResNet to ViT

The lineage of convolutional architectures spans more than three decades and frames the rest of this chapter. LeNet (LeCun and colleagues, 1989), refined into the better-known LeNet-5 (1998), was a 7-layer network for handwritten-digit recognition that ran in production on bank cheques and US Postal Service mail. AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won the ImageNet challenge by ten percentage points using eight layers, ReLU activations, dropout regularisation, and GPU training; this single result triggered the deep-learning revolution in vision. VGG (Simonyan and Zisserman, 2014) demonstrated that maximum architectural uniformity (every convolution $3 \times 3$, every pooling $2 \times 2$ with stride 2) produced strong results when made deep enough, with 16- and 19-layer variants. GoogLeNet / Inception (Szegedy and colleagues, 2014) used multi-scale Inception modules to capture features at many spatial scales while keeping parameter count low.

The decisive architectural advance was ResNet (He, Zhang, Ren, Sun, 2015), which introduced the residual connection: a shortcut that skips two or three layers and adds the input directly to the output. ResNet trained networks of 152 layers and beyond, beating the human-baseline ImageNet error for the first time, and the residual connection has been a near-universal building block ever since. Subsequent designs refined this template: DenseNet (2017) connected every layer to every preceding layer within a block; MobileNet (2017) and EfficientNet (2019) used depthwise separable convolutions and compound scaling for mobile-grade efficiency; ConvNeXt (2022) modernised the ResNet design with $7 \times 7$ depthwise kernels, GELU activations and LayerNorm. The Vision Transformer (Dosovitskiy and colleagues, 2020) replaced convolutions altogether with image patches and self-attention; with sufficient pretraining data, ViTs match or exceed CNNs on the standard benchmarks, and hybrid architectures combining a convolutional stem with attention in deeper layers now dominate the practical landscape.

We will work through each of these architectures in detail in §11.3. The recurring theme is that every architectural step solved a problem the previous step had created: parameter explosion, vanishing gradients, mismatched receptive fields, mobile compute constraints, the limits of inductive bias. The history of CNNs is the history of those constraints being softened, one at a time.

What you should take away

  1. Convolutional layers replace dense matrix multiplication with a small filter that slides across the input, applying the same weights at every spatial position.
  2. The two architectural commitments, local connectivity and weight sharing, produce three consequences: parameter counts independent of image size, translation equivariance, and a hierarchical feature representation that composes cleanly through depth.
  3. A fully connected layer mapping a 224×224 RGB image to 1,000 hidden units uses $\sim 1.5 \times 10^8$ parameters; a convolutional layer with 64 filters of $3 \times 3$ uses 1,792. The economy is nearly five orders of magnitude.
  4. Stride controls spatial downsampling, padding preserves spatial size at layer boundaries, and channel count grows with depth as spatial resolution shrinks; pooling, especially global average pooling at the head, reduces feature volumes to compact vectors.
  5. The architectural lineage from LeNet through AlexNet, VGG, Inception, ResNet, MobileNet, EfficientNet and ConvNeXt to the Vision Transformer is a sequence of solutions to the problems each previous step exposed; it is the practical scaffolding against which every new vision architecture is still measured.
