Also known as: CNN, ConvNet
A convolutional neural network (CNN) is a neural network whose hidden layers are dominated by convolutional layers that apply learned filters across an input grid (image, spectrogram, etc.) with weight sharing. The key architectural elements:
Convolutional layer: applies $C_\mathrm{out}$ filters $K \in \mathbb{R}^{k_h \times k_w \times C_\mathrm{in}}$ across spatial positions of an input feature map $X \in \mathbb{R}^{H \times W \times C_\mathrm{in}}$:
$$Y[i, j, c] = \sum_{m, n, c'} K_c[m, n, c'] \cdot X[i+m, j+n, c'] + b_c$$
Activation function (typically ReLU): $\mathrm{ReLU}(Y) = \max(0, Y)$ applied element-wise.
Pooling layer (max or average over local windows): downsamples spatial dimensions while preserving the most salient activations. Max pooling with $p \times p$ windows takes $\max$ over each window.
Fully-connected layers at the top map the final feature map to class logits.
Standard architectures:
- LeNet-5 (LeCun 1998): 5 layers, the original CNN template (conv-pool-conv-pool-FC-FC-FC).
- AlexNet (Krizhevsky 2012): 8 layers, ReLU, dropout, GPU training, won ImageNet.
- VGG (Simonyan & Zisserman 2014): 16-19 layers, $3 \times 3$ convs throughout.
- GoogLeNet/Inception (Szegedy 2014): parallel paths of different filter sizes; $1 \times 1$ convs for dimensionality reduction.
- ResNet (He 2015): residual connections enable training of $\geq 100$-layer networks.
- EfficientNet (Tan & Le 2019): joint scaling of depth, width, and resolution.
Receptive field: each unit of layer $l$ depends on a region of the input determined by the cumulative kernel sizes and strides of layers $1, \ldots, l$. Deeper layers have larger receptive fields, allowing the network to integrate increasingly global context.
Inductive biases of CNNs: translation equivariance (a feature detected at one spatial position is detected at any), locality (each unit sees only a local input region), hierarchy (deeper layers compose features from earlier layers). These biases match natural-image statistics and were what made CNNs dramatically more sample-efficient than fully-connected networks on vision tasks.
Modern post-Transformer: Vision Transformers (ViT, 2020) and ConvNeXt (2022) blur the line between CNN and Transformer; the dominant architectures of the 2020s combine both ideas.
Video
Related terms: Convolution, AlexNet, ResNet, Vision Transformer, yann-lecun
Discussed in:
- Chapter 11: CNNs, CNNs in Vision