Convolutional Neural Network, Glossary, Textbook of AI

Also known as: CNN, ConvNet

A convolutional neural network (CNN) is a neural network whose hidden layers are dominated by convolutional layers that apply learned filters across an input grid (image, spectrogram, etc.) with weight sharing. The key architectural elements:

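Convolutional layer: applies $C_\mathrm{out}$ filters $K_c \in \mathbb{R}^{k_h \times k_w \times C_\mathrm{in}}$ (one per output channel $c$, each with a bias $b_c$) across spatial positions of an input feature map $X \in \mathbb{R}^{H \times W \times C_\mathrm{in}}$. With stride 1 and no padding, and with $m$, $n$, $c'$ ranging over kernel height, kernel width, and input channels: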

$$Y[i, j, c] = \sum_{m, n, c'} K_c[m, n, c'] \cdot X[i+m, j+n, c'] + b_c$$
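The formula can be made concrete with a direct, unoptimized sketch in plain Python. It assumes stride 1 and no padding, stores tensors as nested lists, and uses hypothetical names (`conv2d`, `sum_filter`) chosen only for illustration; real libraries use far faster implementations.

```python
def conv2d(x, kernels, biases):
    """Stride-1, no-padding convolution, written to mirror the formula term by term.

    x:       H x W x C_in nested lists
    kernels: C_out filters, each k_h x k_w x C_in
    biases:  C_out scalars
    returns: (H - k_h + 1) x (W - k_w + 1) x C_out nested lists
    """
    H, W, c_in = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = H - k_h + 1, W - k_w + 1
    y = [[[0.0] * len(kernels) for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for c, (k, b) in enumerate(zip(kernels, biases)):
                total = b  # bias b_c
                for m in range(k_h):
                    for n in range(k_w):
                        for cp in range(c_in):
                            total += k[m][n][cp] * x[i + m][j + n][cp]
                y[i][j][c] = total
    return y


# 3 x 3 single-channel input and one 2 x 2 filter that sums its window.
x = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
sum_filter = [[[1.0], [1.0]], [[1.0], [1.0]]]
y = conv2d(x, [sum_filter], [0.0])
print(y)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

The same four filter weights are reused at every output position, which is the weight sharing mentioned in the definition.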

Activation function (typically ReLU): $\mathrm{ReLU}(Y) = \max(0, Y)$ applied element-wise.

Pooling layer (max or average over local windows): downsamples the spatial dimensions. Max pooling with $p \times p$ windows keeps the largest activation in each window, preserving the most salient responses; average pooling keeps the mean of each window.
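A small sketch of the activation and pooling steps on a single-channel feature map. It assumes non-overlapping windows (stride equal to $p$) and a height and width divisible by $p$; neither is stated in the text above, and the function names are illustrative.

```python
def relu(fmap):
    """Element-wise max(0, y) over an H x W single-channel feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, p):
    """Non-overlapping p x p max pooling (assumed stride p, H and W divisible by p)."""
    H, W = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i + di][j + dj] for di in range(p) for dj in range(p))
            for j in range(0, W, p)
        ]
        for i in range(0, H, p)
    ]


fmap = [[-1.0, 2.0, 0.5, -3.0],
        [4.0, -2.0, 1.0, 0.0],
        [-5.0, -6.0, 7.0, 8.0],
        [-1.0, -2.0, 9.0, -4.0]]
activated = relu(fmap)
pooled = max_pool(activated, 2)
print(pooled)  # [[4.0, 1.0], [0.0, 9.0]]
```

The 4 x 4 map shrinks to 2 x 2, and the bottom-left window, which was entirely negative before ReLU, pools to zero.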

Fully-connected layers at the top map the final feature map to class logits.
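To see what weight sharing buys, it can help to count parameters. The sketch below compares a convolutional layer with a fully-connected layer producing an output of the same size; the layer sizes (a 32 x 32 RGB input, 64 output channels, 3 x 3 kernels) are hypothetical values picked only for illustration.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights plus one bias per filter; independent of the input's height and width."""
    return c_out * (k_h * k_w * c_in + 1)


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully-connected layer."""
    return n_out * (n_in + 1)


# Hypothetical sizes: 32 x 32 RGB input mapped to 64 channels at the same resolution.
conv = conv_params(3, 3, 3, 64)
dense = dense_params(32 * 32 * 3, 32 * 32 * 64)
print(conv)   # 1792
print(dense)  # 201392128
```

Under these assumptions the fully-connected layer needs roughly five orders of magnitude more parameters than the convolutional one.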

Standard architectures:

LeNet-5 (LeCun et al. 1998): 5 learnable layers, the original CNN template (conv-pool-conv-pool-FC-FC-FC).
AlexNet (Krizhevsky et al. 2012): 8 layers, ReLU, dropout, GPU training; won the ImageNet ILSVRC 2012 classification challenge.
VGG (Simonyan & Zisserman 2014): 16-19 layers, $3 \times 3$ convs throughout.
GoogLeNet/Inception (Szegedy et al. 2014): parallel paths of different filter sizes; $1 \times 1$ convs for dimensionality reduction.
ResNet (He et al. 2015): residual connections enable training of $\geq 100$-layer networks.
EfficientNet (Tan & Le 2019): joint scaling of depth, width, and resolution.

Receptive field: each unit of layer $l$ depends on a region of the input determined by the cumulative kernel sizes and strides of layers $1, \ldots, l$. Deeper layers have larger receptive fields, allowing the network to integrate increasingly global context.
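The dependence on kernel sizes and strides can be written as a recurrence. Starting from $r_0 = 1$ and $j_0 = 1$, a layer with kernel size $k_l$ and stride $s_l$ gives $r_l = r_{l-1} + (k_l - 1)\, j_{l-1}$ and $j_l = j_{l-1}\, s_l$, where $j_l$ is the spacing, in input pixels, between adjacent units of layer $l$. The sketch below applies this along one axis, ignoring dilation and padding effects; the layer stack is a made-up example.

```python
def receptive_field(layers):
    """Receptive-field size along one axis after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    r, jump = 1, 1  # one input pixel sees itself; adjacent units are 1 pixel apart
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump  # each extra kernel tap reaches `jump` input pixels further
        jump *= s            # striding spreads neighbouring units further apart
        sizes.append(r)
    return sizes


# Hypothetical stack: conv 3x3, conv 3x3, 2x2 pool with stride 2, conv 3x3.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
sizes = receptive_field(stack)
print(sizes)  # [3, 5, 6, 10]
```

Two stacked $3 \times 3$ convolutions already cover a $5 \times 5$ input region, and the stride-2 pooling doubles how far the following convolution reaches.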

Inductive biases of CNNs: translation equivariance (shifting the input shifts the feature map by the same amount, so a filter that detects a feature at one spatial position detects it at any other), locality (each unit sees only a local input region), hierarchy (deeper layers compose features from earlier layers). These biases match natural-image statistics and made CNNs dramatically more sample-efficient than fully-connected networks on vision tasks.
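Translation equivariance can be checked numerically. The sketch below uses a one-dimensional signal for brevity, with a hypothetical edge-detecting filter; shifting the input one step to the right shifts the filter response one step to the right as well.

```python
def conv1d(x, k):
    """Stride-1, no-padding 1-D convolution (kernel not flipped)."""
    return [
        sum(k[m] * x[i + m] for m in range(len(k)))
        for i in range(len(x) - len(k) + 1)
    ]


signal = [0, 0, 1, 3, 1, 0, 0, 0]
edge_filter = [-1, 0, 1]      # illustrative filter, not taken from the text
shifted = [0] + signal[:-1]   # same signal moved one position to the right

y = conv1d(signal, edge_filter)
y_shifted = conv1d(shifted, edge_filter)
print(y)          # [1, 3, 0, -3, -1, 0]
print(y_shifted)  # [0, 1, 3, 0, -3, -1]
```

Apart from the value that enters at the left border, `y_shifted` is `y` moved by one position, which is the equivariance property described above.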

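Post-Transformer developments: Vision Transformers (ViT, 2020) apply self-attention to sequences of image patches, while ConvNeXt (2022) is a pure CNN that adopts Transformer-era design choices. Together they blur the line between CNN and Transformer, and many prominent vision architectures of the 2020s combine ideas from both.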

Related terms: Convolution, AlexNet, ResNet, Vision Transformer, Yann LeCun

Discussed in:

Chapter 11: CNNs, CNNs in Vision
