Convolution, Glossary, Textbook of AI

The convolution operation is the mathematical foundation of every CNN. For a 1D input signal $x \in \mathbb{R}^n$ and kernel $w \in \mathbb{R}^k$, the discrete convolution is

$$(w * x)[i] = \sum_{j=0}^{k-1} w[j] \cdot x[i + j]$$
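The sum above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the kernel is only evaluated where it fits entirely inside the input (the formula itself does not fix the boundary handling); the function names are illustrative and not taken from any library. The second function reverses the kernel first, which is what the signal-processing definition of convolution does.

```python
def cross_correlate_1d(w, x):
    """Sliding dot product: out[i] = sum_j w[j] * x[i + j] (kernel not flipped)."""
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]


def convolve_1d(w, x):
    """Signal-processing convolution: the same sliding sum with the kernel reversed."""
    return cross_correlate_1d(w[::-1], x)


x = [1, 2, 4, 7, 11]
w = [1, 0, -1]
print(cross_correlate_1d(w, x))  # [-3, -5, -7]
print(convolve_1d(w, x))         # [3, 5, 7]
```

For a symmetric kernel the two results coincide; for the antisymmetric kernel used here they differ only in sign.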

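(Strictly, this is cross-correlation: true convolution flips the kernel, summing $w[j] \cdot x[i - j]$. Deep learning frameworks call the unflipped form convolution by convention, and because the kernel is learned the distinction does not change what a layer can represent.) For 2D images $X \in \mathbb{R}^{H \times W}$ and kernel $K \in \mathbb{R}^{k_h \times k_w}$: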

$$(K * X)[i, j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} K[m, n] \cdot X[i+m, j+n]$$
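As a rough sketch of the 2D sum, the following plain-Python function slides a kernel over a small image stored as nested lists. It assumes no padding and stride 1, and the name `cross_correlate_2d` is illustrative.

```python
def cross_correlate_2d(K, X):
    """out[i][j] = sum over m, n of K[m][n] * X[i + m][j + n], no padding, stride 1."""
    kh, kw = len(K), len(K[0])
    H, W = len(X), len(X[0])
    return [
        [
            sum(K[m][n] * X[i + m][j + n] for m in range(kh) for n in range(kw))
            for j in range(W - kw + 1)
        ]
        for i in range(H - kh + 1)
    ]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
K = [[1, 0],
     [0, -1]]  # top-left pixel minus bottom-right pixel
print(cross_correlate_2d(K, X))  # [[-5, -5, -5], [-5, -5, -5]]
```

The output is constant because the image is a linear ramp, so the difference along the diagonal is the same everywhere.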

A convolutional layer applies $C_\mathrm{out}$ different kernels to a multi-channel input $X \in \mathbb{R}^{H \times W \times C_\mathrm{in}}$:

$$Y[i, j, c] = \sum_{m, n, c'} K[m, n, c', c] \cdot X[i \cdot s + m, j \cdot s + n, c'] + b[c]$$

with stride $s$ controlling spatial subsampling. Padding of width $p$ extends each border of the input with zeros (or reflections), giving an output of height $\lfloor (H + 2p - k_h)/s \rfloor + 1$ and width $\lfloor (W + 2p - k_w)/s \rfloor + 1$.
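A minimal sketch of such a layer in plain Python may help make the indices concrete. It assumes zero padding, a channels-last layout (`X[i][j][c']`, `K[m][n][c'][c]`), and a stride that steps the output position, so the input is read at `i * s + m`. The names `conv_output_size` and `conv_layer` are illustrative.

```python
def conv_output_size(size, k, s=1, p=0):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1


def conv_layer(X, K, b, s=1, p=0):
    """X: [H][W][C_in], K: [k_h][k_w][C_in][C_out], b: [C_out]. Zero padding."""
    H, W, c_in = len(X), len(X[0]), len(X[0][0])
    kh, kw, c_out = len(K), len(K[0]), len(K[0][0][0])
    # Build a zero-padded copy with p extra pixels on every side.
    Xp = [[[0.0] * c_in for _ in range(W + 2 * p)] for _ in range(H + 2 * p)]
    for i in range(H):
        for j in range(W):
            Xp[i + p][j + p] = list(X[i][j])
    Ho, Wo = conv_output_size(H, kh, s, p), conv_output_size(W, kw, s, p)
    # Start every output value at its channel's bias, then accumulate.
    Y = [[[b[c] for c in range(c_out)] for _ in range(Wo)] for _ in range(Ho)]
    for i in range(Ho):
        for j in range(Wo):
            for c in range(c_out):
                for m in range(kh):
                    for n in range(kw):
                        for cp in range(c_in):
                            Y[i][j][c] += K[m][n][cp][c] * Xp[i * s + m][j * s + n][cp]
    return Y


X = [[[1.0] for _ in range(5)] for _ in range(5)]                     # 5x5, one channel, all ones
K = [[[[1.0, 2.0] for _ in range(1)] for _ in range(3)] for _ in range(3)]  # 3x3, 1 -> 2 channels
Y = conv_layer(X, K, b=[0.0, 1.0], s=2, p=1)
print(len(Y), len(Y[0]))  # 3 3
print(Y[1][1])            # [9.0, 19.0]  (window fully inside the image)
print(Y[0][0])            # [4.0, 9.0]   (window overlaps the zero padding)
```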

Parameter sharing is the key efficiency: the same kernel applies at every spatial position. This gives translational equivariance (a feature detected at one position is detected at any other) and reduces the parameter count from $O(H W C_\mathrm{in} \cdot H' W' C_\mathrm{out})$ for a fully connected layer producing an $H' \times W' \times C_\mathrm{out}$ output to $O(k_h k_w C_\mathrm{in} C_\mathrm{out})$.
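To give a sense of scale, the sketch below counts parameters for one illustrative case: a 32×32 RGB input mapped to 16 output channels at the same spatial size. It assumes the fully connected layer must produce every output value with its own weights; the sizes and function names are chosen only for illustration.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Weights in a convolutional layer: one k_h x k_w x c_in kernel per output channel."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def dense_params(h, w, c_in, h_out, w_out, c_out, bias=True):
    """Weights in a fully connected layer joining every input value to every output value."""
    n_in, n_out = h * w * c_in, h_out * w_out * c_out
    return n_in * n_out + (n_out if bias else 0)


print(conv_params(3, 3, 3, 16))             # 448
print(dense_params(32, 32, 3, 32, 32, 16))  # 50348032
```

The convolutional count does not depend on the image size at all, which is the practical consequence of sharing the kernel across positions.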

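Convolution can be implemented as matrix multiplication via the im2col rearrangement: the receptive field at every output position is unfolded into a column, the kernel is flattened into a row, and the convolution becomes a matrix product. Libraries such as cuDNN, MIOpen, and oneDNN build on this matrix-multiplication formulation, often implicitly (without materialising the unfolded matrix), and choose between it and alternatives such as direct, Winograd, and FFT-based algorithms depending on the layer shape.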
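The rearrangement can be sketched for a single-channel image in plain Python. This is an illustration of the idea only, not how any particular library lays out memory: patches are stored here as rows of a list rather than columns of a matrix, and the function names are illustrative.

```python
def im2col(X, kh, kw):
    """Unfold every kh x kw patch of X into one flat list (one patch per output position)."""
    H, W = len(X), len(X[0])
    return [
        [X[i + m][j + n] for m in range(kh) for n in range(kw)]
        for i in range(H - kh + 1)
        for j in range(W - kw + 1)
    ]


def conv_via_im2col(K, X):
    """Cross-correlation computed as (flattened kernel) times (matrix of patches)."""
    kh, kw = len(K), len(K[0])
    H, W = len(X), len(X[0])
    k_flat = [K[m][n] for m in range(kh) for n in range(kw)]
    # One dot product per patch: this is the matrix-vector product.
    flat = [sum(a * b for a, b in zip(k_flat, patch)) for patch in im2col(X, kh, kw)]
    wo = W - kw + 1
    # Fold the flat result back into an image.
    return [flat[r * wo:(r + 1) * wo] for r in range(H - kh + 1)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
K = [[1, 0],
     [0, -1]]
print(im2col(X, 2, 2)[0])     # [1, 2, 5, 6]
print(conv_via_im2col(K, X))  # [[-5, -5, -5], [-5, -5, -5]]
```

With several output channels the flattened kernels stack into a matrix, and the whole layer becomes a single matrix-matrix product.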

Computational complexity: a layer costs $O(H' W' k_h k_w C_\mathrm{in} C_\mathrm{out})$ multiply-accumulate operations, where $H' \times W'$ is the output spatial size. For $1 \times 1$ convolutions this reduces to $O(H' W' C_\mathrm{in} C_\mathrm{out})$: pure channel mixing, used heavily in modern architectures (Inception, ResNet bottlenecks, MobileNet).
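As a worked instance of this count, the snippet below compares a 3×3 and a 1×1 layer on a 56×56 output with 64 input and 64 output channels. The sizes are illustrative, and the count ignores bias additions.

```python
def conv_macs(h_out, w_out, k_h, k_w, c_in, c_out):
    """Multiply-accumulate operations for one convolutional layer (bias ignored)."""
    return h_out * w_out * k_h * k_w * c_in * c_out


macs_3x3 = conv_macs(56, 56, 3, 3, 64, 64)
macs_1x1 = conv_macs(56, 56, 1, 1, 64, 64)
print(macs_3x3)              # 115605504
print(macs_1x1)              # 12845056
print(macs_3x3 // macs_1x1)  # 9
```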

Variants include depthwise separable convolution (a depthwise $k \times k$ convolution per channel followed by a $1 \times 1$ pointwise convolution), used in MobileNet; dilated (atrous) convolution (gaps between kernel taps for a larger receptive field), used in DeepLab and WaveNet; and transposed convolution (the gradient of convolution with respect to its input, used for upsampling in segmentation and generative models).
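The three variants can be illustrated with small 1D sketches and a parameter count. These are simplified, hypothetical helpers (no padding, no bias, single channel for the 1D functions), intended only to show what each variant changes.

```python
def standard_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel, then a 1 x 1 convolution mixing channels."""
    return k * k * c_in + c_in * c_out


def dilated_correlate_1d(w, x, d=1):
    """Kernel taps are spaced d samples apart: out[i] = sum_j w[j] * x[i + j * d]."""
    k = len(w)
    span = (k - 1) * d + 1  # input samples covered by the dilated kernel
    return [sum(w[j] * x[i + j * d] for j in range(k)) for i in range(len(x) - span + 1)]


def transposed_correlate_1d(w, y, s=1):
    """Each input value scatters a scaled copy of the kernel into the (longer) output."""
    k = len(w)
    out = [0] * ((len(y) - 1) * s + k)
    for i, v in enumerate(y):
        for j in range(k):
            out[i * s + j] += w[j] * v
    return out


print(standard_params(3, 64, 128))             # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
print(dilated_correlate_1d([1, 0, -1], [1, 2, 4, 7, 11, 16], d=2))  # [-10, -14]
print(transposed_correlate_1d([1, 2], [1, 3], s=2))                 # [1, 2, 3, 6]
```

With stride 2 the transposed operation turns two input values into four output values, which is why it is used for upsampling.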

Interactive

2D convolution, kernel slides over input. A 3×3 kernel sweeps a 9×9 input, filling in a feature map cell by cell.

Stacking convolutions grows the receptive field. A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.
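The growth of the receptive field can be computed with a short recurrence: each layer adds $(k - 1)$ times the product of the strides of the layers before it. The sketch below assumes undilated square kernels, and the function name is illustrative.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field width after each layer."""
    r, jump = 1, 1  # jump = distance in input pixels between adjacent units of the current layer
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        sizes.append(r)
    return sizes


print(receptive_fields([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_fields([(3, 2), (3, 2), (3, 1)]))  # [3, 7, 15]
```

Three stacked 3×3 layers at stride 1 therefore see a 7×7 patch of the input, and striding makes the patch grow faster still.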

Discussed in:

Chapter 11: CNNs, CNNs in Vision
