The convolution operation is the mathematical foundation of every CNN. For a 1D input signal $x \in \mathbb{R}^n$ and kernel $w \in \mathbb{R}^k$, the discrete convolution is
$$(w * x)[i] = \sum_{j=0}^{k-1} w[j] \cdot x[i + j]$$
(Strictly this is cross-correlation; deep learning frameworks call it convolution by convention.) For 2D images $X \in \mathbb{R}^{H \times W}$ and kernel $K \in \mathbb{R}^{k_h \times k_w}$:
$$(K * X)[i, j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} K[m, n] \cdot X[i+m, j+n]$$
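The 2D formula can be traced directly in code. The following is a minimal sketch in plain Python, assuming nested lists for `X` and `K` and no padding or stride; the helper names `cross_correlate2d` and `flip2d` are illustrative, not taken from any library. It also shows the relationship noted above: flipping the kernel along both axes turns the cross-correlation into textbook convolution.

```python
def cross_correlate2d(X, K):
    """Valid-mode 2D cross-correlation of nested lists X (H x W) and K (kh x kw)."""
    H, W = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            # Sum of K[m][n] * X[i+m][j+n] over the kernel window.
            row.append(sum(K[m][n] * X[i + m][j + n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out


def flip2d(K):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in K[::-1]]


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]

corr = cross_correlate2d(X, K)          # what frameworks call "convolution"
conv = cross_correlate2d(X, flip2d(K))  # textbook convolution (kernel flipped)
print(corr)  # [[-4, -4], [-4, -4]]
print(conv)  # [[4, 4], [4, 4]]
```

Because the kernel weights are learned, the flip makes no practical difference to a CNN, which is why the two operations are used interchangeably in deep learning.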
A convolutional layer applies $C_\mathrm{out}$ different kernels to a multi-channel input $X \in \mathbb{R}^{H \times W \times C_\mathrm{in}}$:
$$Y[i, j, c] = \sum_{m, n, c'} K[m, n, c', c] \cdot X[i \cdot s + m, j \cdot s + n, c'] + b[c]$$
with stride $s$ controlling spatial subsampling. Padding of width $p$ extends the input with zeros (or reflections) on each side, so the output has height $H' = \lfloor (H + 2p - k_h)/s \rfloor + 1$ and width $W' = \lfloor (W + 2p - k_w)/s \rfloor + 1$.
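As a rough illustration of the layer and its output size, here is a plain-Python sketch, assuming zero padding, the same stride on both axes, and nested lists in channels-last layout. The names `conv_output_size` and `conv_layer` are hypothetical. The stride multiplies the output index, so output position $(i, j)$ reads the padded input starting at $(i \cdot s, j \cdot s)$.

```python
def conv_output_size(size, k, s=1, p=0):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1


def conv_layer(X, K, b, s=1, p=0):
    """Zero-padded, strided multi-channel layer (deliberately naive loops).

    X: H x W x C_in, K: kh x kw x C_in x C_out, b: C_out (all nested lists).
    """
    H, W, c_in = len(X), len(X[0]), len(X[0][0])
    kh, kw, c_out = len(K), len(K[0]), len(K[0][0][0])

    # Zero-pad p rows and columns on every side (Xp is only read, never mutated).
    zero = [0.0] * c_in
    Xp = [[zero] * (W + 2 * p) for _ in range(p)]
    Xp += [[zero] * p + list(row) + [zero] * p for row in X]
    Xp += [[zero] * (W + 2 * p) for _ in range(p)]

    H_out = conv_output_size(H, kh, s, p)
    W_out = conv_output_size(W, kw, s, p)
    # Start every output entry at its channel's bias.
    Y = [[[b[c] for c in range(c_out)] for _ in range(W_out)] for _ in range(H_out)]
    for i in range(H_out):
        for j in range(W_out):
            for c in range(c_out):
                for m in range(kh):
                    for n in range(kw):
                        for cp in range(c_in):
                            # Stride scales the output index, not the kernel offset.
                            Y[i][j][c] += K[m][n][cp][c] * Xp[i * s + m][j * s + n][cp]
    return Y


# 4 x 4 single-channel input of ones, 3 x 3 kernel of ones, stride 2, padding 1.
X = [[[1.0] for _ in range(4)] for _ in range(4)]
K = [[[[1.0]] for _ in range(3)] for _ in range(3)]
Y = conv_layer(X, K, [0.0], s=2, p=1)
print(Y)  # [[[4.0], [6.0]], [[6.0], [9.0]]]
```

Each output value counts how many real (non-padded) pixels fall under the kernel window, which makes the effect of the zero border easy to see. With a 224-pixel axis, a 7-wide kernel, stride 2, and padding 3, `conv_output_size` gives 112.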
Parameter sharing is the key efficiency: the same kernel applies at every spatial position. This gives translation equivariance (a feature detected at one position is detected anywhere; shifting the input shifts the output correspondingly) and reduces the parameter count from $O(H W C_\mathrm{in} \cdot H' W' C_\mathrm{out})$ for a fully-connected layer producing an $H' \times W' \times C_\mathrm{out}$ output to $O(k_h k_w C_\mathrm{in} C_\mathrm{out})$, independent of the image size.
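To put numbers on the saving, the sketch below counts parameters for one illustrative configuration: a $32 \times 32 \times 3$ input mapped to a $32 \times 32 \times 64$ output. The sizes and the function names `conv_params` and `dense_params` are assumptions chosen for the example, and the dense count assumes one weight per input-output pair.

```python
def conv_params(kh, kw, c_in, c_out, bias=True):
    """Weights in a convolutional layer: one kh x kw x c_in kernel per output channel."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)


def dense_params(h, w, c_in, h_out, w_out, c_out, bias=True):
    """Weights in a fully-connected layer between the flattened input and output."""
    n_in = h * w * c_in
    n_out = h_out * w_out * c_out
    return n_in * n_out + (n_out if bias else 0)


conv = conv_params(3, 3, 3, 64)
dense = dense_params(32, 32, 3, 32, 32, 64)
print(conv)   # 1792
print(dense)  # 201392128
```

The convolutional count does not change if the image grows, whereas the dense count grows with the product of input and output sizes.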
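Convolution can be implemented as matrix multiplication via the im2col rearrangement: the receptive field at every position is unfolded into a column, the kernel is flattened into a row, and the convolution becomes a matrix product. Libraries such as cuDNN, MIOpen, and oneDNN include GEMM-based algorithms built on this idea (often in an implicit form that never materializes the unfolded matrix), and select among them and alternatives such as Winograd, FFT-based, and direct convolution depending on the problem shape.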
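The rearrangement can be sketched for a single-channel image in plain Python, assuming no padding and stride 1. The functions `im2col` and `conv_via_im2col` are illustrative helpers, not a library API, and real implementations operate on contiguous arrays rather than nested lists.

```python
def im2col(X, kh, kw):
    """Unfold each kh x kw window of a 2D list X into one column."""
    H, W = len(X), len(X[0])
    patches = [
        [X[i + m][j + n] for m in range(kh) for n in range(kw)]
        for i in range(H - kh + 1)
        for j in range(W - kw + 1)
    ]
    # Transpose so each patch is a column: (kh * kw) rows, one column per output position.
    return [list(col) for col in zip(*patches)]


def conv_via_im2col(X, K):
    """Cross-correlation computed as (flattened kernel row) x (im2col matrix)."""
    kh, kw = len(K), len(K[0])
    cols = im2col(X, kh, kw)
    k_row = [K[m][n] for m in range(kh) for n in range(kw)]
    # One dot product per column, i.e. per output position.
    flat = [sum(k * c for k, c in zip(k_row, col)) for col in zip(*cols)]
    W_out = len(X[0]) - kw + 1
    # Fold the flat result back into the output grid.
    return [flat[r:r + W_out] for r in range(0, len(flat), W_out)]


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 2],
     [3, 4]]

cols = im2col(X, 2, 2)
result = conv_via_im2col(X, K)
print(result)  # [[37, 47], [67, 77]]
```

With $C_\mathrm{out}$ kernels the flattened kernels stack into a matrix with one row per output channel, so the whole layer becomes a single matrix-matrix product. The cost is memory: each input pixel is copied into up to $k_h k_w$ columns.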
Computational complexity: $O(H' W' k_h k_w C_\mathrm{in} C_\mathrm{out})$ where $H' \times W'$ is the output spatial size. For $1 \times 1$ convolutions this reduces to $O(H' W' C_\mathrm{in} C_\mathrm{out})$, pure channel-mixing, used heavily in modern architectures (Inception, ResNet bottlenecks, MobileNet).
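The complexity expression translates directly into a multiply-accumulate (MAC) count. The sketch below compares a $3 \times 3$ and a $1 \times 1$ convolution on an illustrative $56 \times 56$ output with 64 input and 64 output channels; these sizes and the function name `conv_macs` are assumptions made for the example.

```python
def conv_macs(h_out, w_out, kh, kw, c_in, c_out):
    """Multiply-accumulates for one forward pass of a standard convolution."""
    return h_out * w_out * kh * kw * c_in * c_out


macs_3x3 = conv_macs(56, 56, 3, 3, 64, 64)
macs_1x1 = conv_macs(56, 56, 1, 1, 64, 64)
print(macs_3x3)              # 115605504
print(macs_1x1)              # 12845056
print(macs_3x3 // macs_1x1)  # 9
```

The factor of 9 is exactly $k_h k_w$, which is why bottleneck designs use $1 \times 1$ layers to shrink the channel count before applying the more expensive spatial kernel.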
Variants include depthwise separable convolution (a depthwise $k \times k$ convolution per channel followed by a $1 \times 1$ pointwise convolution) used in MobileNet, dilated/atrous convolution (gaps in the kernel for a larger receptive field) used in DeepLab and WaveNet, and transposed convolution (the adjoint of convolution, equivalent to its gradient with respect to the input, used for upsampling in segmentation and generative models).
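Two of these variants are easy to illustrate in a few lines. The sketch below, with hypothetical helper names and example sizes, compares the weight count of a standard and a depthwise separable layer (biases ignored) and implements a 1D dilated cross-correlation, in which a kernel of length $k$ with dilation $d$ spans $(k - 1) d + 1$ input samples.

```python
def standard_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no biases)."""
    return k * k * c_in * c_out


def separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv plus a 1 x 1 pointwise conv (no biases)."""
    return k * k * c_in + c_in * c_out


def dilated_correlate1d(x, w, d=1):
    """1D cross-correlation with dilation d: kernel taps are spaced d samples apart."""
    span = (len(w) - 1) * d + 1  # effective kernel length
    return [sum(w[j] * x[i + j * d] for j in range(len(w)))
            for i in range(len(x) - span + 1)]


print(standard_params(3, 128, 256))   # 294912
print(separable_params(3, 128, 256))  # 33920

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(dilated_correlate1d(x, w, d=1))  # [-2, -2, -2, -2, -2]
print(dilated_correlate1d(x, w, d=2))  # [-4, -4, -4]
```

With dilation 2 the same three weights compare samples four positions apart instead of two, widening the receptive field without adding parameters.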
Related terms: Convolutional Neural Network, Matrix Multiplication
Discussed in:
- Chapter 11: CNNs, CNNs in Vision