The convolution operation, in the context of neural networks, slides a small learnable kernel (or filter) across spatial positions of an input and computes, at each position, the sum of element-wise products between the kernel and the underlying input patch. The result is a feature map in which each entry indicates how strongly the local input patch matches the pattern encoded by the kernel.
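The sliding-window computation can be sketched in a few lines of NumPy. This is a naive, illustrative implementation (single channel, no stride or padding, and technically cross-correlation, which is what deep-learning frameworks compute under the name "convolution"); real frameworks use heavily optimised routines instead of explicit loops.

```python
import numpy as np

def conv2d(x, k):
    """Naive 2D 'convolution' (cross-correlation): at each position,
    sum the element-wise products of the kernel and the input patch."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input
k = np.ones((3, 3)) / 9.0                     # 3x3 mean filter
y = conv2d(x, k)
print(y)  # 2x2 feature map of local averages
```

With no padding, a 3×3 kernel over a 4×4 input yields a 2×2 feature map, since the kernel fits in only two positions along each axis.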
Convolution embodies two key structural assumptions: local connectivity (each output depends only on a small neighbourhood of the input) and weight sharing (the same kernel is applied at every spatial position). Together these dramatically reduce parameter count: a 3×3 kernel on a $C$-channel input has just $9C + 1$ parameters per output channel (nine weights per input channel plus one bias), regardless of spatial resolution, compared to millions for a fully connected layer on a moderately sized image.
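The parameter-count gap is easy to verify with arithmetic. The sketch below assumes a 224×224 RGB input and a single output channel/unit, purely for illustration:

```python
# Parameter counts for one output channel of a 3x3 convolution
# versus one unit of a fully connected layer on a 224x224 RGB image.
C, H, W = 3, 224, 224

conv_params = 3 * 3 * C + 1   # 9C weights + 1 bias = 28
fc_params = C * H * W + 1     # one weight per input value + 1 bias = 150529

print(conv_params, fc_params)
```

The convolution's count is independent of `H` and `W`; the fully connected layer's count grows with the resolution, and multiplying by hundreds of output units quickly reaches tens of millions of parameters.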
Several design choices govern convolution behaviour. Stride determines how far the kernel advances between positions (stride 2 halves the spatial resolution). Padding adds values (usually zeros) around the border to control output size ("same" keeps resolution unchanged, "valid" uses no padding). Dilation inserts gaps between kernel elements, enlarging the receptive field without adding parameters. Depthwise separable convolution factorises a standard convolution into a channel-wise spatial convolution followed by a 1×1 pointwise convolution, dramatically reducing compute—the basis of efficient architectures like MobileNet.
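These design choices can be made concrete with the standard output-size formula and a rough compute comparison. The helper name and the example channel counts below are illustrative assumptions, not from the original text:

```python
def conv_out_size(n, k, stride=1, padding=1, dilation=1):
    """Output spatial size along one dimension of a convolution.
    Dilation enlarges the kernel's effective extent without adding weights."""
    effective_k = dilation * (k - 1) + 1
    return (n + 2 * padding - effective_k) // stride + 1

print(conv_out_size(32, 3, padding=1))             # "same" padding keeps 32
print(conv_out_size(32, 3, stride=2, padding=1))   # stride 2 halves it to 16
print(conv_out_size(32, 3, padding=0, dilation=2)) # dilated 3x3 acts like a 5x5

# Multiply-accumulates per output position: standard vs depthwise separable
C_in, C_out, k = 64, 128, 3
standard = k * k * C_in * C_out           # one k x k filter per output channel
separable = k * k * C_in + C_in * C_out   # depthwise spatial + 1x1 pointwise
print(standard / separable)               # roughly 8x fewer operations here
```

The separable factorisation trades a small accuracy cost for a compute reduction that grows with `C_out`, which is why it anchors mobile-oriented architectures.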
Related terms: Convolutional Neural Network, Pooling
Discussed in:
- Chapter 11: CNNs — Convolution
Also defined in: Textbook of AI