A 3×3 kernel sweeps a 9×9 input, filling in a feature map cell by cell.
From Chapter 11: CNNs
Glossary: convolution, kernel, feature map, translation equivariance
People: Yann LeCun, Kunihiko Fukushima
Transcript
A convolution is the core operation of an image-processing network. A small filter, the kernel, slides over the input image, multiplying and summing as it goes.
The input is a nine-by-nine image. The kernel is three by three. The output is a seven-by-seven feature map.
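The seven-by-seven size is no accident: with no padding and a stride of one, the kernel can occupy input − kernel + 1 positions along each axis. A one-line sketch of that rule (the function name is ours, not from the chapter):

```python
def valid_output_size(input_size: int, kernel_size: int) -> int:
    # "Valid" convolution: no padding, stride 1.
    return input_size - kernel_size + 1

print(valid_output_size(9, 3))  # 9 - 3 + 1 = 7
```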
The blue rectangle highlights the patch the kernel is currently looking at. Multiply kernel by patch pointwise, sum the values, and the result becomes one cell of the output.
The kernel sweeps left to right, top to bottom, filling in the entire output map.
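The sweep described above can be written as two nested loops. A minimal NumPy sketch (strictly speaking this computes cross-correlation, the kernel is not flipped, which is also what deep-learning libraries call "convolution"); the averaging kernel here is just an illustrative stand-in:

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image (stride 1, no padding),
    multiplying pointwise and summing at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):            # top to bottom
        for j in range(ow):        # left to right
            patch = image[i:i + kh, j:j + kw]   # the highlighted patch
            out[i, j] = np.sum(patch * kernel)  # pointwise multiply, then sum
    return out

image = np.arange(81, dtype=float).reshape(9, 9)  # illustrative 9×9 input
kernel = np.ones((3, 3)) / 9.0                    # simple 3×3 averaging kernel
print(convolve2d(image, kernel).shape)  # (7, 7)
```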
This particular kernel detects horizontal edges. The output lights up wherever the input changes vertically, dark to light or light to dark.
A different kernel reveals different features. Same input, same algorithm, but a vertical-edge detector picks out a different pattern.
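The transcript doesn't specify which kernels the animation uses; the Sobel filters are a common concrete choice that behaves as described. On a synthetic dark-to-light image, the horizontal-edge kernel fires along the edge while its transpose stays silent:

```python
import numpy as np

# Sobel filters as stand-ins for the animation's kernels (an assumption).
sobel_horizontal = np.array([[-1., -2., -1.],
                             [ 0.,  0.,  0.],
                             [ 1.,  2.,  1.]])   # responds to vertical intensity changes
sobel_vertical = sobel_horizontal.T              # same weights rotated 90 degrees

# A horizontal dark-to-light edge: top half 0, bottom half 1.
image = np.zeros((9, 9))
image[4:, :] = 1.0

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                      for j in range(ow)] for i in range(oh)])

h = convolve2d(image, sobel_horizontal)  # nonzero only along the edge rows
v = convolve2d(image, sobel_vertical)    # all zeros: no vertical edges here
```

Same input, same sliding-window algorithm; only the nine numbers in the kernel differ.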
A convolutional network learns thousands of these kernels automatically, by gradient descent. Layer by layer, they build up edges, then textures, then object parts, then entire objects.
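A toy illustration of that learning step (entirely our construction, not the chapter's): treat the nine kernel weights as parameters and nudge them downhill on a mean-squared error. Here gradient descent recovers a hidden target kernel from random patches and their responses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "target" kernel the network should discover (a vertical-edge shape).
target = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])

# Training data: random 3×3 patches and the target's response to each.
patches = rng.standard_normal((500, 3, 3))
responses = np.einsum('nij,ij->n', patches, target)

kernel = np.zeros((3, 3))   # start from nothing
lr = 0.1
for _ in range(300):
    pred = np.einsum('nij,ij->n', patches, kernel)
    err = pred - responses
    # Gradient of the mean-squared error with respect to the kernel weights.
    grad = np.einsum('n,nij->ij', err, patches) / len(patches)
    kernel -= lr * grad     # one gradient-descent step

# kernel now closely matches target
```

A real network learns many kernels per layer jointly, through backpropagation, but each weight update has exactly this shape.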