Stacking convolutions grows the receptive field, Textbook of AI

A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.

Glossary: receptive field, convolution, pooling, cnn

Transcript

A convolutional neural network builds an understanding of the image one layer at a time. The patch of the input image that can influence a single pixel in a layer is called that pixel's receptive field.

Here is the input image. A single pixel in the first convolutional layer sees only a three by three patch of the input.

Stack a second three by three convolution. Now a pixel in the second layer sees a five by five patch of the input, because each of its inputs already saw a three by three patch.
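To make the argument concrete, here is a minimal sketch along one axis of the image. It assumes stride-one convolutions and ignores image borders; the function name `input_indices` is made up for this illustration. It collects the input positions that can reach a single pixel after a given number of layers.

```python
def input_indices(position, num_layers, kernel_size=3):
    """Input positions (one axis) that can influence the pixel at `position`.

    Assumes stride-one convolutions with an odd kernel size, no borders.
    """
    half = kernel_size // 2
    indices = {position}
    for _ in range(num_layers):
        # Each pixel reads its own position plus `half` neighbours on each side.
        indices = {i + offset for i in indices for offset in range(-half, half + 1)}
    return sorted(indices)


print(input_indices(10, 1))  # three positions: 9, 10, 11
print(input_indices(10, 2))  # five positions: 8 through 12
```

The same count applies along the other axis, which is where the five by five patch comes from.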

After a third layer, the receptive field has grown to seven by seven. Each extra three by three layer with stride one adds two pixels to the side length, so after ten layers it is twenty-one by twenty-one.
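The numbers above follow a simple pattern, sketched here under the assumption that every layer uses the same square kernel with stride one and no dilation. The helper name `stacked_receptive_field` is chosen only for this example.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Side length of the receptive field after stacking stride-one convolutions."""
    # A single input pixel has a field of 1; each layer adds (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)


for n in (1, 2, 3, 10):
    side = stacked_receptive_field(n)
    print(f"{n} layer(s): {side} by {side}")
```

For one, two, three and ten layers this gives side lengths of 3, 5, 7 and 21, matching the figures in the narration.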

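Pooling layers grow the receptive field much faster. A max-pool of size two with stride two halves the spatial map, so neighbouring pixels in its output sit twice as far apart in the input. Every layer above the pool therefore adds twice as many input pixels to the receptive field as it would without the pool.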

Strided convolutions do the same job: they down-sample the spatial map and expand the field of view.
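The effect of pooling and striding can be worked out with a small running tally. This is a sketch, not a library routine: it assumes each layer is described only by a kernel size and a stride, ignores padding and dilation, and uses the made-up name `receptive_field`. The variable `jump` tracks how many input pixels separate two neighbouring pixels in the current layer.

```python
def receptive_field(layers):
    """Receptive field side length for a stack of (kernel_size, stride) layers.

    Layers are listed from the input upward. Pooling counts as a layer too.
    """
    field = 1  # a single input pixel sees itself
    jump = 1   # spacing, in input pixels, between neighbouring pixels
    for kernel_size, stride in layers:
        # The kernel spans (kernel_size - 1) steps, each worth `jump` input pixels.
        field += (kernel_size - 1) * jump
        # Down-sampling spreads the next layer's pixels further apart.
        jump *= stride
    return field


conv = (3, 1)          # three by three convolution, stride one
pool = (2, 2)          # max-pool of size two, stride two
strided_conv = (3, 2)  # three by three convolution, stride two

print(receptive_field([conv, conv, conv]))        # 7
print(receptive_field([conv, pool, conv, conv]))  # 12
print(receptive_field([strided_conv, conv]))      # 7
```

In the second stack the two convolutions above the pool each add four input pixels instead of two, which is the faster growth described in the narration.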

This is why a deep convolutional network can recognise an entire object from a small set of pixels at the top: the receptive field of those pixels covers most of the image.

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).