A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.
From the chapter: Chapter 11: CNNs
Glossary: receptive field, convolution, pooling, cnn
Transcript
A convolutional neural network builds an understanding of the image one layer at a time. Each layer's view of the input is called its receptive field.
Here is the input image. A single pixel in the first convolutional layer sees only a three by three patch of the input.
Stack a second three by three convolution. Now a pixel in the second layer sees a five by five patch of the input, because each of its inputs already saw a three by three patch.
After a third layer, the receptive field has grown to seven by seven. After ten layers it is twenty-one by twenty-one.
Pooling layers grow the receptive field much faster. A single max-pool of size two doubles the receptive field of every layer above it.
Strided convolutions do the same job: they down-sample the spatial map and expand the field of view.
This is why a deep convolutional network can recognise an entire object from a small set of pixels at the top: the receptive field of those pixels covers most of the image.