A two-by-two patch reduces to one number. Max takes the largest, average takes the mean.
From the chapter: Chapter 11: CNNs
Glossary: pooling, max pooling
Transcript
A feature map from a convolutional layer. A grid of activations.
Pooling reduces the spatial size while keeping the most salient information.
Take a two-by-two patch. Replace it with a single number. Slide the patch across the grid with stride two and you have shrunk the map by half in each dimension. Four pixels become one.
Max pooling takes the largest value in the patch. The strongest activation wins. Small spatial shifts in the input do not change the output much, because the maximum tends to stay the maximum.
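A minimal NumPy sketch of that mechanic; the function name and the toy input are illustrative, not from the transcript. The shift in the example stays inside one pooling window, which is exactly why the pooled output does not move:

```python
import numpy as np

def max_pool_2x2(x):
    """Collapse each 2x2 patch to its largest value (stride two)."""
    h, w = x.shape  # assumes even height and width
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.zeros((4, 4))
x[1, 2] = 1.0                 # one strong activation
y = np.roll(x, 1, axis=1)     # the same activation, shifted one pixel right

print(max_pool_2x2(x).shape)  # (2, 2): four pixels became one
print(np.array_equal(max_pool_2x2(x), max_pool_2x2(y)))  # True: the max stayed the max
```

A shift that crosses a window boundary would move the pooled maximum, so the invariance is small and local.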
Average pooling takes the mean. Smoother, less peaky. Common in the very last layer of a network as global average pooling, where you want to summarise an entire feature map into a single descriptor.
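The same reshape trick gives average pooling, and the whole-map mean stands in for that last-layer summary; again a NumPy sketch with made-up numbers:

```python
import numpy as np

def avg_pool_2x2(x):
    """Collapse each 2x2 patch to its mean (stride two)."""
    h, w = x.shape  # assumes even height and width
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(avg_pool_2x2(fmap))  # [[2.5, 4.5], [10.5, 12.5]]: smoother, less peaky
print(fmap.mean())         # 7.5: the whole map summarised as one descriptor
```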
Pooling is parameter-free. There is nothing to learn.
It does two things at once: shrinks the activations, saving memory and computation, and induces a small amount of translation invariance.
Modern architectures sometimes skip pooling, replacing it with strided convolutions that learn the downsampling. The effect is similar.
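A sketch of that trade in PyTorch, which the transcript never names: a fixed two-by-two max pool beside a three-by-three convolution with stride two. Both halve the spatial size; only the convolution carries weights, which also makes the parameter-free claim above concrete. The shapes and channel count are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)  # a batch of 16 feature maps

pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to the kernel size
conv = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)  # learned downsampling

print(pool(x).shape)  # torch.Size([1, 16, 16, 16])
print(conv(x).shape)  # torch.Size([1, 16, 16, 16]): same halving
print(sum(p.numel() for p in pool.parameters()))  # 0: nothing to learn
print(sum(p.numel() for p in conv.parameters()))  # 2320 learnable weights
```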
When you visualise a CNN's receptive fields, pooling is much of what makes them grow. Each stride-two pool roughly doubles the patch of input that a deep neuron can see.
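A back-of-the-envelope check using the standard receptive-field recurrence (grow by the kernel size minus one times the current step, then multiply the step by the stride); the stack of three-by-three convs alternating with two-by-two pools is an assumed example, not from the transcript:

```python
# Each layer is (name, kernel, stride); an assumed, typical stack.
layers = [("conv", 3, 1), ("pool", 2, 2)] * 3

rf, step = 1, 1  # receptive field and input pixels per output step
for name, k, s in layers:
    rf += (k - 1) * step
    step *= s
    print(f"after {name}: a neuron sees {rf}x{rf} input pixels")
```

The conv receptive fields come out 3, 8, 18: each pool doubles the step, so every later layer's contribution doubles with it.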