Pooling, Glossary, Textbook of AI

Pooling layers reduce the spatial dimensions of feature maps by summarising small neighbourhoods with a single statistic. The two most common variants are max pooling, which selects the maximum value within each region, and average pooling, which computes the mean. A typical configuration uses a 2×2 window with stride 2, which halves the height and width of the feature map and so reduces the number of spatial positions by a factor of four.
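
The 2×2, stride-2 configuration can be made concrete with a minimal NumPy sketch. The function name pool2d and the example values are illustrative assumptions, not taken from any particular library; the loop is written for clarity rather than speed.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2-D feature map with a square window and no padding.

    Hypothetical helper for illustration only.
    """
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            # Slice out the window whose top-left corner is (i*stride, j*stride).
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

feature_map = np.array([[1, 3, 2, 0],
                        [5, 4, 1, 1],
                        [0, 2, 9, 6],
                        [1, 1, 7, 8]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")   # [[5, 2], [2, 9]]
avg_pooled = pool2d(feature_map, mode="mean")  # [[3.25, 1.0], [1.0, 7.5]]
```

The 4×4 input has 16 positions and each 2×2 output has 4, which is the factor-of-four reduction described above.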

Max pooling is the more widely used variant in classification networks. It retains the strongest activation in each local region, selecting the most prominent feature regardless of its exact location. This provides a degree of local translation invariance: small shifts that move a feature within the same pooling window do not change the output. This property is desirable for classification, where one cares what features are present but not precisely where.
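
The local invariance, and its limit, can be checked with a small sketch. The helper max_pool_2x2 and the toy inputs are assumptions made for illustration; the helper assumes even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (hypothetical helper, even-sized input)."""
    h, w = x.shape
    # Split each axis into (block index, position inside block), then take
    # the maximum over the two inside-block axes.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A single active feature at three different positions.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0   # moved, but still inside the same window
c = np.zeros((4, 4)); c[1, 2] = 1.0   # moved across a window boundary

same_window = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))  # True
crossed = np.array_equal(max_pool_2x2(a), max_pool_2x2(c))      # False
```

The output is unchanged only while the feature stays inside one pooling window; once it crosses a boundary the pooled map changes, which is why the invariance is described as local.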

In modern architectures, global average pooling (GAP) has largely replaced flattening plus dense layers before the final classifier. GAP averages each channel across the entire spatial extent, producing a vector of length equal to the number of channels. Because it has no weights of its own and greatly shrinks the input to the classifier, it eliminates many parameters and acts as a regulariser. Pooling is not without controversy: some researchers argue that it discards valuable spatial information and advocate replacing it with strided convolutions, which downsample while retaining learnable parameters. All-convolutional networks have shown that pooling layers can be replaced in this way with little or no loss of accuracy on some image classification benchmarks.
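
A rough sketch of GAP and of the parameter saving follows. The feature-map size (512 channels, 7×7) and the 1000-class classifier are assumed values chosen for illustration, and the counts cover only a single dense layer with biases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed final feature map, laid out as (channels, height, width).
features = rng.standard_normal((512, 7, 7))

# Global average pooling: one mean per channel.
gap = features.mean(axis=(1, 2))   # shape (512,)

num_classes = 1000  # assumed classifier size

# Weights plus biases of one dense layer fed by the flattened feature map.
flatten_params = features.size * num_classes + num_classes   # 25,089,000

# Weights plus biases of one dense layer fed by the GAP vector.
gap_params = gap.size * num_classes + num_classes            # 513,000
```

Under these assumptions the classifier shrinks by a factor of roughly 49, the number of spatial positions that GAP averages away.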

Stacking convolutions grows the receptive field. A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.

Max pooling and average pooling. A two by two patch reduces to one number. Max takes the largest, average takes the mean.

Related terms: Convolutional Neural Network, Convolution

Discussed in:

Chapter 11: CNNs, Pooling
