11.2 Pooling

Convolution leaves the spatial grid largely intact. If we feed a $224 \times 224$ image through a stack of unit-stride convolutions, the deeper layers still hold $224 \times 224$ activations per channel, which is wasteful in compute and memory and never lets a single deep unit "see" most of the image. We therefore need a complementary operation whose explicit job is to shrink the spatial grid. That operation is pooling.

Pooling layers take a feature map and aggregate the values within small windows, typically $2 \times 2$, into a single output value per window. The two classical choices are to take the maximum (max pooling) or the arithmetic mean (average pooling). Pooling has no learnable parameters: it is a fixed reduction. It does three useful things at once. It reduces the spatial resolution, often by a factor of two in each direction, which quarters the number of activations and so quarters the cost of every downstream layer. It introduces a small amount of translation invariance, because shifting the input by a pixel or two within the pooling window leaves the output untouched. And, in concert with stacked convolutions, it grows the receptive field of the deeper units, so that by the top of the network each unit summarises a large patch of the input.

Modern practice has partly moved away from pooling. Many architectures replace some or all pooling layers with strided convolutions, which downsample and learn the reduction at the same time. Even so, you will still meet max pooling early in many networks, average pooling in segmentation models, and global average pooling at the top of almost every modern classifier. This section walks through each of these, in the order in which you tend to meet them.

Symbols Used Here
$\mathbf{X}$: input feature map (height $H$, width $W$, $C$ channels)
$\mathbf{Y}$: output feature map after pooling
$k$: pool window size (typically $k = 2$)
$s$: stride between windows (typically $s = k$ for non-overlapping pooling)

Max pooling

Max pooling slides a $k \times k$ window over the feature map with stride $s$ and replaces each window by the largest activation it contains. With $s = k$ the windows tile the input with no overlap, so a $k \times k$ pool with stride $k$ takes an $H \times W$ map down to roughly $H/k \times W/k$. The operation runs independently on each channel: the channel dimension is untouched.

The motivation is feature detection. A convolutional channel typically learns to fire strongly in the presence of a particular pattern (an oriented edge, a blob, a piece of texture), and weakly elsewhere. The maximum within a window is therefore a faithful summary: "did this feature appear anywhere in this neighbourhood?" If we shift the input by a pixel, the strongest activation usually still falls inside the same pooling window, so the output is unchanged. This is the sense in which pooling buys translation invariance. It is local, modest invariance: a $2 \times 2$ window forgives shifts of at most one pixel, but several stacked $2 \times 2$ pools compound into much larger tolerances.

A small worked example makes the operation concrete. Take a $4 \times 4$ feature map and a $2 \times 2$ max pool with stride 2:

$$ \mathbf{X} = \begin{bmatrix} 1 & 3 & 2 & 4 \\ 5 & 6 & 7 & 8 \\ 3 & 2 & 1 & 0 \\ 1 & 2 & 3 & 4 \end{bmatrix}, \qquad \mathbf{Y}_{\max} = \begin{bmatrix} 6 & 8 \\ 3 & 4 \end{bmatrix}. $$

The top-left output, 6, is the largest entry in the top-left $2 \times 2$ block; the other three follow the same recipe. Stride 2 means the windows do not overlap, so each input entry contributes to exactly one output entry.
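To make the recipe concrete in code, here is a minimal sketch, assuming PyTorch (any array library would do equally well). The tensor is reshaped into the (batch, channel, height, width) layout that `max_pool2d` expects.

```python
import torch
import torch.nn.functional as F

# The 4x4 feature map from the worked example, in (N, C, H, W) layout.
X = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# 2x2 max pool with stride 2: non-overlapping windows, one output per window.
Y_max = F.max_pool2d(X, kernel_size=2, stride=2)
print(Y_max.squeeze())
# tensor([[6., 8.],
#         [3., 4.]])
```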

Max pooling is the default in classical classification networks (LeNet, AlexNet, VGG) precisely because of this combination of strong-feature preservation and small translation invariance. During backpropagation, the gradient flows only through the position of the maximum: the "winner takes all", and the other positions in the window receive no gradient. This is a form of implicit feature selection inside the network and is one of the reasons max pooling tends to produce sharply tuned filters.
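This winner-takes-all routing can be observed directly. In the sketch below (again an illustrative PyTorch example rather than part of the derivation), an upstream gradient of one is sent through every pooled output; only the four argmax positions of the input receive any gradient.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4).requires_grad_(True)

Y = F.max_pool2d(X, kernel_size=2, stride=2)
Y.sum().backward()            # send a gradient of 1 through every pooled output

# Gradient reaches only the position of the maximum in each window.
print(X.grad.squeeze())
# tensor([[0., 0., 0., 0.],
#         [0., 1., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 0., 0., 1.]])
```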

Average pooling

Average pooling replaces each window by its arithmetic mean rather than its maximum. The output is a smooth, low-pass version of the input: peaky activations are blunted, and the operation behaves rather like a downsampled blur. Where max pooling asks "did this feature appear?", average pooling asks "how strongly, on average, was this feature present?".

For the same $4 \times 4$ map and $2 \times 2$ stride-2 window,

$$ \mathbf{Y}_{\text{avg}} = \begin{bmatrix} \tfrac{1+3+5+6}{4} & \tfrac{2+4+7+8}{4} \\[2pt] \tfrac{3+2+1+2}{4} & \tfrac{1+0+3+4}{4} \end{bmatrix} = \begin{bmatrix} 3.75 & 5.25 \\ 2.00 & 2.00 \end{bmatrix}. $$
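Running the same matrix through an average pool reproduces this arithmetic; the sketch again assumes PyTorch.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# 2x2 average pool with stride 2: each output is the mean of its window.
Y_avg = F.avg_pool2d(X, kernel_size=2, stride=2)
print(Y_avg.squeeze())
# tensor([[3.7500, 5.2500],
#         [2.0000, 2.0000]])
```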

In classification networks the early layers tend to use max pooling, because preserving the strongest evidence is what we want when downstream layers must decide whether a feature was present. Average pooling shows up more often in two settings. The first is segmentation and dense-prediction architectures, where averaging produces less aliased outputs that are easier to upsample later. The second, and more important, is at the very top of a classifier, where we use it in a slightly extreme form known as global average pooling.

The backward pass for average pooling is symmetric: every input position in the window receives the same $1/k^2$ share of the upstream gradient. There is no winner-takes-all behaviour, which is one reason average pooling tends to give smoother, less specialised feature maps than max pooling at equivalent depth.
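This symmetric routing is also easy to verify: with a $2 \times 2$ window, every input position receives a quarter of the upstream gradient (a PyTorch sketch, for illustration).

```python
import torch
import torch.nn.functional as F

X = torch.ones(1, 1, 4, 4, requires_grad=True)
Y = F.avg_pool2d(X, kernel_size=2, stride=2)
Y.sum().backward()            # upstream gradient of 1 at every pooled output

# Each input position receives 1/k^2 = 0.25 of the gradient.
print(X.grad.squeeze())
# tensor([[0.2500, 0.2500, 0.2500, 0.2500],
#         [0.2500, 0.2500, 0.2500, 0.2500],
#         [0.2500, 0.2500, 0.2500, 0.2500],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
```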

Global average pooling

By the time a deep classifier has reduced the spatial size to perhaps $7 \times 7$ with several hundred channels, we have to convert this volume into a flat vector to feed into the final classifier. The classical recipe (LeNet, AlexNet, VGG) was to flatten the volume into a long vector and pass it through one or two fully connected layers. These were enormous: VGG-16 has 138 million parameters in total, of which around 123 million live in three fully connected layers at the top. The fully connected layers were also fragile, prone to overfitting, and forced a fixed input size.

Global average pooling (GAP), introduced by Lin, Chen and Yan in their 2014 Network in Network paper, replaces this whole stack with one extreme average pool: take the mean over the entire spatial extent of each channel. A $7 \times 7 \times 512$ activation volume becomes a $512$-dimensional vector, one number per channel. A single linear layer on top maps this vector to the class scores, contributing only $C \times \text{(number of classes)}$ parameters.

GAP is now the standard top-of-network choice in nearly every modern CNN (GoogLeNet, ResNet, DenseNet, EfficientNet, ConvNeXt) for three reasons. First, it eliminates a huge mass of parameters and the overfitting risk they bring, often without harming accuracy. Second, it imposes a useful interpretive bias: each channel must now correspond to a meaningful, spatially averaged feature, because that single average is its only contribution to the final decision. Third, it removes the fixed-input-size constraint of fully connected heads. A GAP-headed network can in principle accept inputs of any size, because the spatial dimensions are reduced to a single number per channel regardless of how many positions they originally held. This property is essential for fully convolutional segmentation networks and for variable-resolution training.
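A minimal sketch of a GAP head, assuming PyTorch and an arbitrary class count of 1000, shows both the per-channel reduction and the size-agnostic behaviour:

```python
import torch
import torch.nn as nn

num_classes = 1000                       # arbitrary, for illustration
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),             # (N, 512, H, W) -> (N, 512, 1, 1)
    nn.Flatten(),                        # -> (N, 512)
    nn.Linear(512, num_classes),         # 512 * 1000 + 1000 = 513,000 parameters
)

# The same head accepts any spatial size: GAP reduces it to one value per channel.
for h, w in [(7, 7), (14, 14)]:
    x = torch.randn(2, 512, h, w)
    print(gap_head(x).shape)             # torch.Size([2, 1000]) in both cases
```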

Strided convolutions as alternative

Pooling is not the only way to shrink a feature map. A convolutional layer with stride $s > 1$ also downsamples: it slides its kernel across the input in steps of $s$, producing one output for every $s$-th position. A stride-2 convolution therefore halves height and width in exactly the same way as a $2 \times 2$ stride-2 pool, but with two crucial differences. First, the reduction is learned, because the kernel weights are trainable and can adapt to the data. Second, the reduction can mix channels, because a convolution combines values across the input channel dimension where pooling does not.

Springenberg and colleagues, in their 2014 Striving for Simplicity paper, took this idea to its logical conclusion: their all-convolutional network removed every pooling layer, replaced each with a stride-2 convolution, and matched the accuracy of pooling-based competitors on CIFAR-10. ResNet downsamples almost exclusively through stride-2 convolutions inside its residual blocks. Most modern networks follow the same pattern: max pooling appears early, if at all, and the deeper downsampling is done by strided convolutions. The cost is a few extra parameters and a little more compute; the benefit is that the network learns its own reduction rather than being forced into a fixed one. Pooling, however, has not disappeared: it is parameter-free, trivial to implement, and makes a sensible default early in the network where the strongest-feature heuristic genuinely helps.
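The comparison is easy to see in code. The sketch below (PyTorch, with channel counts chosen arbitrarily for illustration) downsamples the same feature map with a max pool and with a stride-2 convolution: both halve the spatial grid, but only the convolution carries learnable parameters and changes the channel count.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                         # example feature map

pool    = nn.MaxPool2d(kernel_size=2, stride=2)        # fixed, parameter-free
strided = nn.Conv2d(64, 128, kernel_size=3,
                    stride=2, padding=1)               # learned, mixes channels

print(pool(x).shape)       # torch.Size([1, 64, 28, 28])
print(strided(x).shape)    # torch.Size([1, 128, 28, 28])
print(sum(p.numel() for p in strided.parameters()))    # 64*128*3*3 + 128 = 73856
```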

What you should take away

  1. Pooling shrinks the spatial grid by aggregating over $k \times k$ windows. The standard configuration is a $2 \times 2$ window with stride 2, which halves height and width and quarters the number of spatial positions.
  2. Max pooling preserves the strongest evidence for each feature within a window and gives a small amount of translation invariance, making it the classical default early in a classifier.
  3. Average pooling smooths rather than peaks, and is preferred in segmentation heads and in the global-pooling step at the top of modern classifiers.
  4. Global average pooling replaces the giant fully connected head of a VGG-style network with a single mean-per-channel reduction, dramatically cutting parameter count and removing the fixed-input-size constraint.
  5. Strided convolutions are the modern alternative to pooling: they downsample and learn the reduction simultaneously, and they dominate the deeper stages of contemporary architectures such as ResNet and ConvNeXt.
