Chapter Eleven

Convolutional Neural Networks

Learning Objectives
  1. Describe the convolution operation, filters, stride, and padding, and compute output dimensions
  2. Explain the role of pooling layers and compare max-pooling and average-pooling
  3. Trace the evolution of CNN architectures from LeNet through AlexNet, VGG, ResNet, and beyond
  4. Use transfer learning to adapt pretrained networks to new tasks with limited data
  5. Outline the architectures used for object detection (YOLO, Faster R-CNN) and semantic segmentation (U-Net, DeepLab)

Your eyes do not process a whole scene at once. Cells in your visual cortex respond to small local regions — edges, corners, textures — and deeper layers of the brain combine these into objects and scenes. Convolutional neural networks (CNNs) work the same way.

A CNN forces each neuron to look at a small patch of the input. It uses the same set of weights at every spot, so an edge finder that works in the top-left corner also works in the bottom-right. This cuts the number of weights by a huge factor, while baking in two key inductive biases: local features matter (local connectivity), and they can show up anywhere (translation equivariance). Convolution is equivariant — shifting the input shifts the output by the same amount. Pooling adds a degree of translation invariance, where small shifts produce no change at all.
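
This equivariance is easy to check numerically. The sketch below uses a hand-rolled 1D cross-correlation (not any particular framework's API) and verifies that convolving a shifted input gives a shifted output:

```python
import numpy as np

def correlate1d_valid(x, k):
    """Valid-mode 1D cross-correlation: slide kernel k over signal x."""
    n, m = len(x), len(k)
    return np.array([np.dot(x[i:i + m], k) for i in range(n - m + 1)])

x = np.array([0., 0., 1., 2., 3., 0., 0., 0.])
k = np.array([1., -1.])            # simple edge-detecting kernel

y = correlate1d_valid(x, k)
x_shift = np.roll(x, 1)            # shift the input one step to the right
y_shift = correlate1d_valid(x_shift, k)

# Equivariance: the output of the shifted input equals the shifted output
# (away from the borders, where np.roll wraps around).
print(np.allclose(y_shift[1:], y[:-1]))  # → True
```

The same identity holds in 2D, which is why a feature detector trained on one image region applies everywhere.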

This chapter covers the convolution operation, pooling layers, landmark architectures from LeNet to ViT, transfer learning, object detection, and semantic segmentation.

11.1   Convolution

The Core Operation

A convolution slides a small learnable filter — the kernel — across every position of the input. At each position, it computes a dot product between the kernel and the local patch:

Y[i, j] = Σ_{m,n} K[m, n] · X[i + m, j + n]

The kernel is typically 3×3 or 5×5 pixels. The result is a feature map (or activation map) where each entry says how strongly the local patch matches the kernel's pattern. (Technically, this is cross-correlation. Deep learning frameworks call it "convolution" because the kernel is learned, making the flip irrelevant.)
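
The operation translates directly into a few lines of NumPy. This is a naive sketch for clarity, not an efficient implementation:

```python
import numpy as np

def conv2d_valid(X, K):
    """Naive 2D cross-correlation (no kernel flip), valid padding, stride 1."""
    H, W = X.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * X[i:i + kh, j:j + kw])
    return out

# A vertical edge in the input ...
X = np.array([[0., 0., 1., 1.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
# ... and a kernel that responds to left-to-right intensity changes.
K = np.array([[-1., 1.],
              [-1., 1.]])

# Every row of the result reads [0. 2. 0.]: the filter fires only at the edge.
print(conv2d_valid(X, K))
```

Note the output shrinks from 4×4 to 3×3: with no padding, a k×k kernel removes k − 1 positions per dimension.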

Two Key Properties

  • Local connectivity: each output neuron depends only on a small neighbourhood. This matches the fact that the most informative visual features — edges, textures, corners — are local.
  • Weight sharing: the same kernel is applied everywhere. A 3×3 kernel on C input channels has only 9C + 1 parameters, regardless of image size. A fully connected layer connecting every pixel to every output would need millions.

Multiple Kernels

A convolutional layer applies many kernels, each producing its own feature map. If the layer has K kernels, the output is height × width × K. Early layers learn edge detectors and colour blobs. Deeper layers combine these into parts of objects and eventually whole categories.

Design Choices

  • Stride: how far the kernel moves between positions. Stride 1 keeps the spatial size. Stride 2 halves it (built-in downsampling).
  • Padding: adding zeros around the border. "Same" padding keeps the output the same size as the input. "Valid" padding applies no padding, shrinking the output.
  • Dilation (atrous convolution): inserts gaps in the kernel, enlarging the receptive field without adding parameters. Useful for capturing context at multiple scales, especially in segmentation.
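
Together these choices fix the output size via the standard formula out = ⌊(n + 2p − d·(k − 1) − 1) / s⌋ + 1, where n is the input size, k the kernel size, p the padding, s the stride, and d the dilation. A small helper makes the arithmetic concrete (the function name is illustrative, not a library API):

```python
from math import floor

def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension.

    A dilated kernel spans dilation * (k - 1) + 1 input positions.
    """
    effective_k = dilation * (k - 1) + 1
    return floor((n + 2 * padding - effective_k) / stride) + 1

print(conv_output_size(224, 3, stride=1, padding=1))   # → 224 ("same" padding)
print(conv_output_size(224, 3, stride=2, padding=1))   # → 112 (downsampling)
print(conv_output_size(32, 5))                         # → 28 ("valid" padding)
```

As a sanity check, a 3×3 kernel with dilation 2 behaves like a 5×5 kernel for sizing purposes: `conv_output_size(32, 3, dilation=2)` also gives 28.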

Depthwise Separable Convolution

A standard convolution is expensive. The depthwise separable version factorises it into two steps:

  1. Depthwise convolution: apply a separate spatial kernel to each channel independently.
  2. Pointwise convolution (1×1): mix information across channels.

This cuts compute by roughly K² (the kernel size squared) while keeping most of the learning power. It is the basis of lean models like MobileNet (Howard et al., 2017) and Xception, built for phones and small devices.
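
The savings are easy to verify by counting weights. A back-of-the-envelope sketch (biases omitted):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_params(c_in, c_out, k):
    """Depthwise (one k x k kernel per channel) + pointwise (1 x 1) weights."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 256, 256, 3
std = conv_params(c_in, c_out, k)          # 589,824 weights
sep = separable_params(c_in, c_out, k)     # 67,840 weights
print(std / sep)                           # ~8.7x fewer parameters
```

The ratio approaches K² = 9 as the channel counts grow, matching the rough estimate above.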

Activation

After the convolution, outputs pass through a nonlinear step — usually ReLU (set negatives to zero). The mix of linear filtering and nonlinear gating gives CNNs their power. Stacking many such layers builds complex global patterns from simple local ones.

Research continues on extensions: deformable convolutions (which learn offsets for the sampling grid), group convolutions (which partition channels into independently processed groups), and attention-augmented convolutions that combine local inductive bias with global receptive fields.

11.2   Pooling

Pooling layers reduce the spatial size of feature maps, cutting compute and parameters in later layers. A pooling layer works on each channel independently, replacing a small spatial window with a single summary value.

Max Pooling vs Average Pooling

  • Max pooling: keeps the strongest activation in each window. The most common choice for classification. Provides local translation invariance — small shifts within the pooling window do not change the output.
  • Average pooling: keeps the mean activation. Preserves more information about overall intensity.

A typical setup: 2×2 window, stride 2. This halves height and width, reducing spatial positions by 4×.
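
That setup can be written as a reshape-and-reduce in NumPy. A minimal sketch, assuming a single channel with even height and width:

```python
import numpy as np

def max_pool_2x2(X):
    """2x2 max pooling with stride 2 on one channel (even dims assumed)."""
    H, W = X.shape
    # Group pixels into non-overlapping 2x2 windows, then take each window's max.
    return X.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

X = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 0., 5., 6.],
              [1., 2., 7., 8.]])

# Each 2x2 window collapses to its maximum: [[4. 2.], [2. 8.]]
print(max_pool_2x2(X))
```

Swapping `.max` for `.mean` gives average pooling with no other changes.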

Global Average Pooling (GAP)

Modern architectures replace the traditional flatten-then-fully-connected approach with GAP. It computes a single average per channel across the entire spatial extent, producing a vector of length equal to the number of channels. This eliminates many parameters and forces each channel to correspond to a meaningful feature. GAP is used in GoogLeNet, ResNet, EfficientNet, and nearly all modern CNNs.
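
GAP itself is a single reduction. A NumPy sketch, assuming a channels-last feature map:

```python
import numpy as np

# A feature map of shape (height, width, channels), standing in for the
# last convolutional output before the classifier.
feature_map = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)

# Global average pooling: one mean per channel over all spatial positions.
gap = feature_map.mean(axis=(0, 1))
print(gap)   # → [4.5 5.5 6.5]
```

However large the spatial extent, the result is always a vector with one entry per channel, which is what makes GAP input-size agnostic.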

Do We Even Need Pooling?

Some researchers argue that pooling discards useful spatial information and should be replaced with strided convolutions — which downsample but keep learnable parameters. The all-convolutional network (Springenberg et al., 2015) showed that pooling can be eliminated without loss of accuracy. In practice, pooling remains common for its simplicity and parameter-free nature.

Receptive Fields

The receptive field of a neuron is the region of the input image that influences its value. Receptive fields grow with depth, and pooling accelerates this growth by expanding the effective stride of all subsequent layers. A well-designed CNN balances the rate of spatial downsampling with the need to keep enough resolution for the task. Classification can be aggressive with downsampling; segmentation must eventually upsample back to input resolution (Section 11.6).

11.3   CNN Architectures

LeNet-5 (1998)

Yann LeCun's LeNet-5 (LeCun et al., 1998) was the first practical CNN: two convolutional layers, three fully connected layers, average pooling, tanh activations. It read postal codes and bank cheques in production. Modest by today's standards, but it established the template that would persist for decades.

AlexNet (2012)

AlexNet (Krizhevsky et al., 2012) won the ImageNet challenge by over 10 percentage points, launching the deep learning revolution. Five convolutional layers, three fully connected layers. Key innovations: ReLU activations, dropout, GPU training. Top-5 error: 15.3%, smashing the hand-engineered competition.

VGGNet (2014)

VGGNet (Simonyan and Zisserman, 2014) showed that stacking many 3×3 kernels to great depth (16 or 19 layers) works better than using larger kernels. Simple and elegant.

GoogLeNet / Inception (2014)

GoogLeNet (Szegedy et al., 2015) introduced the inception module: convolutions of multiple kernel sizes in parallel, concatenated together. This captures multi-scale features within a single layer.

ResNet (2015)

ResNet (He et al., 2016) introduced the residual connection — the most important architectural innovation in CNN history. As networks grew deeper, training error actually increased (the degradation problem). This was not overfitting — it was an optimisation failure. Deeper networks were harder to train, not just more prone to memorising noise.

ResNet fixed this with shortcut connections that skip layers. Instead of learning the desired mapping H(x) directly, the network learns the residual F(x) = H(x) − x:

output = F(x) + x

If the identity mapping is close to optimal, the network only needs to push F(x) toward zero — much easier than learning H(x) through a stack of nonlinear layers. This enabled training of 50, 101, and 152-layer networks. ResNet-152 hit 3.57% top-5 error on ImageNet, compared to roughly 5% for a human annotator (though the human baseline was a single researcher's informal estimate, so the comparison should be read with care).
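
A minimal NumPy sketch of the idea — a toy fully connected residual block standing in for the convolutional blocks ResNet actually uses:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Toy residual block: output = relu(F(x) + x), with F = W2 @ relu(W1 @ x)."""
    return relu(W2 @ relu(W1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# With the residual branch zeroed out, the block passes x straight through
# (up to the final ReLU): the identity mapping is trivially representable.
W_zero = np.zeros((4, 4))
print(np.allclose(residual_block(x, W_zero, W_zero), relu(x)))  # → True
```

That is the crux: a plain stack of layers must *learn* the identity through its weights, while a residual block gets it for free and only learns the correction.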

After ResNet

  • DenseNet (Huang et al., 2017): connected each layer to all subsequent layers within a block, maximising feature reuse.
  • ResNeXt: grouped convolutions within residual blocks, increasing capacity without proportionally increasing depth or width.
  • SENet (Hu et al., 2018): channel-wise attention that models interdependencies between channels to recalibrate feature maps (squeeze-and-excitation).
  • EfficientNet (Tan and Le, 2019): used neural architecture search (NAS) to find a baseline, then compound scaling — uniformly scaling depth, width, and resolution. The key insight: balancing all three dimensions is more effective than scaling any one alone.

Vision Transformers (ViT)

ViT (Dosovitskiy et al., 2020) challenged CNN dominance. It splits an image into patches, embeds each linearly, and processes the sequence with a standard Transformer encoder. With enough pre-training data, ViT matches or beats the best CNNs. But hybrid architectures — convolutional stems for local features plus Transformer blocks for global attention — often work best. Convolution and attention are complementary, not competing.

11.4   Transfer Learning

The features a CNN learns are highly reusable. A network trained on ImageNet learns edges in early layers and high-level ideas in later layers. These features work far beyond ImageNet. So instead of training from scratch (which needs vast data and compute), you start with a pretrained model and adapt it.

Feature Extraction

Remove the final classification layer. Use the rest of the network as a fixed feature extractor. Pass images through, extract the feature vector (usually from the GAP layer), and train a small classifier on top. This works even with very few target examples — sometimes just dozens per class.
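
A toy NumPy sketch of the recipe, with a fixed random projection standing in for the frozen pretrained backbone (all names and data here are illustrative, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone: a fixed random projection plus
# ReLU. In practice this would be a pretrained CNN up to its GAP layer.
W_backbone = rng.standard_normal((2, 8))
def extract_features(X):
    return np.maximum(X @ W_backbone, 0.0)

# Toy two-class data: label depends on the sum of the two inputs.
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Train only a logistic-regression head on the frozen features.
F = extract_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid of the head's logits
    grad = p - y                              # dLoss/dlogit for cross-entropy
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5) == (y > 0.5)).mean()
print("training accuracy:", round(acc, 2))
```

The backbone's weights are never updated; only the small head is trained, which is why so few labelled examples suffice.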

Fine-Tuning

Unfreeze some or all pretrained layers and continue training with a small learning rate. A common strategy: freeze early layers (generic features that transfer well) and fine-tune later layers (task-specific features). Use a learning rate 10–100× smaller than training from scratch — large updates would destroy the pretrained representations that make transfer learning valuable.

When Does It Work?

Transfer learning works best when source and target are similar (both natural image tasks). But even for very different domains — medical imaging, satellite imagery — starting from ImageNet beats random initialisation. More target data and more fine-tuned layers help bridge larger domain gaps.

Beyond ImageNet

Transfer learning has expanded beyond supervised ImageNet pretraining:

  • Self-supervised methods such as SimCLR (Chen et al., 2020), BYOL, and DINO: learn visual features from unlabelled images by solving pretext tasks — for example, predicting which augmented views come from the same image. These features transfer at least as well as supervised ones.
  • CLIP (Radford et al., 2021): aligns visual features with natural language, enabling zero-shot transfer by describing categories in text.

These methods make strong features available for a wider range of tasks than ever.

11.5   Object Detection

Putting a label on a whole image is one thing. Finding what objects are in the image and where is much harder. Object detection outputs a class label and a bounding box for each object. The model must handle objects of different sizes, counts, and overlap.

Two-Stage Detectors

R-CNN (Girshick et al., 2014) generated ~2,000 candidate regions using selective search, passed each through a CNN to extract features, and classified them with a linear SVM. Accurate but slow — each region needed a separate forward pass. Fast R-CNN computed CNN features once for the whole image, then extracted per-region features using a RoI (Region of Interest) pooling layer. Faster R-CNN (Ren et al., 2015) replaced selective search with a learnable Region Proposal Network (RPN), making the pipeline end-to-end and much faster.

Single-Stage Detectors

YOLO (You Only Look Once; Redmon et al., 2016) divides the image into a grid and predicts bounding boxes and class probabilities in one forward pass. Much faster (45 fps) with somewhat lower accuracy, especially for small objects. SSD improved this by predicting at multiple spatial scales.

Anchors and Anchor-Free Methods

Most detectors use anchor boxes — predefined box shapes at each spatial location. The network predicts offsets from these anchors. Newer anchor-free methods (FCOS, CenterNet) predict objects as centre points plus distances to the four sides. Simpler and often equally effective.

Evaluation

The standard metric is mean average precision (mAP). A predicted box counts as correct (a true positive) if its intersection over union (IoU) with a ground-truth box exceeds a threshold (commonly 0.5). For each class, a precision–recall curve is computed and the average precision (AP) is the area under it. The mAP is the mean of AP across all classes. COCO uses stricter evaluation, averaging mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
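
The IoU test at the heart of this metric is simple to implement. A sketch for axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred  = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
print(iou(pred, truth))   # 50/150 ≈ 0.333: a miss at the common 0.5 cutoff
print(iou(pred, pred))    # 1.0: perfect overlap
```

Note that a box covering half the ground truth scores only ~0.33, not 0.5 — the union in the denominator penalises both missed and spurious area.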

Modern Detectors

Feature Pyramid Networks (FPN) build multi-scale feature pyramids for handling objects at different sizes. DETR (Carion et al., 2020) reframed detection as a set prediction problem using a Transformer with a bipartite matching loss (the Hungarian algorithm matches predictions to ground-truth objects), eliminating anchors and non-maximum suppression entirely. DETR and its successors represent a fundamental rethinking of the detection pipeline.

11.6   Semantic Segmentation

Object detection draws boxes around objects. Semantic segmentation labels every single pixel in the image — road, car, pedestrian, sky. The output is a label map at the same resolution as the input.

This pixel-level detail is key for self-driving cars (exact boundary between road and path), medical imaging (tumour edges), and augmented reality (placing virtual objects in real scenes).

Fully Convolutional Networks (FCN)

Long, Shelhamer, and Darrell (2015) replaced the fully connected layers of a classification CNN with 1×1 convolutions and added upsampling layers to restore spatial resolution. The result can accept inputs of any size and produce a dense prediction map — unlike a classification network with fixed input dimensions. Skip connections combine high-resolution features from early layers with semantically rich features from later layers.

U-Net

U-Net (Ronneberger et al., 2015) extended this with a symmetric encoder-decoder structure. The encoder downsamples through convolutions and pooling. The decoder upsamples through transposed convolutions. Long skip connections concatenate encoder features at each resolution level with decoder features, preserving spatial detail. U-Net and its descendants (U-Net++, Attention U-Net) are the standard for medical image segmentation.

Dilated Convolutions and DeepLab

Dilated convolutions widen the receptive field by putting gaps in the kernel — without shrinking the output or adding weights. The DeepLab family (Chen et al., 2018) uses dilated filters at several rates in an Atrous Spatial Pyramid Pooling (ASPP) module, reading context at several scales. Earlier DeepLab versions also used a conditional random field (CRF) as a post-processing step to sharpen boundaries. DeepLab v3+ pairs ASPP with an encoder-decoder for top results on benchmarks like PASCAL VOC and Cityscapes.

Beyond Semantic Segmentation

  • Instance segmentation: distinguishes individual instances of the same class (separate cars in a parking lot). Mask R-CNN adds a segmentation branch to Faster R-CNN.
  • Panoptic segmentation: assigns every pixel to both a semantic class and a specific instance. It distinguishes "thing" classes (countable objects like cars and people) from "stuff" classes (amorphous regions like sky and road). The most complete form of dense visual understanding.

Evaluation

The standard metric is mean intersection over union (mIoU): for each class, take the overlap between predicted and true pixel sets, divide by their union, then average. Transformer-based models (SegFormer, Mask2Former) now rival purely CNN-based methods, showing that attention and convolution work well together.
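
The metric is a few lines of NumPy. A minimal sketch, using the common convention of skipping classes absent from both maps:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes, on integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                    # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred   = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1]])      # one pixel of class 0 mislabelled as 1

print(mean_iou(pred, target, num_classes=2))   # → 0.775
```

A single wrong pixel hurts both classes here (0.75 for class 0, 0.8 for class 1), which is why mIoU is a stricter measure than plain pixel accuracy.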