Chapter Eleven

CNNs

Learning Objectives
  1. Describe the convolution operation, filters, stride, and padding, and compute output dimensions
  2. Explain the role of pooling layers and compare max-pooling and average-pooling
  3. Trace the evolution of CNN architectures from LeNet through AlexNet, VGG, ResNet, and beyond
  4. Use transfer learning to adapt pretrained networks to new tasks with limited data
  5. Outline the architectures used for object detection (YOLO, Faster R-CNN) and semantic segmentation (U-Net, DeepLab)

Your eyes do not process a whole scene at once. Cells in the primary visual cortex respond to small local regions (edges of a particular orientation, dots of a particular size, textures of a particular grain), and later stages of the visual pathway combine these elementary detectors into objects, faces and scenes. David Hubel and Torsten Wiesel demonstrated this hierarchical organisation in cats in 1959, work that earned them the 1981 Nobel Prize in Physiology or Medicine. About two decades later, Kunihiko Fukushima built the Neocognitron (1980), a neural network whose architecture mirrored the cortical hierarchy. Nine years after that, Yann LeCun's group at Bell Labs trained a similar architecture by gradient descent and called it a convolutional neural network.

A CNN restricts each neuron to a small patch of the input. It uses the same set of weights at every spatial position, so an edge detector that works in the top-left corner also works in the bottom-right. This cuts the parameter count by orders of magnitude relative to a fully connected layer with the same input and output sizes, and it bakes in two important inductive biases. The first is locality: the most informative visual features (edges, corners, blobs of colour, fragments of texture) live in small neighbourhoods. The second is translation equivariance: shifting the input shifts the output by the same amount (exactly so at stride 1 and away from the image borders). Pooling layers, applied on top, add a measure of translation invariance: small spatial shifts often leave the pooled output unchanged.
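Both properties can be checked on a toy example. The following is a minimal sketch in plain Python, not the chapter's own code: the helper names (cross_correlate, shift_right), the 5×7 image and the 3×3 edge kernel are all assumptions chosen for illustration. It slides one shared kernel over an image, repeats the operation on a copy shifted one pixel to the right, and compares the number of weights with that of a dense layer of the same input and output size.

```python
# Illustrative sketch only: function and variable names are assumptions,
# not taken from the chapter.

def cross_correlate(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(grid, fill=0):
    """Move every row one pixel to the right, filling the left edge."""
    return [[fill] + row[:-1] for row in grid]


# A 3x3 vertical-edge detector: responds where brightness rises left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

# A 5x7 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1, 1, 1] for _ in range(5)]

response = cross_correlate(image, kernel)
shifted_response = cross_correlate(shift_right(image), kernel)

# Equivariance: away from the left border, the response to the shifted image
# is the original response moved one column to the right.
equivariant = [row[1:] for row in shifted_response] == [row[:-1] for row in response]

# Weight sharing: one 3x3 kernel versus a dense layer mapping the same
# 5x7 input to the same 3x5 output (biases ignored).
shared_weights = len(kernel) * len(kernel[0])
dense_weights = (5 * 7) * (3 * 5)

print(response[0])          # [3, 3, 0, 0, 0]
print(shifted_response[0])  # [0, 3, 3, 0, 0]
print(equivariant, shared_weights, dense_weights)  # True 9 525
```

Even at this toy size the shared kernel needs 9 weights where the dense layer needs 525, and the gap widens rapidly as the image grows. The comparison skips the first output column because that column depends on whatever value is used to fill the vacated border pixels.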

This chapter develops the theory and practice of convolutional networks. We start with the convolution operation itself, working through the algebra, computing receptive fields, and walking through a 5×5 input by hand. We then build pooling layers, examine the LeNet-5 reference architecture, and trace the lineage that produced AlexNet (2012), VGG (2014), Inception (2014), ResNet (2015), DenseNet (2017), MobileNet (2017), EfficientNet (2019), and ConvNeXt (2022). We cover batch normalisation, transfer learning, object detection (R-CNN through DETR), and semantic and instance segmentation (FCN, U-Net, Mask R-CNN, the Segment Anything Model). We conclude with a small ResNet trained from scratch on CIFAR-10 in PyTorch and a forward pointer to Vision Transformers in Chapter 13.
