Summary

This chapter has presented convolutional neural networks as the backbone of modern computer vision. The convolution operation exploits the structure natural images possess: locality and translation equivariance. Together with weight sharing, this yields the parameter economy that makes deep visual models practical at all. Stride, padding, and dilation give us fine control over spatial resolution. Pooling, and global average pooling in particular, summarises feature maps into compact representations. The architectural lineage from LeNet-5 (1998) through AlexNet (2012), VGG (2014), Inception (2014), and ResNet (2015) to DenseNet, MobileNet, EfficientNet, and ConvNeXt represents a quarter-century of accumulated wisdom about how to train very deep networks. Object detection (R-CNN through DETR) and segmentation (FCN, U-Net, Mask R-CNN, SAM) extend classification to dense prediction. Transfer learning, batch normalisation, and self-supervised pretraining are the practical scaffolding that makes the whole thing work in real applications. The chapter's PyTorch ResNet on CIFAR-10 brings these ideas together in working code; Chapters 12 and 13 will set CNNs alongside Transformers, and from then on the two architectures will coexist.
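The control that stride, padding, and dilation give over spatial resolution comes down to one standard formula. As a minimal sketch (the formula matches the shape rule documented for `torch.nn.Conv2d`; the function name and example sizes here are illustrative, not from the chapter):

```python
import math

def conv_out_size(n, k, stride=1, padding=0, dilation=1):
    """Spatial size of a feature map after a convolution:
    floor((n + 2*padding - dilation*(k-1) - 1) / stride) + 1."""
    return math.floor((n + 2 * padding - dilation * (k - 1) - 1) / stride) + 1

# A 3x3 "same" convolution keeps a 32x32 CIFAR-10 map at 32x32:
print(conv_out_size(32, k=3, stride=1, padding=1))              # -> 32
# Stride 2 halves the resolution:
print(conv_out_size(32, k=3, stride=2, padding=1))              # -> 16
# Dilation 2 makes a 3x3 kernel cover a 5x5 receptive window:
print(conv_out_size(32, k=3, stride=1, padding=0, dilation=2))  # -> 28
```

Chaining this function layer by layer reproduces the downsampling schedule of a network such as the chapter's ResNet, and makes it easy to check where global average pooling will collapse the remaining spatial grid to 1x1.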

Contact: Chris Paton



AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).