11.9 Beyond CNNs
For a decade after AlexNet, convolutional networks were the dominant architecture for vision. They are no longer alone. Vision Transformers (ViT, 2020) treat an image as a sequence of patches and apply the same self-attention machinery that revolutionised natural language processing, the subject of the next two chapters. With sufficient pretraining data, ViTs match or exceed the best CNNs on ImageNet, and they extend more readily to multimodal tasks because the underlying architecture is the same as in language models.
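The patch step is simple enough to show directly. The sketch below (a minimal illustration, not code from any ViT implementation; the function name `patchify` and the 16-pixel patch size are choices made here for concreteness) splits an image into non-overlapping patches and flattens each into a vector, producing the token sequence that self-attention operates on:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # Carve the height and width axes into (grid, patch) pairs ...
    x = image.reshape(H // P, P, W // P, P, C)
    # ... group the two grid axes together, then flatten each patch.
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return x  # shape: (num_patches, P*P*C)

img = np.random.rand(224, 224, 3)   # standard ImageNet resolution
tokens = patchify(img, 16)
print(tokens.shape)                 # (196, 768): 14x14 patches of 16*16*3 values
```

In a full ViT each 768-dimensional patch vector is then linearly projected to the model width and given a positional embedding, after which the transformer treats it exactly like a word token.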
Yet the convolutional inductive biases (locality, weight sharing, translation equivariance) remain valuable, especially when data is scarce or compute is constrained. Designs that mix the two ideas, pairing a lightweight convolutional stem or windowed locality with attention (the Swin Transformer, CoAtNet), or modernising a CNN with transformer design choices (ConvNeXt), often outperform either pure approach. The right answer to "convolutions or attention?" is: both, in the right places.
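The stem-plus-attention pattern can be sketched in a few lines. This is a toy illustration under assumptions made here (single head, no normalisation or residuals, random weights, a stride-4 stem); it is not any published architecture. It shows the division of labour: a strided convolution builds local features cheaply, then global self-attention mixes information across the whole token grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_stem(image, weight, patch=4):
    """A stride-`patch` convolution, i.e. a shared linear projection of
    each non-overlapping patch: locality plus weight sharing."""
    H, W, C = image.shape
    P = patch
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return x @ weight  # (num_tokens, d_model)

def self_attention(X, Wq, Wk, Wv):
    """Single-head global self-attention over the stem's tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # rows are attention weights
    return A @ V

d = 32
img = rng.standard_normal((32, 32, 3))
W_stem = rng.standard_normal((4 * 4 * 3, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

tokens = conv_stem(img, W_stem)           # (64, 32): local features
out = self_attention(tokens, Wq, Wk, Wv)  # (64, 32): globally mixed
print(tokens.shape, out.shape)
```

The design point is the ordering: the convolutional part does the spatially repetitive, data-efficient work near the input, and attention is spent where long-range interactions matter.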
Chapter 12 develops attention from first principles, motivated by sequence modelling and the limitations of recurrent networks; Chapter 13 ties this to vision via the Vision Transformer, completing the connection between the two architectures that have defined the modern era of deep learning.