LeNet-5, AlexNet, VGG, GoogLeNet, ResNet. Each year deeper, with new tricks.
From Chapter 11: CNNs
Glossary: lenet, alexnet, vgg, googlenet, resnet
Transcript
LeNet-5. 1998. Yann LeCun. Two convolutional layers, two pooling layers, three fully-connected layers. Trained to read handwritten digits on cheques. Around sixty thousand parameters.
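To make that stack concrete, here is a minimal PyTorch sketch; the tanh activations and average pooling follow the 1998 paper, and the class name LeNet5 is my label, not LeCun's code.

```python
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # C1: 32x32 -> 28x28
            nn.AvgPool2d(2),                              # S2: -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # C3: -> 10x10
            nn.AvgPool2d(2),                              # S4: -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# sum(p.numel() for p in LeNet5().parameters()) gives 61,706:
# the "around sixty thousand parameters" above.
```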
AlexNet. 2012. Krizhevsky, Sutskever, Hinton. Eight layers. Sixty million parameters. ReLU activations. Dropout for regularisation. Trained on two GPUs. Won ImageNet by ten percentage points and started the deep learning revolution.
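For comparison, a sketch of the AlexNet stack in the same style. This is the single-GPU variant with the channel counts merged, as commonly reimplemented; the 2012 original split the filters across the two GPUs.

```python
import torch.nn as nn

# Eight learned layers: five convolutional, three fully connected.
alexnet = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # dropout for regularisation
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # ImageNet's thousand classes
)
```

Nearly all of the sixty million parameters sit in those three fully-connected layers at the end.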
VGG. 2014. Simonyan and Zisserman. Sixteen and nineteen layers, built entirely from stacked three-by-three convolutions. The simplest design so far and the deepest yet, at around a hundred and forty million parameters.
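The stacking is the whole idea, so a sketch of one VGG stage may help; vgg_block is my naming, and the stage configuration below is VGG-16's (configuration D in the paper).

```python
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    # Two stacked 3x3 convs cover a 5x5 receptive field, three cover 7x7,
    # with fewer parameters and more nonlinearities than one big filter.
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16's five stages: conv counts 2, 2, 3, 3, 3, followed by
# three fully-connected layers (not shown).
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    vgg_block(256, 512, 3),
    vgg_block(512, 512, 3),
)
```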
GoogLeNet, also called Inception. 2014. Twenty-two layers, careful use of one-by-one convolutions to control parameter count. Multiple parallel filter sizes inside an inception module.
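A sketch of one inception module, with branch widths taken from the first module (3a) in the paper; the class and argument names are mine.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, 5x5 and pooling branches, concatenated on the
    channel axis. The 1x1 convs in front of the 3x3 and 5x5 branches
    shrink the channel count first; that is what keeps parameters down."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

inception_3a = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # 256 output channels
```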
ResNet. 2015. Kaiming He and colleagues. Up to a hundred and fifty-two layers. The breakthrough was the residual connection: each block computes a small adjustment that is added back to its input. Gradients flow directly through the shortcut. Deep networks suddenly trainable.
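The residual block itself is tiny. A sketch of the basic, equal-channel, stride-one case; blocks that change resolution additionally need a projection on the shortcut.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The layers compute an adjustment F(x); the "+ x" is the shortcut
        # that gradients flow straight through.
        return F.relu(out + x)
```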
Top-five error on ImageNet, the same benchmark throughout: LeNet could never run at this scale, AlexNet eighteen percent, VGG seven percent, GoogLeNet six percent, ResNet three point six percent.
After ResNet, the trick spread everywhere. Transformers use residuals around every attention and MLP block. Diffusion U-Nets use them in every stage. The residual is now the default.
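As an illustration of that default, here is the same pattern inside a pre-norm transformer block, sketched with PyTorch's built-in attention; details like the four-times MLP width are conventional, not from this transcript.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.norm2(x))                    # residual around the MLP
        return x
```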