AlexNet is the convolutional neural network described in the 2012 paper ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. It won the 2012 ImageNet Large Scale Visual Recognition Challenge with a top-5 error of 15.3%, more than ten percentage points below the second-place entry's 26.2%.
AlexNet's architecture comprises five convolutional layers (with 96, 256, 384, 384 and 256 filters respectively), three fully-connected layers, and a 1000-way softmax output. The network had 60 million parameters and was trained over five to six days on two Nvidia GTX 580 GPUs using stochastic gradient descent with momentum, with the network split across the two GPUs to fit in memory.
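As a sanity check on the 60-million figure, the layer sizes above can be turned into a back-of-the-envelope parameter count. This is a sketch that treats the network as a single tower, ignoring the two-GPU grouped connectivity of some layers, so it slightly overcounts; the 6×6×256 flattened input to the first fully-connected layer is taken from the paper's feature-map sizes.

```python
# Back-of-the-envelope AlexNet parameter count (single-tower
# approximation; the real network's grouped two-GPU connectivity
# makes some layers smaller, bringing the total to ~60 million).
conv = [
    # (num_filters, kernel_h, kernel_w, input_channels)
    (96, 11, 11, 3),    # conv1
    (256, 5, 5, 96),    # conv2
    (384, 3, 3, 256),   # conv3
    (384, 3, 3, 384),   # conv4
    (256, 3, 3, 384),   # conv5
]
fc = [
    # (output_units, input_units); 6*6*256 is the flattened conv5 output
    (4096, 6 * 6 * 256),  # fc6
    (4096, 4096),         # fc7
    (1000, 4096),         # fc8 -> 1000-way softmax
]
total = sum(n * (kh * kw * c + 1) for n, kh, kw, c in conv)  # +1 bias per filter
total += sum(out * (inp + 1) for out, inp in fc)             # +1 bias per unit
print(f"{total:,} parameters")  # roughly 62 million in this approximation
```

The fully-connected layers dominate: fc6 alone holds more than half the parameters, which is also why dropout was applied there rather than in the convolutional layers.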
Several architectural and methodological choices, novel or only recently introduced at the time, defined the field for years afterwards:
- ReLU activation (rectified linear unit, max(0, x)): trains much faster than saturating sigmoid or tanh units.
- Dropout in the fully-connected layers: strong regularisation against overfitting, important given the era's comparatively small training sets.
- GPU training: the dataset and network were too large for CPU training to be practical.
- Data augmentation: random crops, horizontal flips, and RGB-channel perturbation.
- Local response normalisation: a form of contrast normalisation across channels, later largely superseded by batch normalisation.
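The first two items can be sketched in a few lines of NumPy. Note one hedge: the dropout shown is the modern "inverted" variant, which scales surviving activations at training time; the original paper instead kept activations unscaled during training and multiplied outputs by 0.5 at test time. The two are equivalent in expectation for p = 0.5.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: element-wise max(0, x).
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # Inverted dropout: at training time, zero each unit with
    # probability p and scale survivors by 1/(1-p) so the expected
    # activation is unchanged and no rescaling is needed at test time.
    if not train:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

At test time (`train=False`) dropout is the identity, so the same forward pass serves both phases.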
AlexNet's victory is widely regarded as the moment the modern deep-learning era began. Within a year, most competitive computer-vision research had moved to deep CNNs; within five years, the same architectural pattern had begun to dominate speech recognition, machine translation, and many other fields. Without ImageNet (sufficient data) and GPUs (sufficient compute), the AlexNet result would not have been possible, and the deep-learning revolution that followed it would likely have been delayed by years.
Related terms: alex-krizhevsky, ImageNet, Convolutional Neural Network, Deep Learning
Discussed in:
- Chapter 11: CNNs, CNNs in Vision