Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, & Kaiming He (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
arXiv:1706.02677.
URL: https://arxiv.org/abs/1706.02677
Abstract. Facebook AI Research's recipe for scaling minibatch SGD to hundreds of GPUs while preserving generalisation. Introduces the linear scaling rule: when the minibatch size is multiplied by k, multiply the learning rate by k as well. The rule is paired with a gradual learning-rate warm-up to handle the instability that the enlarged learning rate causes early in training. Trains ResNet-50 on ImageNet to 76.3% top-1 accuracy in one hour on 256 GPUs with a minibatch size of 8,192. The recipe became standard in the large-batch deep-learning era and quantitatively related batch size to the gradient noise injected by SGD.
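The two ingredients are simple enough to sketch in a few lines of Python. The reference point (learning rate 0.1 at minibatch size 256) and the roughly five-epoch warm-up come from the paper; the function names and the per-step linear ramp are illustrative choices, not the authors' code:

```python
def scaled_lr(batch_size: int, base_lr: float = 0.1, base_batch: int = 256) -> float:
    """Linear scaling rule: if the minibatch grows by a factor k,
    multiply the learning rate by k (paper's reference point:
    lr 0.1 at minibatch size 256)."""
    return base_lr * batch_size / base_batch

def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Gradual warm-up: ramp the learning rate linearly from start_lr
    to the scaled target over the first warmup_steps updates (about
    5 epochs in the paper), then hand off to the usual schedule."""
    if step < warmup_steps:
        return start_lr + (target_lr - start_lr) * step / warmup_steps
    return target_lr

# Minibatch 8,192 is 32x the reference batch, so the target rate is 0.1 * 32.
target = scaled_lr(8192)  # 3.2, the paper's large-batch setting
```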
Tags: optimisation distributed-training imagenet
Cited in: