Glossary

Batch Size

The batch size is the number of training examples processed together in each gradient update. In practice, modern deep learning almost always uses mini-batch gradient descent: rather than computing gradients over the full dataset (batch gradient descent) or a single example (pure stochastic gradient descent), one computes gradients over a mini-batch of typically 32 to 512 examples. The choice of batch size affects training speed, memory consumption, generalisation, and optimisation dynamics.
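As a minimal sketch (a toy NumPy example, not taken from any particular framework), mini-batch gradient descent on a synthetic linear-regression problem looks like this; the dataset size, batch size of 64, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise.
X = rng.normal(size=(1024, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.01 * rng.normal(size=1024)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

batch_size = 64   # mini-batch: between 1 (pure SGD) and 1024 (full batch)
lr = 0.05
w = np.zeros(8)

for epoch in range(20):
    perm = rng.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean-squared error over this mini-batch only.
        grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad
```

Each epoch performs 1024 / 64 = 16 noisy gradient updates instead of one exact full-batch update, which is the trade-off the entry describes.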

Larger batches provide more accurate gradient estimates and make better use of GPU parallelism, speeding up each epoch. However, very large batches can harm generalisation: the reduced noise in the gradient estimate allows the optimiser to find sharper minima, which tend to generalise worse than the flatter minima found with noisier gradients. There is also a practical cap: batch size is limited by GPU memory, and scaling batch size often requires multiple GPUs with gradient accumulation or distributed data parallelism.
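Gradient accumulation, mentioned above as a workaround for the memory cap, can be sketched as follows. This is a hypothetical NumPy illustration of the idea only: micro-batch gradients are averaged before a single weight update, which reproduces the step a memory-busting large batch would have taken:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
w_true = rng.normal(size=4)
y = X @ w_true

def grad(w, Xb, yb):
    # Mean-squared-error gradient for a linear model on one (micro-)batch.
    return (2.0 / len(Xb)) * Xb.T @ (Xb @ w - yb)

w = np.zeros(4)
lr = 0.1
micro_batches = np.split(np.arange(256), 4)   # pretend only 64 examples fit

# Accumulate micro-batch gradients, average, then apply ONE update.
acc = np.zeros_like(w)
for idx in micro_batches:
    acc += grad(w, X[idx], y[idx])
w_accum = w - lr * acc / len(micro_batches)

# The single large-batch step it emulates.
w_big = w - lr * grad(w, X, y)
```

Because the micro-batches are equal-sized, the averaged accumulated gradient equals the full batch-of-256 gradient, so `w_accum` and `w_big` match to floating-point precision.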

The linear scaling rule suggests that the learning rate should be scaled linearly with batch size: doubling the batch size warrants doubling the learning rate, provided sufficient warmup is used. This rule enables training with very large batches (thousands of examples or more) on large-scale distributed systems while preserving final accuracy. LAMB (Layer-wise Adaptive Moments optimizer for Batch training) was developed specifically to enable stable training of BERT with batch sizes in the tens of thousands. Modern pretraining of large language models routinely uses batch sizes of millions of tokens, distributed across thousands of accelerators.
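The linear scaling rule with warmup can be expressed in a few lines. The function names and the base values (base learning rate 0.1 at batch 256, scaled to batch 8192) are illustrative assumptions, not from any specific framework:

```python
def scaled_lr(base_lr, base_batch, batch):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    return base_lr * batch / base_batch

def warmup_lr(step, warmup_steps, target_lr):
    # Linear warmup: ramp from near zero to target_lr, then hold it.
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# Batch size goes 256 -> 8192 (x32), so the learning rate scales x32 too.
target = scaled_lr(0.1, base_batch=256, batch=8192)
schedule = [warmup_lr(s, warmup_steps=5, target_lr=target) for s in range(8)]
```

The warmup ramp matters because applying the fully scaled learning rate from step zero is a common cause of divergence with large batches.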

Related terms: Stochastic Gradient Descent, Gradient Descent, Learning Rate

Also defined in: Textbook of AI