Epoch, Glossary, Textbook of AI

An epoch is one complete pass through the training dataset. If the training set has $N$ examples and the batch size is $B$, then one epoch consists of $\lceil N / B \rceil$ parameter updates, one per mini-batch. Training typically proceeds for many epochs (tens to hundreds for classical deep learning), though modern large-language-model pretraining often sees each token only once, making the concept of "epoch" less directly applicable.

Why epochs matter

The number of epochs is a central hyperparameter of training. The epoch count interacts with:

Generalisation: too few epochs and the model underfits (training error still high); too many and the model overfits (validation error rising while training error keeps falling).
Compute budget: GPU-hours scale linearly with epochs.
Optimiser dynamics: many learning-rate schedules are specified in epoch units.
Reproducibility: experimental protocols are quoted as "trained for 90 epochs", "fine-tuned for 3 epochs", etc.

Early stopping automates the choice by monitoring validation loss and halting training when it stops improving, usually with a patience window of several epochs.
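The patience rule can be sketched in a few lines of Python. This is a minimal illustration, not any particular framework's callback: the class name `EarlyStopping`, the helper `stopping_epoch`, and the `min_delta` threshold are assumptions introduced here for the example.

```python
from typing import Sequence, Tuple


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0) -> None:
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def stopping_epoch(val_losses: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (epoch at which training halts, epoch with the best validation loss)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses):
        if stopper.step(epoch, loss):
            return epoch, stopper.best_epoch
    return len(val_losses) - 1, stopper.best_epoch
```

In practice one would also checkpoint the model at `best_epoch` and restore it after stopping, since the final weights come from `patience` epochs past the best validation loss.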

Learning-rate schedules in epoch units

Classical recipes specify learning-rate schedules in epochs:

Step decay: divide the learning rate by 10 at epochs 30, 60, 90 (the standard ResNet-on-ImageNet recipe).
Cosine annealing: $\eta_t = \eta_\min + \tfrac{1}{2}(\eta_\max - \eta_\min)(1 + \cos(\pi t / T))$ over a budget of $T$ epochs.
Warm-up: linear ramp for the first 5-10 epochs to stabilise large-batch training, followed by the main schedule.
One-cycle: increase learning rate to a peak in the first half of training, anneal to a minimum in the second half (Smith, 2017).
SGDR (Stochastic Gradient Descent with Warm Restarts): cosine annealing with periodic restarts every $T_i$ epochs (Loshchilov & Hutter, 2017).
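The schedules above can be written as plain functions of the epoch index. The following is an illustrative sketch in Python rather than a framework API: the function names are made up for this example, the restart variant assumes a fixed period (SGDR also allows the period to grow between restarts), and the warm-up helper assumes a linear ramp up to the base learning rate.

```python
import math
from typing import Callable, Sequence


def step_decay(epoch: int, base_lr: float,
               milestones: Sequence[int] = (30, 60, 90), gamma: float = 0.1) -> float:
    """Multiply the learning rate by `gamma` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def cosine_annealing(epoch: int, total_epochs: int,
                     lr_max: float, lr_min: float = 0.0) -> float:
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))


def cosine_with_restarts(epoch: int, period: int,
                         lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing that restarts every `period` epochs (fixed-period simplification)."""
    return cosine_annealing(epoch % period, period, lr_max, lr_min)


def with_warmup(epoch: int, warmup_epochs: int, base_lr: float,
                main_schedule: Callable[[int], float]) -> float:
    """Linear ramp to `base_lr` over the first `warmup_epochs`, then the main schedule."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return main_schedule(epoch)
```

For example, `with_warmup(e, 5, 0.1, lambda t: step_decay(t, 0.1))` combines a five-epoch warm-up with step decay. Real training loops often evaluate these schedules per step rather than per epoch by passing a fractional epoch.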

Steps versus epochs

For very large datasets, partial epochs are the meaningful unit: one might evaluate on validation every $k$ steps rather than every epoch. The step (one parameter update) is the more fundamental unit, and modern training frameworks let you cap training in either unit: PyTorch Lightning's Trainer accepts max_epochs and max_steps, and Hugging Face's TrainingArguments accepts num_train_epochs and max_steps.

The relationship is:

$$\text{steps per epoch} = \left\lceil \frac{N}{B \cdot G \cdot W} \right\rceil,$$

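where $N$ is the dataset size, $B$ the per-device micro-batch, $G$ the gradient-accumulation factor, and $W$ the number of data-parallel workers. The product $B \cdot G \cdot W$ is the effective batch size consumed per parameter update. With $B$ and $G$ held fixed, doubling the worker count halves steps-per-epoch (up to rounding).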
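As a quick check of the formula, here is a small Python sketch. The function names and the one-million-example dataset are hypothetical, chosen only to make the arithmetic concrete.

```python
def steps_per_epoch(num_examples: int, micro_batch: int,
                    grad_accum: int = 1, workers: int = 1) -> int:
    """ceil(N / (B * G * W)), computed with integer arithmetic."""
    effective_batch = micro_batch * grad_accum * workers
    return -(-num_examples // effective_batch)  # ceiling division


def total_steps(num_examples: int, epochs: int, micro_batch: int,
                grad_accum: int = 1, workers: int = 1) -> int:
    """Convert an epoch budget into a number of parameter updates."""
    return epochs * steps_per_epoch(num_examples, micro_batch, grad_accum, workers)


# Hypothetical dataset of 1,000,000 examples, micro-batch 32, no accumulation.
print(steps_per_epoch(1_000_000, micro_batch=32, workers=8))   # 3907
print(steps_per_epoch(1_000_000, micro_batch=32, workers=16))  # 1954
```

Going from 8 to 16 workers takes the count from 3907 to 1954 rather than exactly half, because the last partial batch still costs one step. The sketch assumes that partial batch is kept; a data loader that drops it would use floor division instead.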

The LLM era: training-token budgets

Modern LLM pretraining corpora contain trillions of tokens, so even a single epoch is a very large training run. Training therefore typically runs for a fixed compute budget or a fixed token count, and the concept of "epoch" becomes less central. Llama-2 was trained on 2 trillion tokens; Llama-3 on about 15 trillion. Each token is typically seen once or only a small constant number of times.

The Chinchilla scaling laws (Hoffmann et al., 2022) make this precise: for compute-optimal training, the number of training tokens should be roughly 20× the parameter count. A 70B-parameter model wants 1.4 trillion tokens. In this regime the relevant quantities are:

Parameters $N$ (here the model's parameter count, not the dataset size that $N$ denoted above).
Training tokens $D$.
Compute FLOPs $C \approx 6ND$.

These three together, not epochs, predict performance via the scaling laws.
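The arithmetic behind these quantities is easy to reproduce. The sketch below uses only the two approximations quoted above, roughly 20 tokens per parameter and $C \approx 6ND$; the function names are made up for this example, and `compute_optimal_split` simply solves $C = 6 \cdot N \cdot 20N$ for $N$, which is a rough rule of thumb rather than the fitted scaling law itself.

```python
import math
from typing import Tuple


def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count D for a model with `params` parameters."""
    return tokens_per_param * params


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute C = 6 * N * D, in FLOPs."""
    return 6.0 * params * tokens


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0) -> Tuple[float, float]:
    """Given a FLOP budget, return (N, D) satisfying C = 6ND and D = tokens_per_param * N."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


n = 70e9                      # 70B parameters
d = chinchilla_tokens(n)      # about 1.4e12 tokens
c = training_flops(n, d)      # about 5.88e23 FLOPs
print(f"D = {d:.2e} tokens, C = {c:.2e} FLOPs")
```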

Multi-epoch training in the LLM era

Whether to train for multiple epochs in the LLM regime is a contested empirical question. Muennighoff et al. (2023) found that repeating data for up to ~4 epochs can be beneficial when data is the bottleneck (the data-constrained scaling laws). For fine-tuning of pre-trained LLMs, 1-3 epochs is typical; training for longer more often hurts than helps. For instruction tuning and RLHF, the relevant unit is again steps or examples seen.
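When training is specified as a token budget, the implied number of epochs is simply the budget divided by the number of unique tokens available. A minimal Python sketch, with a hypothetical function name and made-up corpus sizes:

```python
def implied_epochs(token_budget: float, unique_tokens: float) -> float:
    """Average number of times each unique token is seen during training."""
    return token_budget / unique_tokens


# Hypothetical case: a 1.4-trillion-token budget over 0.5 trillion unique tokens.
epochs = implied_epochs(1.4e12, 0.5e12)
print(f"{epochs:.1f} passes over the corpus")  # 2.8 passes over the corpus
```

A value well above the ~4 epochs mentioned above suggests that data, not compute, is the binding constraint.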

Practical recommendations

For classical deep learning (vision, tabular, speech), epochs remain the right unit; use early stopping with patience to choose. For LLM pretraining, switch to token budgets and scaling laws. For LLM fine-tuning, stay near 1-3 epochs and watch validation loss closely.

Related terms: Batch Size, Gradient Descent, Early Stopping, Scaling Laws

Discussed in:

Chapter 6: ML Fundamentals, Training Neural Networks
Chapter 9: Neural Networks, Scaling
