Visualisation

Scaling laws: compute, data, and parameters jointly determine loss

Last reviewed 5 May 2026

Plot loss against compute on a log-log scale and you get a clean line.

From the chapter: Chapter 15: Modern AI

Glossary: scaling laws

Transcript

Around 2020, OpenAI and DeepMind plotted training loss against the compute spent.

On log-log axes, the points fell on a straight line. Pretraining loss decreased as a power law in training compute, parameters, and data.
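In its simplest single-variable form, each of those laws can be sketched as a power law (a sketch only; the constants are fitted values that differ between papers):

    L(X) = (X_c / X)^{\alpha_X},   X in {compute C, parameters N, tokens D}

where X_c sets the scale, \alpha_X is the fitted exponent for that axis, and the other two quantities are assumed not to be the bottleneck. On log-log axes this is a straight line with slope -\alpha_X.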

Three knobs. Compute, in floating-point operations. Parameters, the model size. Data, the number of tokens trained on.
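The three knobs are linked. A common back-of-envelope estimate for dense transformers, not stated in the transcript but widely used, is

    C \approx 6 N D

where C is training compute in FLOPs, N is the parameter count, and D is the number of training tokens: you cannot set all three independently.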

Doubling compute cut loss by a roughly constant fraction. Doubling parameters with data held fixed eventually flattened. Doubling data with parameters held fixed eventually flattened. Each knob mattered, but the joint optimum was specific.

Chinchilla, 2022. Hoffmann and colleagues showed that for a given compute budget, the optimal split is to train roughly twenty tokens per parameter. Earlier large models were under-trained.
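A minimal sketch of that budgeting rule in Python, assuming the rough C ≈ 6ND compute estimate above and a flat twenty-tokens-per-parameter target (both are round rules of thumb, and the function name is illustrative, not from any paper):

    import math

    def chinchilla_split(compute_flops: float,
                         tokens_per_param: float = 20.0,
                         flops_per_param_token: float = 6.0):
        # With C = 6 * N * D and D = tokens_per_param * N,
        # the budget gives C = 6 * tokens_per_param * N**2.
        n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    # Example: a 1e24-FLOP budget suggests roughly a 91B-parameter model
    # trained on roughly 1.8T tokens under these assumptions.
    n, d = chinchilla_split(1e24)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")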

Llama, then DeepSeek, then Qwen applied Chinchilla scaling. Smaller models trained on more data caught up to bigger models trained on less.

The scaling law is empirical, not theoretical. It does not tell you which architecture or which optimiser to use. It says: across many architectures, this is what happens.

For practitioners, scaling laws turn questions like "how much will doubling our cluster help" into rough quantitative predictions, before training even starts.
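As a sketch of what such a prediction looks like (the exponent here is purely illustrative, not a published fit):

    def predicted_loss_ratio(compute_multiplier: float, exponent: float) -> float:
        # If loss scales as compute ** -exponent, then multiplying compute
        # by k multiplies loss by k ** -exponent.
        return compute_multiplier ** -exponent

    # Illustrative exponent of 0.05: doubling the cluster is predicted to
    # multiply loss by about 0.97, a few per cent reduction per doubling.
    print(predicted_loss_ratio(2.0, 0.05))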

Scaling laws also have limits. They describe pretraining loss, not downstream task accuracy. Emergent capabilities, like instruction following or chain-of-thought reasoning, appear suddenly at certain scales. The clean curve in pretraining hides discontinuities downstream.
