Scaling Laws in deep learning describe how model performance improves as a function of scale—measured by parameter count, training data size, and compute budget. The seminal work of Kaplan et al. (2020) at OpenAI showed that the cross-entropy loss of language models follows a smooth power law in each of these variables, with diminishing but predictable returns. These empirical observations transformed the field's understanding of what determines model performance and guided the investment in ever-larger models.
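The power-law form can be sketched concretely. The constants below are the approximate fitted values reported by Kaplan et al. (2020); treat them as illustrative, not exact:

```python
# Sketch of the Kaplan et al. (2020) power-law fits. Constants are the
# paper's approximate fitted values, used here only for illustration.

def loss_from_params(n_params: float) -> float:
    """Cross-entropy loss as a function of (non-embedding) parameter count N."""
    N_c = 8.8e13      # fitted constant from Kaplan et al. (approximate)
    alpha_N = 0.076   # fitted exponent (approximate)
    return (N_c / n_params) ** alpha_N

def loss_from_data(n_tokens: float) -> float:
    """Cross-entropy loss as a function of dataset size D in tokens."""
    D_c = 5.4e13      # fitted constant (approximate)
    alpha_D = 0.095   # fitted exponent (approximate)
    return (D_c / n_tokens) ** alpha_D

# Loss falls smoothly and predictably as scale grows:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}: predicted loss ~{loss_from_params(n):.2f}")
```

The small exponents are the point: loss improves reliably with scale, but only as a weak power of it, which is what makes the returns "diminishing but predictable."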
Hoffmann et al. (2022)—the "Chinchilla paper"—refined the analysis by showing that the number of training tokens should be scaled roughly in proportion to the number of parameters. By this analysis, GPT-3 (175B parameters, 300B tokens) was significantly undertrained: a model of that size should have seen roughly 3.4 trillion tokens. This insight shifted the field's emphasis from "just make the model bigger" to "train a moderate-sized model on much more data." LLaMA exemplified this approach: a 65B-parameter model trained on 1.4T tokens matched GPT-3 performance at a fraction of the inference cost.
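The Chinchilla allocation can be sketched with two common approximations: training compute C ≈ 6ND FLOPs, and a compute-optimal token budget of roughly 20 tokens per parameter (the paper's fitted exponents differ slightly across its estimation approaches):

```python
# Sketch of Chinchilla-style compute-optimal allocation (Hoffmann et al., 2022),
# using the rough approximations C ~= 6*N*D FLOPs and D_opt ~= 20*N.
# The paper's fitted coefficients differ slightly; this is illustrative.

def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Given a FLOP budget C, return (params N, tokens D) with D = 20N and C = 6ND."""
    # C = 6 * N * (20 * N) = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

def token_ratio(n_params: float, n_tokens: float) -> float:
    """Actual tokens relative to the ~20-tokens-per-parameter heuristic."""
    return n_tokens / (20 * n_params)

# GPT-3: 175B parameters on 300B tokens -- roughly 12x fewer tokens
# than the heuristic suggests, i.e. significantly undertrained.
print(token_ratio(175e9, 300e9))
```

Under these approximations, compute-optimal parameter count and token count both grow as the square root of the compute budget, which is why the optimal strategy scales them together rather than growing the model alone.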
Scaling laws have important practical implications. They allow researchers to predict, with reasonable accuracy, the loss of a model at a given size and data budget before training it, enabling rational allocation of compute. They also imply steeply diminishing returns: because loss falls as only a small power of compute, even a modest reduction in loss requires roughly an order of magnitude more compute, suggesting that raw scaling will eventually become economically impractical. More recent work explores scaling laws for specific capabilities (reasoning, coding, math), for mixture-of-experts models, and for multimodal pretraining. Scaling laws are the closest thing deep learning has to a physics: empirical regularities that hold across orders of magnitude.
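The cost of diminishing returns can be made concrete by inverting the compute power law. The constants below are the approximate Kaplan et al. (2020) values for loss versus training compute (in petaflop-days); they are illustrative only:

```python
# Sketch: under a compute power law L(C) = (C_c / C)**alpha, how much extra
# compute does a fixed loss improvement cost? alpha ~ 0.050 and C_c ~ 3.1e8
# (petaflop-days) are the rough Kaplan et al. (2020) values; illustrative only.

def compute_for_loss(target_loss: float,
                     C_c: float = 3.1e8,
                     alpha: float = 0.050) -> float:
    """Invert L = (C_c / C)**alpha to get the compute C needed for a target loss."""
    return C_c / target_loss ** (1 / alpha)

# Each successive 10% loss reduction multiplies the compute bill:
for loss in (3.0, 2.7, 2.43):
    print(f"loss {loss:.2f}: compute ~{compute_for_loss(loss):.3g} PF-days")
```

With an exponent of 0.05, a 10% loss reduction costs (1/0.9)^20 ≈ 8× more compute, which is the arithmetic behind "an order of magnitude more compute for a modest improvement."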
Related terms: Large Language Model
Discussed in:
- Chapter 15: Modern AI — Large Language Models
Also defined in: Textbook of AI