Glossary

Chinchilla Scaling

Chinchilla scaling refers to the finding by Hoffmann et al. (2022) at DeepMind that, for a given compute budget, optimal language-model performance is achieved with substantially more training data and a smaller model than Kaplan et al. (2020) had recommended. Specifically: at compute-optimality, parameters N and training tokens D should be scaled up in roughly equal proportion, which works out to about 20 tokens per parameter, i.e. D ≈ 20N (equivalently N ≈ D / 20).
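
A minimal sketch of that allocation rule, assuming the common C ≈ 6ND estimate of training FLOPs (an approximation for illustration, not the fitted Chinchilla law itself):

import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into (params, tokens) at ~20 tokens per parameter.

    Uses C = 6*N*D with D = r*N, so N = sqrt(C / (6*r)) and D = r*N.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Chinchilla's approximate budget (~5.9e23 FLOPs) recovers ~70B params and ~1.4T tokens.
n, d = chinchilla_allocation(5.9e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")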

The empirical demonstration was striking. Hoffmann et al.'s 70B-parameter Chinchilla model, trained on 1.4 trillion tokens, substantially outperformed DeepMind's earlier 280B-parameter Gopher, trained on 300B tokens, using roughly the same training compute. The result reshaped frontier-model training practice almost immediately.
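
A back-of-the-envelope check of the equal-compute claim, again using the C ≈ 6ND approximation (the exact figures reported in the paper differ slightly):

def train_flops(n_params, n_tokens):
    """Approximate training compute via the standard 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

print(f"Chinchilla (70B params, 1.4T tokens): {train_flops(70e9, 1.4e12):.1e} FLOPs")
print(f"Gopher (280B params, 300B tokens): {train_flops(280e9, 300e9):.1e} FLOPs")
# Both land around 5e23-6e23 FLOPs, i.e. the same compute ballpark.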

The original Kaplan recommendations prioritised parameters over data, so models trained under them were data-starved for their size. Chinchilla's conclusion was that, at any practical compute budget, existing models had been substantially undertrained on data and oversized in parameters.

LLaMA (2023), LLaMA 2 (2023), Mistral (2023), Falcon (2023) and most subsequent open-weights models train at or beyond the Chinchilla token-to-parameter ratio. Frontier closed models (GPT-4, Claude 3, Gemini 1.5) are widely believed to do the same, though specific training details are not disclosed. The 2024 generation of "data-rich" recipes, which use well over 100 tokens per parameter, far past the roughly 20-tokens-per-parameter Chinchilla optimum, has explored whether overtraining smaller models is worth the extra training compute in exchange for cheaper inference.
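
To show where different recipes sit relative to the ~20:1 heuristic, the sketch below computes tokens-per-parameter ratios. The Chinchilla and Gopher figures are from the paper; the "data-rich" configuration is a hypothetical 2024-style recipe included only for scale.

def tokens_per_param(n_params, n_tokens):
    """Tokens-per-parameter ratio; ~20 is the Chinchilla compute-optimal heuristic."""
    return n_tokens / n_params

configs = [
    ("Chinchilla 70B (compute-optimal)", 70e9, 1.4e12),
    ("Gopher 280B (Kaplan-era)", 280e9, 300e9),
    ("Hypothetical data-rich 8B (2024-style)", 8e9, 15e12),
]
for name, n, d in configs:
    print(f"{name}: ~{tokens_per_param(n, d):.0f} tokens per parameter")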

Related terms: Scaling Laws, Jared Kaplan
