References

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, & Laurent Sifre (2022)

arXiv.

DOI: https://doi.org/10.48550/arXiv.2203.15556

Abstract. The Chinchilla paper. Shows that, for a fixed compute budget, model size and training data should be scaled together, with roughly 20 training tokens per parameter, rather than simply making models larger. Showed that GPT-3 and other large models of the time were significantly under-trained, and shifted the compute-optimal scaling frontier.
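A rough back-of-the-envelope sketch of the rule of thumb above. It assumes the common approximation C ≈ 6ND for dense-transformer training FLOPs (not stated in this entry) together with the ~20 tokens-per-parameter guideline; both are simplifications of the paper's fitted scaling laws, and the function name and budget figure are illustrative only.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return a rough (parameters, tokens) split for a fixed compute budget.

    Assumes C ~= 6 * N * D training FLOPs and D ~= tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # At roughly Gopher's training budget (~5.8e23 FLOPs), the rule suggests
    # a ~70B-parameter model trained on ~1.4T tokens -- close to Chinchilla itself.
    params, tokens = chinchilla_optimal(5.76e23)
    print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
```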

Tags: scaling language-models chinchilla

Cited in:
