15.3 The pre-training recipe
Pre-training a frontier large language model is a months-long, billion-dollar undertaking. Described in a single sentence, the recipe is short: take a decoder-only Transformer, feed it trillions of tokens of carefully filtered text, and minimise the next-token cross-entropy with AdamW. Yet every word in that sentence hides a thicket of engineering and judgement. Which tokens, in what proportion, with what tokeniser? How wide is the model, how deep, how many heads? At what learning rate, with what warmup, on what cluster, with what parallelism strategy? How do you know the run is healthy at hour 240 of 2400, and what do you do if the loss spikes at hour 1900?
Where §15.2 surveyed emergent abilities (what comes out of pre-training), this section is about what goes in. §15.4 picks up with supervised fine-tuning. By the end you should be able to read a frontier-lab pre-training paper and to estimate, to within a factor of two, what a given training run cost.
The Chinchilla recipe
For a long stretch of the scaling era, the field believed that more parameters were always better. GPT-3 had 175 B parameters and was trained on roughly 300 B tokens; the conventional wisdom in 2020 was that the bottleneck was model size, not data. Hoffmann and colleagues at DeepMind upended that view in their 2022 Chinchilla paper. They trained over four hundred small and medium models across a grid of $(N, T)$ values, fitted a parametric loss surface, and asked: at a fixed compute budget $C$, what allocation of compute to parameters versus tokens minimises pre-training loss?
The answer was that optimal $N$ and $T$ scale together, each roughly as $\sqrt{C}$. Concretely, the ratio $T/N \approx 20$ tokens per parameter is close to optimal for the loss surfaces they fitted. By that yardstick, GPT-3 (175 B params, 300 B tokens, ratio 1.7) was severely undertrained. Chinchilla itself was a 70 B-parameter model trained on 1.4 T tokens, using roughly the same compute as Gopher (280 B params, 300 B tokens) but achieving substantially lower loss and better downstream performance.
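As a sanity check on the arithmetic, here is a small sketch that splits a compute budget under the rule of thumb above, using the ${\sim}6NT$ FLOP estimate discussed below; the function name and the example budget (roughly the Chinchilla run, back-computed from its published $N$ and $T$) are illustrative.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget C ~ 6*N*T between parameters N and tokens T
    under the Chinchilla rule of thumb T/N ~ 20 (both scale as sqrt(C))."""
    # C = 6 * N * T and T = r * N  =>  N = sqrt(C / (6 r)), T = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla-class budget, back-computed as 6 * 70e9 * 1.4e12 ~ 5.8e23 FLOPs
n, t = chinchilla_optimal(5.8e23)
print(f"params ~ {n/1e9:.0f} B, tokens ~ {t/1e12:.1f} T")   # ~70 B params, ~1.4 T tokens
```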
The Chinchilla paper changed practice immediately. Llama, Llama 2, Llama 3, Mistral, Qwen, DeepSeek and the other open frontier families all train far more tokens per parameter than GPT-3 did, with ratios from 20 up into the hundreds or thousands for smaller models that will see heavy inference traffic. The compute-optimal point is not the deployment-optimal point: if you intend to serve a model billions of times, it is rational to overtrain a smaller model relative to Chinchilla, because the inference savings dwarf the extra training cost. Llama 3-8B was trained on 15 T tokens, a ratio of nearly 2000, far past the Chinchilla optimum, precisely for this reason.
The ${\sim}6NT$ FLOP estimate that underwrites all of this comes from a back-of-the-envelope account of the forward and backward passes through a Transformer: roughly two FLOPs per parameter per token for the forward pass and four for the backward pass, with attention-over-sequence and other sub-leading terms neglected. It is accurate to within a few percent for dense Transformers at typical context lengths and is the unit of currency in every pre-training discussion.
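Turning the estimate into a cost figure takes a few more assumed numbers. The sketch below uses an assumed per-GPU peak throughput, an assumed sustained MFU (see the Parallelism discussion) and an assumed price per GPU-hour; all three, and the helper name itself, are illustrative rather than quoted figures.

```python
def training_cost(n_params, n_tokens,
                  peak_flops_per_gpu=0.99e15,  # assumed H100 BF16 dense peak, ~989 TFLOP/s
                  mfu=0.45,                    # assumed sustained model-FLOPs utilisation
                  usd_per_gpu_hour=2.50):      # assumed amortised price, purely illustrative
    """Back-of-the-envelope cost of a dense pre-training run using C ~ 6*N*T."""
    total_flops = 6.0 * n_params * n_tokens
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    gpu_hours = gpu_seconds / 3600.0
    return total_flops, gpu_hours, gpu_hours * usd_per_gpu_hour

# Llama 3-8B-style run from the text: 8 B params, 15 T tokens
flops, hours, usd = training_cost(8e9, 15e12)
print(f"{flops:.2e} FLOPs, {hours/1e6:.2f} M GPU-hours, ~${usd/1e6:.1f} M")
```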
Data mix
The single biggest determinant of model quality, holding architecture and compute fixed, is the training corpus. The naive approach of training on raw Common Crawl was abandoned around 2021. Modern pre-training corpora, typically 10–30 trillion tokens after filtering, are aggressively curated.
The mix that has converged is something like this. The bulk of tokens still come from web crawl: filtered Common Crawl, processed through a pipeline of language identification, perplexity filtering, and quality classifiers that score each document against a "high-quality reference" set such as Wikipedia and selected books. Code is heavily over-represented relative to its share of the open web, because code teaches structured reasoning, exact syntax and long-range dependency tracking. GitHub permissive-licence repositories, deduplicated and stripped of generated files, contribute on the order of 1–3 trillion tokens. Books, both via licensed corpora and via openly licensed collections, add another high-quality slice. Academic papers from arXiv and PubMed contribute the dense, technical register that downstream evaluations reward. Wikipedia, Stack Exchange and government documents fill in encyclopaedic and conversational coverage. Increasingly, synthetic data generated by smaller models, typically with a verification step that rejects outputs failing a unit test, a math checker or a critic model, is added on top, particularly for code and mathematics where verification is cheap.
Three steps in the pipeline are easy to underrate. Deduplication is the first. Identical and near-identical documents inflate apparent corpus size, encourage rote memorisation, and waste optimisation steps. Two-stage dedup is standard: an SHA-256 hash of normalised text removes exact duplicates, and MinHash-LSH over 5-grams removes near-duplicates above a Jaccard threshold of around 0.8. The number of "tokens" you nominally trained on can fall by a factor of two or three after this step. The second is decontamination: removing any document that contains a verbatim or close match to known evaluation prompts, to prevent the model from scoring well on benchmarks for the wrong reasons. The third is quality filtering. The Phi series of models from Microsoft demonstrated that synthetic, textbook-style data, generated and filtered for clarity and pedagogy, can give a small model the per-token education of a much larger raw-web corpus. A 10% improvement in data quality, measured as held-out loss per token, will routinely save 30% of compute for the same downstream performance. The phrase "data is the new compute" is overused, but the underlying observation, that data quality has more leverage than the field gives it credit for, is correct.
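A self-contained sketch of the two-stage idea follows: exact deduplication via a hash of normalised text, then near-deduplication via MinHash signatures over word 5-grams with the Jaccard-style threshold mentioned above. Production pipelines use banded LSH and operate over billions of documents; this version compares signatures pairwise and is for illustration only.

```python
import hashlib
import random

NUM_HASHES = 128
PRIME = (1 << 61) - 1
random.seed(0)
# Each MinHash function is a random affine map (a*h + b) mod PRIME over shingle hashes.
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(0, PRIME)) for _ in range(NUM_HASHES)]

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def exact_key(text: str) -> str:
    return hashlib.sha256(normalise(text).encode()).hexdigest()

def shingles(text: str, n: int = 5):
    words = normalise(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text: str):
    hashed = [int(hashlib.md5(s.encode()).hexdigest()[:15], 16) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in HASH_FUNCS]

def est_jaccard(sig1, sig2) -> float:
    return sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES

def dedup(docs, threshold=0.8):
    """Stage 1: drop exact duplicates. Stage 2: drop documents whose estimated
    Jaccard similarity to an already-kept document exceeds the threshold."""
    seen, kept, kept_sigs = set(), [], []
    for doc in docs:
        key = exact_key(doc)
        if key in seen:
            continue
        seen.add(key)
        sig = minhash(doc)
        if any(est_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```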
Mixing proportions are tuned empirically. Frontier labs run small-scale ablations at, say, the 1 B parameter scale, varying the share of web, code, papers, books and synthetic data, and pick the mix whose 1 B-parameter loss extrapolates best along the scaling laws to the target scale. The proportions that emerge from these sweeps differ across labs, but the qualitative shape is shared: web is downweighted relative to its raw share, code and curated mathematics are upweighted, and a small but disciplined slice of synthetic data improves reasoning benchmarks.
Tokenisation
Modern LLMs use subword tokenisers, most commonly byte-pair encoding (BPE); WordPiece and SentencePiece's unigram model are the main alternatives. BPE has a simple shape: start with a base alphabet of bytes or Unicode characters, greedily merge the most frequent adjacent pair, and repeat until the vocabulary reaches a target size, typically between 32 K and 256 K tokens. The result is a vocabulary that represents common English words as single tokens, common subwords as single tokens, and rare strings as compositions of shorter tokens or, in the byte-fallback case, as raw bytes.
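A minimal sketch of the BPE merge loop on a toy word-frequency table, assuming whitespace pre-tokenisation; production tokenisers operate on bytes, cache pair counts and handle pre-tokenisation rules, but the greedy most-frequent-pair merge at the core is the same.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # greedily take the most frequent pair
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

print(train_bpe({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, num_merges=4))
```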
The choice of tokeniser has more consequence than newcomers expect. A tokeniser tuned only on English will tokenise a Chinese or Arabic document into four or five times as many tokens as a multilingual tokeniser, with corresponding compute and inference costs. Modern tokenisers are deliberately multilingual, include explicit support for code (whitespace-aware tokens, indentation runs as single tokens), incorporate Unicode mathematical operators rather than fragmenting them into bytes, and reserve special tokens for chat structure: turn boundaries, system prompts, tool-call delimiters, image patches.
Byte-fallback matters more than it sounds. If your tokeniser cannot encode a Unicode character directly, you need a fallback to raw bytes; without it, obscure characters map to an unknown-token placeholder and the encode–decode round trip is no longer lossless. A model that cannot perfectly encode and decode arbitrary input strings is brittle in ways that are hard to debug at deployment. Most modern tokenisers are either byte-level by construction or carry an explicit byte-fallback, which avoids the issue altogether.
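Both effects, per-language fertility and lossless round-tripping, are easy to measure against any byte-level tokeniser. The sketch below happens to use the cl100k_base encoding shipped with the tiktoken package; the specific package, encoding and sample strings are incidental.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a byte-level BPE vocabulary

samples = {
    "english": "The model was trained on fifteen trillion tokens of text.",
    "chinese": "该模型在十五万亿个文本词元上进行了训练。",
    "emoji_and_rare": "naïve façade 🤖 𝔘𝔫𝔦𝔠𝔬𝔡𝔢",
}

for name, text in samples.items():
    tokens = enc.encode(text)
    # Tokens per character differ sharply by script; a more multilingual
    # vocabulary lowers the count for non-English text.
    print(f"{name:>15}: {len(tokens):3d} tokens, {len(tokens)/len(text):.2f} tokens/char")
    # Byte-level construction guarantees a lossless encode/decode round trip.
    assert enc.decode(tokens) == text
```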
A subtler issue is tokeniser drift. Once a model has been pre-trained, its tokeniser is effectively immutable: changing it after the fact requires retraining the input and output embedding matrices and, in practice, retraining most of the model. This locks in early decisions about vocabulary size, multilingual coverage and special tokens for the lifetime of the model family.
Hyperparameters
The hyperparameter recipe has converged across labs. Learning rate sits at around $3 \times 10^{-4}$ for medium-scale models, decreasing modestly with scale; very large models often peak at $1$–$2 \times 10^{-4}$. The schedule is almost universally a short linear warmup of a few thousand steps, followed by cosine decay to roughly 10% of peak by the end of training. The optimiser is AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$ (a lower second-moment decay than the original Adam default of 0.999, which helps stabilise long runs) and $\epsilon = 10^{-8}$. Weight decay is typically 0.1, applied in AdamW's decoupled form. Gradient clipping at a global norm of 1.0 is standard and catches most incipient loss spikes before they propagate.
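In PyTorch the whole paragraph compresses to a few lines. The model, step counts and peak learning rate below are placeholders; the decay floor of 10% of peak matches the schedule described above.

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)      # placeholder for the real Transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                              # peak learning rate
    betas=(0.9, 0.95),                    # lower beta2 than Adam's 0.999 default
    eps=1e-8,
    weight_decay=0.1,                     # decoupled weight decay
)

warmup_steps, total_steps, min_ratio = 2_000, 250_000, 0.10

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay to 10% of the peak learning rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_ratio + (1.0 - min_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # global-norm clipping
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```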
Batch size, measured in tokens per step rather than sequences per step, is on the order of 1–8 million tokens for frontier runs. Larger batches improve throughput but degrade per-token learning efficiency beyond a critical batch size, and that critical size grows as the loss falls; this is why batch sizes are ramped up over the run rather than starting at the maximum. Mixed-precision training in BF16 for storage and compute, with FP32 master weights and FP32 reductions, has been standard since 2022. FP8 mixed precision, pioneered at scale by DeepSeek-V3, is increasingly common and roughly halves memory and compute cost in the matmuls that dominate training.
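Tokens per step is easiest to see as plain arithmetic. The sequence length, accumulation depth and data-parallel width below are illustrative (the data-parallel width anticipates the configuration discussed under Parallelism).

```python
# How a "~4 M tokens per step" global batch decomposes (illustrative numbers):
seq_len          = 8_192     # tokens per sequence
micro_batch      = 1         # sequences per GPU per forward/backward
grad_accum_steps = 8         # micro-batches accumulated before an optimizer step
data_parallel    = 64        # model replicas

tokens_per_step = seq_len * micro_batch * grad_accum_steps * data_parallel
print(f"{tokens_per_step:,} tokens per optimizer step")   # 4,194,304 ~ 4.2 M
```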
Annealing has become a recipe element of its own. The final 5–10% of training reduces the learning rate sharply and switches the data mix to a high-quality "annealing mix", often the cleanest synthetic data, the most curated math and code, and held-out reference text. A surprising fraction of the headline benchmark gains in 2024–2025 frontier models came from the annealing phase rather than the base of the run. Long-context fine-tuning happens here too: the context length is extended from a training-time 8 K or 32 K up to 128 K or 1 M, using a long-context-specific data mix and often modified positional embeddings (NTK-aware scaling, YaRN).
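For the positional-embedding side of context extension, one commonly used form of NTK-aware scaling simply enlarges the rotary base; the formula below is a sketch of that adjustment, and the context lengths, head dimension and original base are placeholder values.

```python
def ntk_scaled_rope_base(base: float, orig_ctx: int, target_ctx: int, head_dim: int) -> float:
    """One common form of NTK-aware RoPE scaling: grow the rotary base so that
    low-frequency dimensions are interpolated while high-frequency ones stay
    nearly intact, instead of stretching every frequency equally."""
    scale = target_ctx / orig_ctx
    return base * scale ** (head_dim / (head_dim - 2))

# Extending an 8 K-context model (base 10 000, 128-dim heads) to 128 K:
print(ntk_scaled_rope_base(10_000.0, 8_192, 131_072, 128))   # ~ 1.7e5
```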
Parallelism
Frontier models do not fit on one GPU. A 70 B-parameter model in BF16 alone requires 140 GB of weights, plus optimiser state and activations; a single H100 has 80 GB. Training a frontier model therefore requires a parallelism strategy that distributes the model and the data across thousands of GPUs while keeping them busy.
Five strategies are now standard, and the largest runs combine all five. Data parallelism replicates the entire model across GPU groups and shards each batch across replicas; gradients are all-reduced across replicas at each step. It scales with batch size. Tensor parallelism splits each weight matrix across a small set of GPUs, typically four or eight within a node, communicating intermediate activations via NVLink or NVSwitch at every layer. It is bandwidth-hungry and is therefore confined to within-node groups. Pipeline parallelism partitions layers across GPU stages and feeds them micro-batches in a scheduled order; the inter-stage communication is small but introduces "bubbles" that have to be hidden by careful scheduling. Expert parallelism routes tokens to the GPUs that host their assigned experts in a mixture-of-experts model; routing is sparse, all-to-all, and one of the harder communication patterns to schedule efficiently. ZeRO and its descendant FSDP (fully sharded data parallel) shard optimiser state, gradients and parameters across data-parallel replicas, gathering and scattering them on demand; ZeRO-3 in particular makes 70 B-class training feasible at modest tensor-parallel widths.
A typical 4096-GPU configuration for a frontier dense run might use tensor parallelism of 8 within a node, pipeline parallelism of 8 across eight nodes, and data parallelism of 64 across the resulting 64 eight-node model replicas, with FSDP layered on top of the data dimension. For an MoE model, expert parallelism replaces or augments tensor parallelism for the expert layers. Sequence parallelism, which shards activations along the sequence dimension, has become important for long-context training where activation memory dominates.
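How the three dimensions compose is easiest to see as index arithmetic: with tensor parallelism innermost (so each group stays inside a node), every global rank maps to one coordinate in a $64 \times 8 \times 8$ grid. The ordering convention below is one common choice, not the only one.

```python
TP, PP, DP = 8, 8, 64          # tensor-, pipeline-, data-parallel widths
assert TP * PP * DP == 4096

def parallel_coords(rank: int):
    """Map a global rank to (data, pipeline, tensor) coordinates,
    with tensor parallelism innermost so each TP group stays inside one node."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

# Ranks 0-7 share a node and form one tensor-parallel group;
# ranks 0, 8, 16, ..., 56 form one pipeline chain; each block of 64 ranks is one model replica.
print(parallel_coords(0), parallel_coords(7), parallel_coords(8), parallel_coords(64))
```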
The engineering payoff of all this is measured in model FLOPs utilisation (MFU): the fraction of theoretical peak FLOPs the cluster actually delivers. A poorly tuned stack achieves 20% MFU; a well-tuned one achieves 50–55% on H100s; FP8-aware stacks have pushed past 60%. Every five percentage points of MFU is days of wall-clock and millions of dollars on a frontier run.
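MFU itself is one line of arithmetic once you adopt the same ${\sim}6NT$ accounting used for the training budget. The throughput in the example is illustrative, and the per-GPU peak is the commonly quoted dense BF16 figure for an H100, which should be treated as an assumption.

```python
def model_flops_utilisation(n_params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float = 0.99e15) -> float:
    """MFU = model FLOPs actually delivered / theoretical peak of the cluster,
    counting model FLOPs with the same ~6 FLOPs per parameter per token rule."""
    delivered = 6.0 * n_params * tokens_per_second
    return delivered / (num_gpus * peak_flops_per_gpu)

# Illustrative: a 70 B dense model sustaining 5 M tokens/s on 4096 H100s
print(f"MFU = {model_flops_utilisation(70e9, 5e6, 4096):.0%}")   # ~52%
```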
Monitoring
A frontier pre-training run lasts weeks to months on a cluster whose individual GPUs fail every few thousand hours. Monitoring is not optional. The standard dashboard logs training loss, validation loss on a held-out mix, gradient norms (global and per-layer), weight norms, the learning rate, the tokens-per-second throughput, and a small panel of cheap downstream evaluations run every few thousand steps. Healthy runs show loss declining smoothly on a log-log plot, gradient norms stable around their clipping threshold, and weight norms growing slowly.
Loss spikes are the canonical failure mode. A spike that recovers within a few hundred steps is usually benign and can be left alone; a spike that diverges requires intervention. The standard playbook is to roll back to the most recent healthy checkpoint, skip the data shard that caused the spike (it is often a single pathological document), and resume. Hardware failures (a GPU dying, a node losing networking, a switch flapping) are continuous; the orchestration layer must restart workers, rebuild the parallelism topology, and resume from checkpoint without operator intervention. Frontier labs spend as much engineering effort on the supervisor and checkpointing system as on the model code itself.
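A sketch of the decision logic, not any lab's actual supervisor: compare each logged loss to a trailing median, flag a spike, and escalate to a rollback only if the loss has not recovered within a patience window. The window sizes and thresholds are placeholders, and the checkpoint and data-loader plumbing is left out.

```python
from collections import deque
from statistics import median

class SpikeMonitor:
    """Flag steps whose loss exceeds a trailing median by a fixed margin and
    escalate to rollback if the loss has not recovered within a patience window."""

    def __init__(self, window=500, spike_margin=0.5, patience=300):
        self.history = deque(maxlen=window)
        self.spike_margin = spike_margin
        self.patience = patience
        self.steps_since_spike = None

    def update(self, loss: float) -> str:
        baseline = median(self.history) if self.history else loss
        if loss > baseline + self.spike_margin:
            self.steps_since_spike = 0 if self.steps_since_spike is None else self.steps_since_spike + 1
            if self.steps_since_spike > self.patience:
                return "rollback"        # not recovering: restore checkpoint, skip the data shard
            return "spike"               # elevated but within patience: keep watching
        self.history.append(loss)        # only healthy losses feed the baseline
        self.steps_since_spike = None
        return "ok"

# monitor = SpikeMonitor()
# action = monitor.update(loss.item())   # "ok" | "spike" | "rollback"
```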
What you should take away
- The Chinchilla recipe says compute should be split roughly evenly between making the model bigger and training it on more tokens, with $T/N \approx 20$ as a starting point for compute-optimal training; deployment-optimal models are deliberately overtrained past this ratio.
- Data is the dominant lever once architecture and compute are fixed: deduplication, decontamination, quality filtering and a deliberately tuned mix of web, code, books, papers and synthetic data routinely give 30% compute savings for the same loss.
- The tokeniser is locked in for the lifetime of a model family; choose multilingual byte-level BPE with code- and math-aware special tokens, and a vocabulary in the 100 K–256 K range.
- The hyperparameter recipe (AdamW with $\beta_2 = 0.95$, weight decay 0.1, gradient clip 1.0, cosine schedule with linear warmup, BF16 or FP8 mixed precision) has converged, and most innovation now happens in the data and the annealing phase rather than the optimiser.
- Frontier training combines data, tensor, pipeline, expert and ZeRO/FSDP parallelism on clusters of thousands of GPUs; the engineering goal is sustained MFU above 50%, and the operational goal is a supervisor that keeps the run alive through inevitable hardware and data failures.