15.19 End-to-end recipe
We close with pseudocode for a full pipeline, scoped to be achievable with modest resources (a single 24 GB GPU). The aim is not a frontier model but a complete walk-through of the modern recipe at small scale.
Stage 1: train a 100M GPT-2 from scratch
# Configuration
config = {
    "n_layer": 12, "n_head": 12, "n_embd": 768,
    "vocab_size": 50_257, "block_size": 1024,
    "batch_size": 16, "grad_accum": 32,   # effective batch 512
    "lr": 6e-4, "min_lr": 6e-5,
    "warmup_steps": 2000, "max_steps": 100_000,
    "weight_decay": 0.1, "beta1": 0.9, "beta2": 0.95,
    "grad_clip": 1.0,
    "data": "fineweb-edu",                # ~10 B tokens, high-quality web
}

# Model: standard decoder-only Transformer
model = GPT(config).to(device).to(torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=config["lr"],
                  betas=(config["beta1"], config["beta2"]),
                  weight_decay=config["weight_decay"])
scheduler = cosine_with_warmup(optimizer, config["warmup_steps"], config["max_steps"])

# Training loop with gradient accumulation
V = config["vocab_size"]
for step in range(config["max_steps"]):
    loss = 0.0
    for _ in range(config["grad_accum"]):
        x, y = next(data_iter)            # (B, T) inputs and (B, T) shifted targets
        logits = model(x)                 # (B, T, V)
        l = F.cross_entropy(logits.view(-1, V), y.view(-1)) / config["grad_accum"]
        l.backward()
        loss += l.item()
    torch.nn.utils.clip_grad_norm_(model.parameters(), config["grad_clip"])
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % 100 == 0:
        log(step=step, loss=loss, lr=scheduler.get_last_lr()[0])
    if step % 5000 == 0:
        save_checkpoint(model, optimizer, step)
At 100 K steps with batch 512 × 1024 ≈ 524 K tokens per step, this trains on ~52 B tokens: roughly five passes over the 10 B-token subset, and far beyond the Chinchilla-optimal ~2 B tokens (≈20 tokens per parameter) for a 100 M-parameter model; over-training small models like this is standard practice. Expect a final perplexity around 15 on held-out FineWeb-Edu, and recognisable but unreliable text generation. On a single H100 this runs in about 24 hours; on a 24 GB consumer GPU, expect proportionally longer.
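The helper cosine_with_warmup is left undefined above. One plausible implementation of the learning-rate multiplier it would produce (linear warmup, then cosine decay to min_lr, assumed here to be lr/10 as in the config; the function name and min_ratio parameter are illustrative):

```python
import math

def cosine_with_warmup_factor(step, warmup_steps, max_steps, min_ratio=0.1):
    """LR multiplier: linear warmup to 1.0, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1.0 -> 0.0
    return min_ratio + (1.0 - min_ratio) * cosine
```

Wrapped in torch's LambdaLR, this yields a scheduler with the step()/get_last_lr() interface the training loop expects.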
Stage 2: LoRA fine-tune for instruction following
# Load pre-trained base
base = load_pretrained("./checkpoints/step_100000.pt")

# Inject LoRA adapters into attention and MLP projections
for name, module in base.named_modules():
    if isinstance(module, nn.Linear) and ("attn" in name or "mlp" in name):
        replace_with_lora(module, rank=16, alpha=32)

# Freeze base; only LoRA params trainable
for name, p in base.named_parameters():
    p.requires_grad = "lora" in name
print(f"Trainable: {sum(p.numel() for p in base.parameters() if p.requires_grad):,}")
# only a few percent of base params

# SFT data: chat-formatted (instruction, response) pairs
sft_data = load_dataset("HuggingFaceH4/ultrachat_200k")

# Training: standard cross-entropy, masked to response tokens
optimizer = AdamW([p for p in base.parameters() if p.requires_grad], lr=2e-4)
for epoch in range(3):
    for batch in sft_data:
        formatted = chat_template(batch)   # apply <|user|>...<|assistant|>... tokens
        x, y, response_mask = tokenize_and_mask(formatted)
        logits = base(x)
        loss = F.cross_entropy(logits.view(-1, V), y.view(-1), reduction="none")
        loss = (loss * response_mask.view(-1)).sum() / response_mask.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Save LoRA adapters only (a few MB)
save_lora_adapters(base, "./adapters/sft")
After 3 epochs on 200 K examples, the resulting model follows instructions, applies chat formatting, and refuses obvious harmful requests with reasonable consistency. The LoRA file is around 8 MB; the full fine-tune would be ~400 MB. This stage takes a few hours on a 24 GB GPU.
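What replace_with_lora does is the usual low-rank trick: the frozen weight is augmented with a trainable rank-r update. A shape-level sketch in NumPy (the function name, initialisation, and scaling convention here are illustrative; real implementations wrap nn.Linear):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=32):
    """Frozen base weight W (d_out, d_in) plus a trainable low-rank
    update: y = x W^T + (alpha / r) * x A^T B^T."""
    r = A.shape[0]                             # LoRA rank
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in = d_out = 768
r = 16
W = np.random.randn(d_out, d_in) * 0.02        # frozen pre-trained weight
A = np.random.randn(r, d_in) * 0.02            # trainable, small random init
B = np.zeros((d_out, r))                       # trainable, zero init
x = np.random.randn(4, d_in)
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
# trainable params per layer: r * (d_in + d_out) = 24,576 vs d_in * d_out = 589,824
```

With B initialised to zero the adapted layer reproduces the base model exactly at the start of fine-tuning, which is the standard LoRA initialisation.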
Stage 3: DPO preference optimisation
# Load SFT model (with merged LoRA); use as both policy and reference
policy = load_with_adapters(base, "./adapters/sft")
reference = deepcopy(policy)
for p in reference.parameters():
    p.requires_grad = False

# Add fresh LoRA adapters to the policy for the DPO update
add_lora(policy, rank=16, alpha=32)

# Preference data: (prompt, chosen, rejected)
pref_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

beta = 0.1
optimizer = AdamW([p for p in policy.parameters() if p.requires_grad], lr=5e-6)
for batch in pref_data:
    x, y_w, y_l = batch["prompt"], batch["chosen"], batch["rejected"]
    log_pi_w = log_prob(policy, x, y_w)
    log_pi_l = log_prob(policy, x, y_l)
    with torch.no_grad():                  # reference is frozen; no gradients needed
        log_ref_w = log_prob(reference, x, y_w)
        log_ref_l = log_prob(reference, x, y_l)
    # Implicit reward margins relative to the reference
    delta_w = beta * (log_pi_w - log_ref_w)
    delta_l = beta * (log_pi_l - log_ref_l)
    # DPO loss: push the chosen margin above the rejected one
    loss = -F.logsigmoid(delta_w - delta_l).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

save_lora_adapters(policy, "./adapters/dpo")
Two passes over ~60 K preference pairs take one to two hours of GPU time. The output is a chat model whose preference for "helpful and accurate" over "rambling or wrong" responses has been reinforced.
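The log_prob helper above is assumed to return the summed log-likelihood of a response given its prompt. A minimal sketch of the per-sequence computation, here operating on raw (T, V) logits in NumPy:

```python
import numpy as np

def sequence_log_prob(logits, targets):
    """Summed log-likelihood of a token sequence: sum_t log p(y_t | y_<t).
    logits: (T, V) pre-softmax scores; targets: (T,) token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_z = np.log(np.exp(shifted).sum(axis=-1))            # log-partition per step
    token_logp = shifted[np.arange(len(targets)), targets] - log_z
    return token_logp.sum()
```

In practice the sum is taken over response tokens only, with prompt positions masked out, mirroring the SFT loss masking in Stage 2.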
Stage 4: evaluation
def evaluate(model):
    results = {}
    # MMLU-Pro: multiple choice, log-likelihood ranking
    results["mmlu_pro"] = mmlu_pro_eval(model)
    # GSM8K: math word problems, exact match on the final answer
    results["gsm8k"] = gsm8k_eval(model, n_shot=8, cot=True)
    # IFEval: instruction following, rule-based checks
    results["ifeval"] = ifeval(model)
    # MT-Bench: responses scored by a stronger judge model
    results["mt_bench"] = mt_bench(model, judge="gpt-4o")
    return results

base_model_results = evaluate(base)        # the raw base
sft_model_results = evaluate(sft_model)    # after SFT
dpo_model_results = evaluate(dpo_model)    # after DPO

print_table(
    rows=["base", "sft", "dpo"],
    columns=["mmlu_pro", "gsm8k", "ifeval", "mt_bench"],
    data=[base_model_results, sft_model_results, dpo_model_results],
)
For a 100 M model trained for one day, expect very modest absolute numbers (MMLU-Pro near random, GSM8K under 10%) but a clear monotonic improvement: base < SFT < DPO. The point of running this end to end is to internalise that the recipe (pre-train → SFT → preferences → evaluate) is the same at 100 M and at 100 B parameters; only the resources differ.
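As one concrete example of the scoring logic above, GSM8K-style exact match can be as simple as pulling the last number out of the model's completion and comparing it to the gold answer (an illustrative sketch; real harnesses normalise more carefully):

```python
import re

def extract_final_answer(text):
    """Pull the last number from a completion for exact-match scoring."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].rstrip(".").replace(",", "") if nums else None

assert extract_final_answer("So she pays 3 * 4 = 12 dollars. The answer is 12.") == "12"
assert extract_final_answer("...a total of 1,234 apples.") == "1234"
```

The chain-of-thought prompt (cot=True) matters here: the model reasons freely, and only the final extracted number is graded.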
What you skipped
A real frontier run would add: proper data curation, mid-training, longer pre-training, a much larger SFT set, multiple stages of preference and verifiable-reward RL, process supervision, tool-use training, multimodal extensions, long-context fine-tuning, and distillation to smaller serving models. Each of these is by now a well-documented sub-recipe. The core loop, however (predict tokens, fit instructions, fit preferences, evaluate), is exactly what is on this page.