15.19 End-to-end recipe

We close with pseudocode for a full pipeline, scoped to be achievable with modest resources (a single 24 GB GPU). The aim is not a frontier model but a complete walk-through of the modern recipe at small scale.

Stage 1: train a 100M GPT-2 from scratch

# Configuration
config = {
    "n_layer": 12, "n_head": 12, "n_embd": 768,
    "vocab_size": 50_257, "block_size": 1024,
    "batch_size": 16, "grad_accum": 32,  # effective batch 512
    "lr": 6e-4, "min_lr": 6e-5,
    "warmup_steps": 2000, "max_steps": 100_000,
    "weight_decay": 0.1, "beta1": 0.9, "beta2": 0.95,
    "grad_clip": 1.0,
    "data": "fineweb-edu",  # ~10 B tokens, high quality web
}

# Model: standard decoder-only Transformer
model = GPT(config).to(device).to(torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=config["lr"],
                  betas=(config["beta1"], config["beta2"]),
                  weight_decay=config["weight_decay"])
scheduler = cosine_with_warmup(optimizer, config["warmup_steps"], config["max_steps"])

# Training loop with gradient accumulation
for step in range(config["max_steps"]):
    loss = 0.0
    for _ in range(config["grad_accum"]):
        x, y = next(data_iter)              # (B, T) and (B, T) shifted
        logits = model(x)                    # (B, T, V)
        l = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)) / config["grad_accum"]
        l.backward()
        loss += l.item()
    torch.nn.utils.clip_grad_norm_(model.parameters(), config["grad_clip"])
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % 100 == 0:
        log(step=step, loss=loss, lr=scheduler.get_last_lr())
    if step % 5000 == 0:
        save_checkpoint(model, optimizer, step)

At 100 K steps with batch 512 × 1024 ≈ 524 K tokens per step, this trains on ~52 B tokens, far more than the ~2 B tokens that Chinchilla scaling (roughly 20 tokens per parameter) would call compute-optimal for a 100 M-parameter model; over-training a small model like this is deliberate, since it buys better quality per parameter at inference time. Expect a final perplexity around 15 on held-out FineWeb-Edu, and recognisable but unreliable text generation. On a single H100 this runs in about 24 hours; on a 24 GB consumer card, expect proportionally longer.
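
The cosine_with_warmup helper above is not a built-in; a minimal sketch of what it is assumed to do (linear warmup to the peak lr, then cosine decay, with the floor expressed as a ratio of the peak so that 0.1 matches lr=6e-4 and min_lr=6e-5) is:

import math
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(optimizer, warmup_steps, max_steps, min_lr_ratio=0.1):
    # Linear warmup from 0 to the peak lr, then cosine decay down to
    # min_lr_ratio * peak lr over the remaining steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        return min_lr_ratio + 0.5 * (1.0 - min_lr_ratio) * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)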

Stage 2: LoRA fine-tune for instruction following

# Load pre-trained base
base = load_pretrained("./checkpoints/step_100000.pt")

# Inject LoRA adapters
for name, module in base.named_modules():
    if isinstance(module, nn.Linear) and ("attn" in name or "mlp" in name):
        replace_with_lora(module, rank=16, alpha=32)

# Freeze base; only LoRA params trainable
for name, p in base.named_parameters():
    p.requires_grad = "lora" in name
print(f"Trainable: {sum(p.numel() for p in base.parameters() if p.requires_grad):,}")
# roughly 1-2% of base params at rank 16

# SFT data: chat-formatted (instruction, response) pairs
sft_data = load_dataset("HuggingFaceH4/ultrachat_200k")

# Training: standard cross-entropy, masked to response tokens
optimizer = AdamW([p for p in base.parameters() if p.requires_grad], lr=2e-4)
for epoch in range(3):
    for batch in sft_data:
        formatted = chat_template(batch)  # apply <|user|>...<|assistant|>... tokens
        x, y, response_mask = tokenize_and_mask(formatted)
        logits = base(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), reduction="none")
        loss = (loss * response_mask.view(-1)).sum() / response_mask.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Save LoRA adapters only (~few MB)
save_lora_adapters(base, "./adapters/sft")

After 3 epochs on 200 K examples, the resulting model follows instructions, applies chat formatting, and refuses obvious harmful requests with reasonable consistency. The LoRA file is around 8 MB; the full fine-tune would be ~400 MB. This stage takes a few hours on a 24 GB GPU.
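
The replace_with_lora call above is left abstract. A minimal sketch of the module it might install, assuming the standard LoRA parameterisation (frozen base weight plus a scaled low-rank update, with the up-projection initialised to zero so training starts exactly at the base model), is:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base linear plus a trainable low-rank update: W x + (alpha / rank) * B A x.
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        self.scale = alpha / rank
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

Naming the new tensors lora_A and lora_B is what lets the requires_grad filter in the fine-tuning loop ("lora" in name) pick them out while everything else stays frozen.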

Stage 3: DPO preference optimisation

# Load SFT model (with merged LoRA), use as both policy and reference
policy = load_with_adapters(base, "./adapters/sft")
reference = deepcopy(policy)
for p in reference.parameters():
    p.requires_grad = False

# Add fresh LoRA adapters to policy for DPO update
add_lora(policy, rank=16, alpha=32)

# Preference data: (prompt, chosen, rejected)
pref_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

beta = 0.1
optimizer = AdamW([p for p in policy.parameters() if p.requires_grad], lr=5e-6)

for batch in pref_data:
    x, y_w, y_l = batch["prompt"], batch["chosen"], batch["rejected"]

    log_pi_w   = log_prob(policy,    x, y_w)
    log_pi_l   = log_prob(policy,    x, y_l)
    log_ref_w  = log_prob(reference, x, y_w)
    log_ref_l  = log_prob(reference, x, y_l)

    # Implicit reward differences
    delta_w = beta * (log_pi_w - log_ref_w)
    delta_l = beta * (log_pi_l - log_ref_l)

    # DPO loss
    loss = -F.logsigmoid(delta_w - delta_l).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

save_lora_adapters(policy, "./adapters/dpo")

Two passes over ~60 K preference pairs take one to two hours of GPU time. The output is a chat model whose preference for "helpful and accurate" over "rambling or wrong" responses has been reinforced.
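
The log_prob helper is where the per-sequence quantities in the DPO loss come from. A minimal sketch, assuming prompt and completion are already token-ID tensors with no padding and that only completion tokens are scored, is:

import torch

def log_prob(model, prompt_ids, completion_ids):
    # Summed log-likelihood of the completion, conditioned on the prompt.
    # prompt_ids: (B, P), completion_ids: (B, C), already tokenised, no padding.
    ids = torch.cat([prompt_ids, completion_ids], dim=1)            # (B, P+C)
    logits = model(ids)[:, :-1, :]                                  # position t predicts token t+1
    targets = ids[:, 1:]                                            # (B, P+C-1)
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)                      # (B, P+C-1)
    return token_logp[:, -completion_ids.size(1):].sum(dim=-1)     # keep completion positions only

Gradients flow through this for the policy; the reference model's parameters are frozen, so its two calls per batch contribute constants to the loss.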

Stage 4: evaluation

def evaluate(model):
    results = {}
    # MMLU-Pro: multiple choice, log-likelihood ranking
    results["mmlu_pro"] = mmlu_pro_eval(model)
    # GSM8K: math word problems, exact-match on final answer
    results["gsm8k"]    = gsm8k_eval(model, n_shot=8, cot=True)
    # IFEval: instruction following, rule-based checks
    results["ifeval"]   = ifeval(model)
    # MT-Bench: pairwise judging by a stronger judge model
    results["mt_bench"] = mt_bench(model, judge="gpt-4o")
    return results

base_model_results   = evaluate(base)        # the raw base
sft_model_results    = evaluate(sft_model)   # after SFT
dpo_model_results    = evaluate(dpo_model)   # after DPO

print_table(
    rows=["base", "sft", "dpo"],
    columns=["mmlu_pro", "gsm8k", "ifeval", "mt_bench"],
    data=[base_model_results, sft_model_results, dpo_model_results],
)

For a 100 M model trained for one day, expect very modest absolute numbers (MMLU-Pro near random, GSM8K under 10%) but a clear monotonic improvement: base < SFT < DPO. The point of running this end to end is to internalise that the recipe (pre-train → SFT → preferences → evaluate) is the same at 100 M and at 100 B; the only difference is the resources.
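
The log-likelihood ranking used for MMLU-Pro deserves one line of code: each answer option is scored by the model's log-probability of the option text given the question, and the highest-scoring option is the prediction. A minimal sketch, reusing the log_prob helper from Stage 3 and a hypothetical tokenize function, is:

def rank_choices(model, question, choices, tokenize):
    # Score each candidate answer by its summed log-likelihood under the model,
    # conditioned on the question; return the index of the best-scoring choice.
    prompt_ids = tokenize(question)                    # (1, P)
    scores = []
    for choice in choices:
        choice_ids = tokenize(" " + choice)            # (1, C)
        scores.append(log_prob(model, prompt_ids, choice_ids).item())
    return max(range(len(choices)), key=lambda i: scores[i])

In practice the summed log-likelihood is often length-normalised so that longer options are not penalised; the sketch omits that.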

What you skipped

A real frontier run would add: proper data curation, mid-training, longer pre-training, a much larger SFT set, multiple stages of preference and verifiable-reward RL, process supervision, tool-use training, multimodal extensions, long-context fine-tuning, and distillation to smaller serving models. Each of these is by now a well-documented sub-recipe. The core loop, however (predict tokens, fit instructions, fit preferences, evaluate), is exactly what is on this page.
