15.4 Supervised fine-tuning

A pretrained language model is, in a strict sense, only a model of text. It has been taught to predict the next token in a corpus of trillions of tokens drawn from web pages, books, code repositories, forums and scientific papers. It has not been taught to be helpful. Asked "Write a poem about clouds", a fresh base model is just as likely to continue with a list of related search queries, the boilerplate of a homework worksheet, or another instruction it imagines might follow in the wild. The base model is fluent and knowledgeable; it is also, by default, a sophisticated autocomplete engine rather than an assistant.

Supervised fine-tuning (SFT), sometimes called instruction tuning, is the first stage of post-training. It is the simplest possible bridge between the prediction objective of pretraining and the conversational behaviour expected of a chatbot. The recipe is conceptually trivial: continue training with the same next-token objective, but on a curated dataset of (instruction, response) pairs, with the loss masked to the response tokens only. The model still learns to predict the next token; we have simply changed the distribution it predicts from.

This section covers the first refinement on top of pretraining (§15.3); §15.5 introduces RLHF, which sharpens preferences further. SFT is foundational: every modern instruction-following model passes through it, and for many practical applications a well-executed SFT step is sufficient on its own. RLHF is the polish on top of a good SFT base, not a substitute for it.

Symbols used here

$x$ : the instruction (the prompt the user provides)
$y$ : the desired response (the assistant's reply)
$\theta$ : the model parameters
$\pi_\theta$ : the model's conditional distribution over tokens

The data

The data is the project. SFT corpora typically contain between $10^4$ and $10^6$ pairs, with the upper end now common at frontier labs. The lower end is enough to teach a base model the formatting conventions of a chat assistant; the upper end is needed to cover the long tail of skills users actually demand. There are four broad sources, often combined.

The first is human-written demonstrations. Skilled annotators are given a prompt and asked to write the ideal response. OpenAI's original InstructGPT corpus was assembled this way; Anthropic's Helpful and Harmless data was collected with similar human effort, though largely as preference comparisons rather than written demonstrations. Quality is high but cost per example runs from around a pound to several pounds depending on length and the domain expertise required. Medical, legal and code review prompts can run an order of magnitude higher. The OpenAssistant project (LAION, 2023) crowdsourced a comparable dataset openly, with volunteers writing both prompts and responses and rating each other's work; the result was roughly 161,000 messages across 35 languages.

The second is distillation from a stronger model. A frontier model is prompted with a large set of instructions and its outputs are used as training targets for a smaller student. Alpaca (Stanford, 2023) was the canonical demonstration: 52,000 instructions generated by GPT-3.5 from 175 seed tasks, used to fine-tune the 7B Llama base into a passable assistant for a few hundred dollars of compute. Vicuna, WizardLM, OpenHermes and the Dolphin family followed the same pattern. Distillation is now standard practice; the legal status of using closed-model outputs for training is contested but widely ignored in the open-source community.

The third is mined dialogue from public sources. ShareGPT collected user-submitted ChatGPT transcripts; forum scrapes of Stack Exchange, Quora and specialist boards yield naturally occurring question-answer pairs. The data is cheap and abundant but uneven, and copyright provenance is murky.

The fourth is task-format conversion. FLAN (Wei et al., 2022) and T0 reformatted hundreds of existing academic NLP datasets (translation, summarisation, classification, reading comprehension) into instruction-response form, taking advantage of decades of curated supervised data. These taught models to follow instructions in the abstract but produced stilted assistants compared to natural-style demonstrations.

A consistent finding, crystallised in the LIMA paper (Zhou et al., 2023), is that quality dominates quantity. A thousand carefully written examples can outperform a million scraped ones. The ceiling is set by the worst examples in the set, not the best, because the model will happily learn the failure modes alongside the successes. Modern pipelines therefore spend most of their human budget on filtering rather than writing.

The training

Given pairs $(x_i, y_i)$ where $x_i$ is the instruction and $y_i$ the response, the SFT objective is

$$ \mathcal{L}_{\text{SFT}}(\theta) = -\sum_i \sum_{t=1}^{|y_i|} \log \pi_\theta(y_{i,t} \mid x_i, y_{i,<t}), $$

where $y_{i,<t}$ denotes the response tokens preceding position $t$.

Three details matter. First, the loss is masked on the instruction. Each training example concatenates the instruction tokens and the response tokens into a single sequence, and the model sees the whole thing during the forward pass, but the gradient is computed only at positions that lie in the response. This is essential. If we computed loss on the instruction tokens too, the model would be trained to generate plausible user prompts as well as plausible assistant replies, which is precisely the unfocused behaviour we are trying to remove. In practice the mask is implemented as a per-token weight in the cross-entropy: ones for response tokens, zeros for instruction and padding tokens. A common bug in homemade fine-tuning scripts is to forget the mask, after which the model develops an odd habit of completing the user's question rather than answering it.
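The mask is easiest to see in code. The sketch below is a minimal PyTorch rendering of the objective above; the tensor names and shapes are ours, and a real pipeline would fold this into its trainer rather than call it standalone.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             loss_mask: torch.Tensor) -> torch.Tensor:
    """Masked next-token cross-entropy over the response tokens only.

    logits:    (batch, seq, vocab) model outputs for instruction + response
    input_ids: (batch, seq) the concatenated token ids
    loss_mask: (batch, seq) 1.0 on response tokens, 0.0 on instruction
               and padding tokens
    """
    # Shift by one: the logits at position t predict the token at t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = loss_mask[:, 1:]

    # Per-token negative log-likelihood, kept unreduced so we can mask it.
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    # Zero out instruction/padding positions; average over response tokens.
    return (nll * shift_mask).sum() / shift_mask.sum().clamp(min=1)
```

Deleting the multiplication by `shift_mask` reproduces exactly the bug described above: the model is then trained to generate the user's prompt as readily as the reply.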

Second, the learning rate is small. Pretraining typically runs at peak learning rates around $3 \times 10^{-4}$ to $6 \times 10^{-4}$ with cosine decay; SFT runs at $10^{-5}$ to $5 \times 10^{-5}$, one or two orders of magnitude lower. The reason is that the base model's weights already encode an enormous amount of useful structure, and a large learning rate would stamp on it. Catastrophic forgetting, where the model loses general capability while learning the new task, is the failure mode we are guarding against. A short linear warmup over the first few hundred steps, followed by cosine or constant decay, is standard.

Third, few epochs. Two to four passes over the SFT corpus is typical. Going further begins to overfit: the model memorises the response phrasing of its training set and loses generalisation. Some practitioners run a single epoch on a very large corpus and report better held-out evaluation than three epochs on a smaller one; the right answer depends on corpus size and quality. The effective batch size is usually large (256 to 1024 sequences), achieved through gradient accumulation, with sequences packed up to the context length to make full use of each forward pass. Mixed-precision training (bfloat16 weights and activations, fp32 master weights for the optimiser) is standard. The optimiser is AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$ and weight decay $0.1$.
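Put together, the loop is short. The sketch below wires the `sft_loss` from earlier into AdamW with the warmup-plus-cosine schedule and gradient accumulation just described; the model is assumed to be a causal LM whose forward pass returns an object with a `.logits` field (as HuggingFace models do), and every numeric value is illustrative rather than prescriptive.

```python
import math
import torch

def train_sft(model, dataloader, total_steps=3_000, warmup_steps=300,
              peak_lr=2e-5, accum_steps=8):
    """Minimal SFT loop: AdamW, linear warmup, cosine decay, accumulation."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        # Linear warmup over the first few hundred steps, then cosine decay.
        if step < warmup_steps:
            return (step + 1) / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for step, batch in enumerate(dataloader):
        logits = model(batch["input_ids"]).logits
        loss = sft_loss(logits, batch["input_ids"], batch["loss_mask"])
        (loss / accum_steps).backward()        # gradient accumulation
        if (step + 1) % accum_steps == 0:      # one optimiser step per
            optimizer.step()                   # accum_steps micro-batches
            scheduler.step()
            optimizer.zero_grad()
```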

A 7B model can be SFT-tuned on a single 8-GPU node in a few hours; a 70B model requires fully sharded data parallel (FSDP) or tensor parallelism across multiple nodes. Either way, SFT costs a small fraction of pretraining, which matters because the SFT corpus is the lever practitioners actually pull when adapting a model to a new domain.

Worked example

Concretely, consider taking Llama-3-8B-Base, a base model with no instruction-following capability, and fine-tuning it on the 52,000-example Alpaca dataset. The pipeline is straightforward. Each example becomes a string of the form `### Instruction: {x}\n\n### Response: {y}` (or, in modern setups, the same content wrapped in `<|user|>` and `<|assistant|>` chat template tokens). The instruction tokens get loss weight zero; the response tokens get loss weight one. We train for three epochs at peak learning rate $2 \times 10^{-5}$, batch size 128, sequence length 2048, on eight A100 GPUs for around four hours.
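In code, the data preparation step looks something like the sketch below. It uses the HuggingFace convention of marking no-loss positions with a label of $-100$ (equivalent to the zero-weight mask described earlier); the template string matches the Alpaca format quoted above, and the tokenizer is assumed to be the base model's HuggingFace tokenizer.

```python
IGNORE_INDEX = -100  # HuggingFace convention: these positions get no loss

PROMPT_TEMPLATE = "### Instruction: {instruction}\n\n### Response: "

def build_example(tokenizer, instruction: str, response: str,
                  max_len: int = 2048) -> dict:
    """Turn one Alpaca-style pair into (input_ids, labels) with the
    instruction portion masked out of the loss."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + response_ids)[:max_len]
    # Loss weight zero on the instruction, one on the response.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```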

The result is dramatically more usable. The fine-tuned model now follows simple instructions, writes structured responses, and stops continuing the user's prompt as if it were narration. It is a competent first-pass assistant. It is also, when compared with frontier models, recognisably limited. Reasoning is shallow. Refusals are inconsistent: the model sometimes refuses harmless requests because the Alpaca data contains a few clumsy refusal examples, and sometimes complies with clearly harmful requests because Alpaca contains almost no safety training. Long-form writing is repetitive. The model is helpful enough for a demo and useless enough to make clear why frontier labs spend tens of millions on subsequent RLHF and reasoning post-training. SFT gets you from base to assistant; it does not get you to the frontier.

LoRA fine-tuning

Full fine-tuning updates every parameter of the model, which means storing optimiser state for billions of weights. AdamW alone needs eight bytes per parameter for the moment estimates, on top of the parameters themselves and the gradient. A 7B model in bfloat16 with full AdamW state needs roughly 80 GB of GPU memory before activations, manageable on a single A100, painful on consumer hardware.
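One way to arrive at that figure, as back-of-envelope arithmetic (activation memory excluded):

```python
params = 7e9                   # 7B parameters
weights_bf16 = 2 * params      # 14 GB: bfloat16 weights, 2 bytes each
grads_bf16   = 2 * params      # 14 GB: bfloat16 gradients
adamw_state  = 8 * params      # 56 GB: two fp32 moment estimates, 4 bytes each
total_gb = (weights_bf16 + grads_bf16 + adamw_state) / 1e9
print(total_gb)                # ~84 GB before activations
```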

LoRA, low-rank adaptation (Hu et al., 2021), is the workaround. The insight is that the update $\Delta W$ that fine-tuning applies to a weight matrix $W$ is empirically low rank: most of the change can be captured by $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r$ is small (4 to 64). We freeze $W$, train only $A$ and $B$, and at inference either keep them as a separate adapter or merge them back: $W' = W + BA$. The number of trainable parameters drops by a factor of 100 to 1000. A 7B LoRA fine-tune fits comfortably on a single 24 GB consumer GPU. Quality is, for most tasks, indistinguishable from full fine-tuning.
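A from-scratch LoRA layer is only a few lines. The sketch below wraps a frozen `nn.Linear` with the trainable low-rank pair, using the common $\alpha/r$ scaling from Hu et al. (2021); initialising $B$ at zero makes the adapter a no-op at the start of training. The class and argument names are ours, not from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze W (and bias)
        # A: (r, in_features), small Gaussian init; B: (out_features, r), zeros.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self):
        """Fold the adapter into the base weight: W' = W + (alpha/r) BA.
        After merging, use the base layer directly, not this wrapper."""
        self.base.weight += self.scale * (self.B @ self.A)
```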

LoRA is the backbone of consumer-grade fine-tuning. HuggingFace's PEFT library and Unsloth provide turnkey implementations; QLoRA (Dettmers et al., 2023) extends the idea to 4-bit quantised base weights, allowing 70B-scale fine-tuning on a single 48 GB GPU. The economic consequence is significant: it has democratised the ability to specialise a frontier base model, and most domain-specific medical, legal and coding assistants in 2026 are LoRA adapters layered onto a public base.
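In practice most people reach for PEFT rather than writing the layer themselves. A minimal usage sketch, assuming a Llama-style base model whose attention projections are named `q_proj`, `k_proj`, `v_proj` and `o_proj` (the model id and hyperparameter values here are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```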

Limits of SFT alone

SFT teaches a model to imitate the responses in its dataset. It cannot teach the model to be better than its dataset. Several characteristic failure modes survive SFT and require RLHF, DPO or similar refinement to address.

  1. Sycophancy: the model agrees with whatever the user appears to believe, because its training data contained agreeable responses far more often than principled disagreement.
  2. Refusal inconsistency: the same harmful request phrased two different ways gets two different answers, because the boundary between refusal and compliance was demonstrated in only a few hundred SFT examples.
  3. Scope insensitivity: the model writes a one-paragraph answer to a question that needed a sentence and a sentence to one that needed an essay, because length calibration is hard to encode in demonstrations.
  4. Hallucination: the model confabulates facts, because the SFT loss rewards plausible-sounding text without distinguishing it from truthful text.
  5. Style drift: the model adopts whatever register dominated the dataset, regardless of context.

These behaviours are not bugs in the SFT recipe; they are the natural consequence of training on a finite set of demonstrations. Closing them requires a signal richer than imitation, either a learned reward model (RLHF, §15.5) or a contrastive objective over preference pairs (DPO, §15.6).

What you should take away

  1. SFT is the bridge from base model to assistant. Continue next-token training on (instruction, response) pairs with the loss masked to the response tokens only.
  2. Data quality dominates everything else. A few thousand careful examples beat a million noisy ones; the floor of the dataset sets the ceiling of the model.
  3. The hyperparameters are conservative. Learning rate one to two orders of magnitude below pretraining, two to four epochs, large effective batch, AdamW, bfloat16, optional FSDP for big models.
  4. LoRA makes fine-tuning cheap. Low-rank adapters with $r = 8$ to $64$ recover most of the quality of full fine-tuning at a hundredth of the memory and compute, and have democratised domain adaptation.
  5. SFT alone is not enough for the frontier. Sycophancy, hallucination, refusal inconsistency and scope insensitivity all survive SFT. Closing the gap to frontier behaviour is the job of RLHF, DPO and the reasoning-specific methods covered in the following sections.
