LoRA (Hu et al. 2021) freezes a pretrained model and injects trainable low-rank matrices $A \in \mathbb{R}^{r\times d}$, $B \in \mathbb{R}^{d\times r}$ into selected linear layers, so the effective weight is
$$W' = W_0 + \frac{\alpha}{r} \cdot B A,$$
with $W_0$ frozen and only $A, B$ trained. With $r=8$ on a 7B model this comes to roughly 4M
trainable parameters (~0.06% of the model), and the whole fine-tune fits on a 24 GB consumer GPU in bf16.
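As a back-of-envelope check on that count (a sketch assuming a Llama-7B-like shape, 32 layers with hidden size 4096, and adapters on q_proj and v_proj only, as in the original LoRA setup):

hidden, layers, r = 4096, 32, 8
per_adapter = r * hidden + hidden * r        # A is (r x d), B is (d x r)
total = per_adapter * 2 * layers             # q_proj and v_proj in every layer
print(f"{total/1e6:.1f}M trainable params")  # ~4.2M, about 0.06% of 7B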
Data pipeline. Instruction-following dataset (e.g. Alpaca, 52k examples, or OpenAssistant). Format each example as a single string:
<s>[INST] {instruction}\n{input} [/INST] {output} </s>
Tokenise with the base model's tokeniser. Mask the loss on the prompt so that only the response tokens are trained on:
def tokenise(ex):
    # Prompt and response concatenated; the tokenizer adds the BOS ("<s>") itself.
    prompt = f"[INST] {ex['instruction']}\n{ex['input']} [/INST] "
    full = prompt + ex["output"] + tokenizer.eos_token
    ids = tokenizer(full).input_ids
    labels = ids.copy()
    # Mask the prompt (and BOS) so the loss only covers response tokens.
    prompt_len = len(tokenizer(prompt).input_ids)
    labels[:prompt_len] = [-100] * prompt_len   # ignore prompt in loss
    return {"input_ids": ids, "labels": labels}
Where to inject LoRA. Standard practice is to inject into the attention projections
(q_proj, k_proj, v_proj, o_proj). For best quality, also inject into the MLP projections
(up_proj, down_proj, gate_proj). Skip embeddings, LayerNorms, and the LM head.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32, dropout=0.05):
        super().__init__()
        self.base = base                          # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        # A: (r, in_features) with small random init; B: (out_features, r) zero-initialised,
        # so B @ A = 0 and the adapted layer matches the base exactly at step 0.
        self.A = nn.Parameter(torch.randn(r, base.in_features, device=base.weight.device) / r**0.5)
        self.B = nn.Parameter(torch.zeros(base.out_features, r, device=base.weight.device))
        self.scale = alpha / r
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # W0 x + (alpha/r) * B A x, with dropout on the low-rank path
        return self.base(x) + self.drop(x @ self.A.T) @ self.B.T * self.scale
def inject_lora(model, target_names=("q_proj", "v_proj", "o_proj"), r=16, alpha=32):
    # Walk every module and swap matching nn.Linear children for LoRALinear wrappers.
    for name, mod in model.named_modules():
        for child in list(mod._modules):
            if child in target_names and isinstance(mod._modules[child], nn.Linear):
                mod._modules[child] = LoRALinear(mod._modules[child], r, alpha)
Initialising $B=0$ is critical: it makes the adapted model bit-identical to the base model at step 0, so training starts from the pretrained behaviour and cannot be thrown off by a bad random initialisation.
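A quick smoke test of that property (hypothetical check, using the LoRALinear class above):

layer = nn.Linear(4096, 4096)
wrapped = LoRALinear(layer, r=16, alpha=32, dropout=0.0)
x = torch.randn(2, 4096)
assert torch.allclose(layer(x), wrapped(x))  # B = 0, so the adapter adds exactly nothing at step 0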
Hyperparameters.
- Rank $r=8$ or $16$, $\alpha = 2r$.
- LoRA dropout 0.05.
- AdamW, $\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.0 on adapters.
- Peak LR $3\times 10^{-4}$ (about 10× higher than typical full fine-tuning; the adapters are small and the frozen base keeps optimisation stable). Cosine decay to 0 over 3 epochs.
- Effective batch size 128 (e.g. micro-batch 4 × grad-accum 32 on a single GPU).
- bf16 training; load the base in NF4 quantisation (QLoRA), which shrinks the base weights from 2 bytes to roughly 0.5 bytes per parameter.
Training loop.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "Llama-3-8B",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16))
inject_lora(base, target_names=("q_proj", "k_proj", "v_proj", "o_proj",
                                "up_proj", "down_proj", "gate_proj"), r=16, alpha=32)

trainable = [p for p in base.parameters() if p.requires_grad]   # only the A/B matrices
print(f"trainable: {sum(p.numel() for p in trainable)/1e6:.1f}M")

epochs, grad_accum = 3, 32   # micro-batch 4 × grad-accum 32 = effective batch 128
opt = torch.optim.AdamW(trainable, lr=3e-4, betas=(0.9, 0.999), weight_decay=0.0)
sched = cosine_schedule(opt, total_steps=epochs * len(loader) // grad_accum)

for epoch in range(epochs):
    for step, batch in enumerate(loader):
        batch = {k: v.to(base.device) for k, v in batch.items()}
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = base(**batch)              # batch holds input_ids, attention_mask, labels
            loss = out.loss / grad_accum     # scale so accumulated gradients average correctly
        loss.backward()
        if (step + 1) % grad_accum == 0:
            torch.nn.utils.clip_grad_norm_(trainable, 1.0)
            opt.step(); sched.step(); opt.zero_grad()
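cosine_schedule above is left undefined; a minimal version built on torch.optim.lr_scheduler.LambdaLR (pure cosine decay from the peak LR to 0, matching the hyperparameters above) could look like this:

import math

def cosine_schedule(opt, total_steps):
    # LambdaLR multiplies the peak LR by this factor; it decays from 1 to 0 on a half-cosine.
    return torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: 0.5 * (1 + math.cos(math.pi * min(step, total_steps) / total_steps)))

transformers.get_cosine_schedule_with_warmup is a drop-in alternative if a short warmup is wanted.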
Backward pass. Only $A, B$ have requires_grad=True, so autograd only allocates gradient
buffers for them. Activations through the frozen base must still be saved for the backward
pass, so activation memory is unchanged; gradient checkpointing trades extra recomputation for a large reduction in that activation memory.
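With a transformers model, the checkpointing toggle is a few lines (a sketch; enable_input_require_grads is needed so gradients can flow to the adapters through checkpointed blocks whose own weights are frozen):

base.gradient_checkpointing_enable()   # recompute activations during backward instead of storing them
base.enable_input_require_grads()      # required when all base weights are frozen
base.config.use_cache = False          # the KV cache is incompatible with checkpointing during training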
Compute estimate. ~3 epochs over 52k examples × ~512 tokens ≈ 80M training tokens; on a single A100 this is roughly 3-6 hours. The final adapter is a ~30 MB file (exact size depends on rank and which modules are targeted); distribute only the adapter and merge it into the base weights at load time.
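To produce that small artifact, save only the adapter tensors rather than the full state dict (a sketch against the LoRALinear class above; the filename is arbitrary):

adapter_state = {k: v for k, v in base.state_dict().items()
                 if k.endswith(".A") or k.endswith(".B")}
torch.save(adapter_state, "lora_adapter.pt")   # tens of MB vs. the multi-GB base checkpoint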
Pitfalls. Forgetting to mask prompt tokens in the loss makes the model echo
instructions back. Setting $\alpha = r$ (instead of $2r$) halves the adapter's effective contribution.
Targeting only q_proj/v_proj saves parameters but tends to underfit; include the MLP layers
for instruction-following. Loading the base in fp32 instead of bf16 wastes 2× memory, and on
Ampere+ GPUs bf16 is also preferable to fp16 for numerical stability. Merging adapters into the base,
$W \leftarrow W_0 + (\alpha/r) BA$, for deployment removes the runtime cost, as shown below.
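A merge sketch for the LoRALinear module defined above (assumes the base layers are plain bf16 nn.Linear, i.e. not 4-bit quantised; the merged model then runs with no extra matmuls):

@torch.no_grad()
def merge_lora(model):
    for name, mod in model.named_modules():
        for child in list(mod._modules):
            if isinstance(mod._modules[child], LoRALinear):
                lora = mod._modules[child]
                # W <- W0 + (alpha/r) * B @ A, folded into the frozen base weight
                delta = lora.scale * (lora.B @ lora.A)
                lora.base.weight += delta.to(lora.base.weight.dtype)
                mod._modules[child] = lora.base   # swap the plain Linear back in
    return model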
Related terms: LoRA, Transformer, Adam, Mixed Precision Training
Discussed in:
- Chapter 11: CNNs, Parameter-Efficient Fine-Tuning
- Chapter 13: Attention & Transformers, Adapting LLMs