LoRA (Hu et al. 2021) freezes a pretrained model and injects trainable low-rank matrices $A \in \mathbb{R}^{r\times d}$, $B \in \mathbb{R}^{d\times r}$ into selected linear layers, so the effective weight is
$$W' = W_0 + \frac{\alpha}{r} \cdot B A,$$
with $W_0$ frozen and only $A, B$ trained. With $r=8$ on a 7B model this comes to roughly 4M
trainable parameters (~0.06% of the model), and the whole fine-tune fits on a 24 GB consumer GPU in bf16.
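As a back-of-envelope check on that count (a sketch assuming a Llama-7B-like shape, 32 layers with hidden size 4096, and adapters on q_proj and v_proj only, as in the original LoRA setup):

hidden, layers, r = 4096, 32, 8
per_adapter = r * hidden + hidden * r        # A is (r x d), B is (d x r)
total = per_adapter * 2 * layers             # q_proj and v_proj in every layer
print(f"{total/1e6:.1f}M trainable params")  # ~4.2M, about 0.06% of 7B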
Data pipeline. Instruction-following dataset (e.g. Alpaca, 52k examples, or OpenAssistant). Format each example as a single string:
<s>[INST] {instruction}\n{input} [/INST] {output} </s>
Tokenise with the base model's tokeniser. Mask the loss on the prompt so that only the response tokens are trained on:
def tokenise(ex):
    # Prompt and response concatenated; the tokenizer adds the BOS ("<s>") itself.
    prompt = f"[INST] {ex['instruction']}\n{ex['input']} [/INST] "
    full = prompt + ex["output"] + tokenizer.eos_token
    ids = tokenizer(full).input_ids
    labels = ids.copy()
    # Mask the prompt (and BOS) so the loss only covers response tokens.
    prompt_len = len(tokenizer(prompt).input_ids)
    labels[:prompt_len] = [-100] * prompt_len   # ignore prompt in loss
    return {"input_ids": ids, "labels": labels}
Where to inject LoRA. Standard practice is to inject into the attention projections
(q_proj, k_proj, v_proj, o_proj). For best quality, also inject into the MLP projections
(up_proj, down_proj, gate_proj). Skip embeddings, LayerNorms, and the LM head.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32, dropout=0.05):
        super().__init__()
        self.base = base                          # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        # A: (r, in_features) with small random init; B: (out_features, r) zero-initialised,
        # so B @ A = 0 and the adapted layer matches the base exactly at step 0.
        self.A = nn.Parameter(torch.randn(r, base.in_features, device=base.weight.device) / r**0.5)
        self.B = nn.Parameter(torch.zeros(base.out_features, r, device=base.weight.device))
        self.scale = alpha / r
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # W0 x + (alpha/r) * B A x, with dropout on the low-rank path
        return self.base(x) + self.drop(x @ self.A.T) @ self.B.T * self.scale
def inject_lora(model, target_names=("q_proj", "v_proj", "o_proj"), r=16, alpha=32):
    # Walk every module and swap matching nn.Linear children for LoRALinear wrappers.
    for name, mod in model.named_modules():
        for child in list(mod._modules):
            if child in target_names and isinstance(mod._modules[child], nn.Linear):
                mod._modules[child] = LoRALinear(mod._modules[child], r, alpha)
Initialising $B=0$ is critical: it makes the adapted model bit-identical to the base model at step 0, so training starts from the pretrained behaviour and cannot be thrown off by a bad random initialisation.
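A quick smoke test of that property (hypothetical check, using the LoRALinear class above):

layer = nn.Linear(4096, 4096)
wrapped = LoRALinear(layer, r=16, alpha=32, dropout=0.0)
x = torch.randn(2, 4096)
assert torch.allclose(layer(x), wrapped(x))  # B = 0, so the adapter adds exactly nothing at step 0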
Hyperparameters.
- Rank $r=8$ or $16$, $\alpha = 2r$.
- LoRA dropout 0.05.
- AdamW, $\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.0 on adapters.
- Peak LR $3\times 10^{-4}$ (about 10× higher than typical full fine-tuning; the adapters are small and the frozen base keeps optimisation stable). Cosine decay to 0 over 3 epochs.
- Effective batch size 128 (e.g. micro-batch 4 × grad-accum 32 on a single GPU).
- bf16 training; load the base in NF4 quantisation (QLoRA), which shrinks the base weights from 2 bytes to roughly 0.5 bytes per parameter.
Training loop.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "Llama-3-8B",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16))
inject_lora(base, target_names=("q_proj", "k_proj", "v_proj", "o_proj",
                                "up_proj", "down_proj", "gate_proj"), r=16, alpha=32)

trainable = [p for p in base.parameters() if p.requires_grad]   # only the A/B matrices
print(f"trainable: {sum(p.numel() for p in trainable)/1e6:.1f}M")

epochs, grad_accum = 3, 32   # micro-batch 4 × grad-accum 32 = effective batch 128
opt = torch.optim.AdamW(trainable, lr=3e-4, betas=(0.9, 0.999), weight_decay=0.0)
sched = cosine_schedule(opt, total_steps=epochs * len(loader) // grad_accum)

for epoch in range(epochs):
    for step, batch in enumerate(loader):
        batch = {k: v.to(base.device) for k, v in batch.items()}
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = base(**batch)              # batch holds input_ids, attention_mask, labels
            loss = out.loss / grad_accum     # scale so accumulated gradients average correctly
        loss.backward()
        if (step + 1) % grad_accum == 0:
            torch.nn.utils.clip_grad_norm_(trainable, 1.0)
            opt.step(); sched.step(); opt.zero_grad()
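cosine_schedule above is left undefined; a minimal version built on torch.optim.lr_scheduler.LambdaLR (pure cosine decay from the peak LR to 0, matching the hyperparameters above) could look like this:

import math

def cosine_schedule(opt, total_steps):
    # LambdaLR multiplies the peak LR by this factor; it decays from 1 to 0 on a half-cosine.
    return torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: 0.5 * (1 + math.cos(math.pi * min(step, total_steps) / total_steps)))

transformers.get_cosine_schedule_with_warmup is a drop-in alternative if a short warmup is wanted.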
Backward pass. Only $A, B$ have requires_grad=True, so autograd only allocates gradient
buffers for them. Activations through the frozen base must still be saved for the backward
pass, so activation memory is unchanged; gradient checkpointing trades extra recomputation for a large reduction in that activation memory.
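With a transformers model, the checkpointing toggle is a few lines (a sketch; enable_input_require_grads is needed so gradients can flow to the adapters through checkpointed blocks whose own weights are frozen):

base.gradient_checkpointing_enable()   # recompute activations during backward instead of storing them
base.enable_input_require_grads()      # required when all base weights are frozen
base.config.use_cache = False          # the KV cache is incompatible with checkpointing during training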
Compute estimate. ~3 epochs over 52k examples × ~512 tokens ≈ 80M training tokens; on a single A100 this is roughly 3-6 hours. The final adapter is a ~30 MB file (exact size depends on rank and which modules are targeted); distribute only the adapter and merge it into the base weights at load time.
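To produce that small artifact, save only the adapter tensors rather than the full state dict (a sketch against the LoRALinear class above; the filename is arbitrary):

adapter_state = {k: v for k, v in base.state_dict().items()
                 if k.endswith(".A") or k.endswith(".B")}
torch.save(adapter_state, "lora_adapter.pt")   # tens of MB vs. the multi-GB base checkpoint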
Pitfalls. Forgetting to mask prompt tokens in the loss makes the model echo
instructions back. Setting $\alpha = r$ (instead of $2r$) halves the adapter's effective contribution.
Targeting only q_proj/v_proj saves parameters but tends to underfit; include the MLP layers
for instruction-following. Loading the base in fp32 instead of bf16 wastes 2× memory, and on
Ampere+ GPUs bf16 is also preferable to fp16 for numerical stability. Merging adapters into the base,
$W \leftarrow W_0 + (\alpha/r) BA$, for deployment removes the runtime cost, as shown below.
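A merge sketch for the LoRALinear module defined above (assumes the base layers are plain bf16 nn.Linear, i.e. not 4-bit quantised; the merged model then runs with no extra matmuls):

@torch.no_grad()
def merge_lora(model):
    for name, mod in model.named_modules():
        for child in list(mod._modules):
            if isinstance(mod._modules[child], LoRALinear):
                lora = mod._modules[child]
                # W <- W0 + (alpha/r) * B @ A, folded into the frozen base weight
                delta = lora.scale * (lora.B @ lora.A)
                lora.base.weight += delta.to(lora.base.weight.dtype)
                mod._modules[child] = lora.base   # swap the plain Linear back in
    return model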
Related terms: LoRA, Transformer, Adam, Mixed Precision Training
Discussed in:
- Chapter 11: CNNs, Parameter-Efficient Fine-Tuning
- Chapter 13: Attention & Transformers, Adapting LLMs