LoRA (Low-Rank Adaptation), introduced by Edward Hu et al. at Microsoft Research in 2021, is a parameter-efficient fine-tuning method that freezes the weights of a pre-trained model and injects trainable low-rank decomposition matrices alongside them. Specifically, for a frozen weight matrix W ∈ ℝ^{d×k}, LoRA adds a learnable update ΔW = BA, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and r ≪ min(d, k). Typical choices are r = 8 or 16.
The result is a fine-tuning method that:
- reduces the number of trainable parameters by orders of magnitude (often >100×);
- requires correspondingly less GPU memory during training;
- produces small adapter checkpoints (often <100MB) that can be plugged in or out of the base model at inference time;
- can be merged into the base weights for zero inference overhead.
LoRA enables fine-tuning of multi-billion-parameter LLMs on consumer GPUs that could not otherwise hold the optimiser state for full fine-tuning. The method is now the dominant fine-tuning approach for open-source LLM adaptation.
The QLoRA extension (Dettmers et al., 2023) combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of 65B-parameter models on a single 48GB GPU. The Hugging Face PEFT library makes LoRA-style adapters trivial to deploy in practice. Many specialised LoRA variants (DoRA, AdaLoRA, GLoRA) refine the basic technique.
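For illustration, a minimal PEFT recipe might look like the following (the model name and hyperparameter values are placeholders, and argument names can shift between PEFT versions):

```python
# A minimal LoRA fine-tuning setup with Hugging Face PEFT.
# Model name and hyperparameters are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling numerator (alpha = 2r here)
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```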
Mathematics
For a frozen pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA learns a low-rank update $\Delta W = B A$ where
$$B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k).$$
The full matrix used in the forward pass is $W_0 + \Delta W = W_0 + B A$. Only $B$ and $A$ are learned; $W_0$ is frozen.
Total trainable parameters: $r(d + k)$, compared with $dk$ for full fine-tuning. For a typical Transformer attention matrix with $d = k = 4096$ and $r = 8$:
- Full fine-tuning: $dk = 16{,}777{,}216$ parameters
- LoRA: $r(d+k) = 65{,}536$ parameters, a 256× reduction (checked numerically in the snippet below).
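A quick numeric check of that 256× figure:

```python
d = k = 4096
r = 8
full_ft = d * k        # 16,777,216 trainable parameters (full fine-tuning)
lora = r * (d + k)     # 65,536 trainable parameters (LoRA)
print(full_ft / lora)  # 256.0
```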
Initialisation: $B = 0$, $A \sim \mathcal{N}(0, \sigma^2)$, so $\Delta W = 0$ at the start of fine-tuning and the model behaves identically to the base. This guarantees that fine-tuning starts from the pre-trained behaviour and diverges only as needed.
Scaling: in practice the update is scaled by $\alpha / r$ with $\alpha$ a fixed hyperparameter:
$$h = W_0 x + \frac{\alpha}{r} B A x$$
Common choice: $\alpha = 2r$.
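Putting the definition, initialisation, and scaling together, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (an illustrative implementation, not the reference code; the `LoRALinear` name and the $\sigma = 0.01$ initialisation are assumptions for the example):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W0 plus trainable low-rank update (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W0 (and its bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # B = 0, A ~ N(0, sigma^2): the update Delta W = B A starts at exactly zero.
        self.B = nn.Parameter(torch.zeros(d, r))
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) B A x; applying A before B keeps the
        # intermediate activation r-dimensional.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

For a 4096×4096 attention projection, `LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)` trains exactly the 65,536 parameters computed above.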
Inference: for deployment, the update can be merged into the base weights as $W = W_0 + (\alpha/r) B A$, giving zero inference overhead. Multiple LoRA adapters for different tasks can be swapped in or out at inference time.
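Merging is a one-time fold of the update into the frozen weight; a sketch continuing the hypothetical `LoRALinear` above:

```python
@torch.no_grad()
def merge(layer: LoRALinear) -> nn.Linear:
    """Fold the low-rank update into the frozen weight: W = W0 + (alpha/r) B A."""
    layer.base.weight += layer.scale * (layer.B @ layer.A)
    return layer.base  # a plain nn.Linear with zero extra inference cost
```

In the PEFT library, `model.merge_and_unload()` performs the equivalent fold.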
QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantisation of the frozen base, enabling fine-tuning of 65B models on a single 48GB GPU. The base is quantised to NF4 (a normal-distribution-aware 4-bit format), the LoRA matrices remain in FP16/BF16, and double-quantisation compresses the quantisation constants themselves to save additional memory.
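A typical QLoRA loading recipe with `transformers`, `bitsandbytes`, and `peft` looks roughly like the following (model name and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # normal-distribution-aware 4-bit format
    bnb_4bit_use_double_quant=True,   # quantise the quantisation constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
# The LoRA matrices stay in bf16 on top of the frozen 4-bit base.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```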
LoRA's empirical success (Hu et al., 2021) demonstrates that the intrinsic dimension of fine-tuning updates is small: most of the change between a pre-trained and a fine-tuned model is captured by a low-rank perturbation. This has been a recurring methodological observation: deep models live on a much lower-dimensional manifold than their parameter count suggests.
Related terms: edward-hu, Fine-Tuning
Discussed in:
- Chapter 15: Modern AI