Glossary

Knowledge Distillation

Also known as: KD

Knowledge distillation (Hinton, Vinyals & Dean, 2015) is a compression technique in which a small student model is trained to imitate a large teacher model's predictions rather than the original training labels. The premise is that the teacher's full output distribution carries far more information than a one-hot label: when an image classifier sees a Siamese cat, the relative probabilities it assigns to "Persian cat" (high), "Burmese cat" (high), "fox" (low) and "fire engine" (negligible) encode learned similarities that the bare label "Siamese cat" does not. Hinton called this the dark knowledge in the teacher's outputs.

Distillation operationalises this with a temperature-scaled softmax. Given pre-softmax logits $z_i$, the standard softmax gives probabilities $p_i = \exp(z_i)/\sum_j \exp(z_j)$. The temperature-$T$ softmax is

$$p_i^{(T)} = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}.$$
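
To make the effect of $T$ concrete, here is a minimal numpy sketch; the logits are invented for illustration and follow the Siamese-cat example above.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax over a vector of logits z."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Invented logits for (Siamese, Persian, Burmese, fox, fire engine)
logits = [9.0, 6.0, 5.5, 1.0, -3.0]
for T in (1, 4, 20):
    print(f"T={T:>2}:", np.round(softmax_T(logits, T), 3))
# T=1 is nearly one-hot; higher T exposes the similarity structure.
```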

At $T = 1$ this recovers the ordinary softmax; as $T \to \infty$ it tends towards the uniform distribution; at intermediate $T > 1$ the distribution is softened, so the structure among the low-probability classes becomes visible. The teacher's logits $z^\mathrm{teacher}$ and the student's logits $z^\mathrm{student}$ are both passed through the same temperature, and the student is trained to minimise

$$\mathcal{L}_\mathrm{KD} = T^2 \cdot D_\mathrm{KL}\!\left(p^{\mathrm{teacher}, T} \,\big\|\, p^{\mathrm{student}, T}\right).$$

The $T^2$ scaling preserves gradient magnitudes: the gradients from the soft targets scale as $1/T^2$, so without it they would become vanishingly small at high temperature. The full training loss combines distillation with a small weight on the standard hard-label cross-entropy:

$$\mathcal{L} = \alpha \cdot \mathcal{L}_\mathrm{KD} + (1-\alpha) \cdot \mathcal{L}_\mathrm{CE}(y, p^{\mathrm{student}, T=1}).$$
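
A minimal PyTorch sketch of this combined loss; the defaults $T = 4$ and $\alpha = 0.9$ are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(labels, student)."""
    # Both logit sets pass through the same temperature.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # F.kl_div expects log-probs first, probs second; "batchmean" gives
    # the mean KL per example. The T^2 factor restores the gradient scale.
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
    # Hard-label cross-entropy at T = 1.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```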

In Hinton's original speech-recognition experiments, a single model distilled from a 10-model ensemble recovered most of the ensemble's accuracy gain over a singly-trained baseline. Subsequent work generalised the recipe in several directions:

  • Feature distillation (FitNets, Romero et al., 2015): the student matches not only the teacher's output but also intermediate hidden representations, possibly with a learned projection to bridge dimensionality differences (a minimal sketch follows this list).
  • Attention distillation (Zagoruyko & Komodakis, 2017): match the spatial attention maps between teacher and student CNNs.
  • Self-distillation (Furlanello et al., 2018): the teacher and student share an architecture; the student is trained from a teacher checkpoint, then becomes the teacher for the next round. Counter-intuitively, this often improves over the original teacher.
  • Sequence-level KD (Kim & Rush, 2016) for language models: the teacher generates outputs, and the student is trained on (input, teacher-output) pairs as if they were ground-truth, a particularly effective recipe for translation and modern instruction-tuned LLMs.
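
Here is a minimal sketch of the FitNets idea, assuming flattened feature vectors; the original paper uses a convolutional regressor for spatial feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: match a student hidden layer to a teacher
    hidden layer through a learned projection."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Learned projection bridges the dimensionality difference.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        # The teacher provides targets only; no gradients flow into it.
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())
```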

For modern LLMs, distillation underlies several flagship deployments. DistilBERT (Sanh et al., 2019) trained a 66M-parameter student to recover 97% of BERT-base's GLUE score with 40% fewer parameters. The small Gemma and Llama 3 variants are reported to use teacher-distillation signals during pretraining. In the agent and reasoning domain, distilling chain-of-thought traces from a frontier model (e.g. GPT-4) into a 7B student has produced systems that recover much of the teacher's reasoning quality at a fraction of the inference cost.

Distillation composes with pruning and quantisation: a common production pipeline starts from a large pre-trained teacher, trains a smaller student via distillation, prunes the student's redundant weights, and finally quantises to 4-bit for deployment (a sketch of the last two stages follows).
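
As a sketch of those last two stages (the student model and the 30% pruning ratio are illustrative; PyTorch's built-in dynamic quantisation is 8-bit, so 4-bit would typically go through external tooling such as GPTQ):

```python
import torch
import torch.nn.utils.prune as prune

# Stand-in for a student already trained with the distillation loss above.
student = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

# Prune 30% of each Linear layer's weights by L1 magnitude, then bake
# the pruning masks into the weights.
for module in student.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Dynamic 8-bit quantisation of the Linear layers for deployment.
quantised = torch.ao.quantization.quantize_dynamic(
    student, {torch.nn.Linear}, dtype=torch.qint8)
```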

Related terms: Pruning, Quantisation, Softmax, Transformer, GPTQ
