Glossary

Knowledge Distillation

Knowledge Distillation transfers the capabilities of a large "teacher" model to a smaller "student" model. The student is trained to match not just the teacher's hard predictions (argmax of the output distribution) but its soft predictions—the full probability distribution over the output space. Soft targets carry richer information than hard labels: a teacher's 0.7/0.2/0.1 distribution over three classes reveals that the second class is more similar to the correct one than the third, information completely absent from a one-hot label.
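The difference between hard and soft targets can be made concrete with a small sketch. The teacher logits below are made up for illustration; they happen to produce roughly the 0.7/0.2/0.1 distribution mentioned above.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over three classes.
teacher_logits = [2.0, 0.8, 0.1]

# Soft target: the full probability distribution.
soft = softmax(teacher_logits)  # ≈ [0.69, 0.21, 0.10]

# Hard target: one-hot vector at the argmax, discarding the
# information that class 2 is "closer" to the answer than class 3.
hard = [1 if i == soft.index(max(soft)) else 0 for i in range(len(soft))]
```

The soft vector preserves the teacher's learned class similarities; the hard vector reduces them to a single index.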

Hinton, Vinyals, and Dean (2015) introduced the modern formulation with a temperature parameter $T$ that softens both teacher and student output distributions via $\text{softmax}(z/T)$. Higher temperatures reveal more of the fine structure in the teacher's "dark knowledge"—the relationships it has learned between classes. The student's loss typically combines distillation loss (against the soft teacher targets) with standard cross-entropy loss against the true labels.
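A minimal sketch of that combined loss, in the spirit of Hinton et al. (2015). The temperature, weighting factor, and logits here are illustrative choices, not canonical values; the $T^2$ scaling on the soft term is the paper's correction for the way temperature shrinks soft-target gradients.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    # KL(p || q) for discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.7):
    # Soft term: match the temperature-softened teacher distribution,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_term = (T ** 2) * kl_div(p_teacher, p_student)
    # Hard term: standard cross-entropy against the true label at T = 1.
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1 - alpha) * hard_term
```

A student whose logits match the teacher's drives the soft term to zero, so the loss reduces to the (weighted) ordinary cross-entropy; framework implementations typically compute the same quantity with batched tensor ops.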

Distillation can produce student models 2–10× smaller than the teacher while retaining 90–95% of the teacher's performance. It underlies many widely deployed compact models, including DistilBERT (a distilled BERT), TinyBERT, and MobileBERT. Modern LLM distillation goes further: students are trained not just on teacher output distributions but on teacher chain-of-thought reasoning, allowing smaller models to emulate the reasoning capabilities of much larger ones. Combined with quantisation and pruning, distillation is essential for deploying AI at scale, on edge devices, and under resource constraints.

Related terms: Quantisation, Pruning

Also defined in: Textbook of AI