Knowledge Distillation
Training a smaller "student" model to mimic the outputs of a larger "teacher" so you get most of the quality at a fraction of the cost.
In one line
Train a small model to match the outputs (or intermediate representations) of a big model on the same inputs.
What it actually means
You run a teacher model on your training data and record its soft outputs — the full probability distribution, not just the top-1 label. The student is then trained to match that distribution, usually with a temperature-softened cross-entropy loss, sometimes combined with the original hard-label loss. The soft targets carry more signal than one-hot labels: if the teacher says “60% cat, 35% lynx, 5% dog”, the student learns that cats look like lynxes, which is a fact a one-hot label destroys.
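A quick numeric illustration of that temperature effect, using made-up logits for the three classes from the example (pure Python, no framework needed):

```python
import math

def softmax(logits, T=1.0):
    """Convert logits to probabilities, softened by temperature T."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for [cat, lynx, dog]
logits = [5.0, 4.5, 2.0]

print([round(p, 3) for p in softmax(logits, T=1.0)])  # [0.604, 0.366, 0.03]
print([round(p, 3) for p in softmax(logits, T=4.0)])  # [0.425, 0.375, 0.201]
```

At T=1 the distribution is close to the "60% cat, 35% lynx, 5% dog" picture; at T=4 the tail classes get far more probability mass, which is exactly the extra signal the student trains on.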
Why it matters
A distilled model runs faster, costs less to serve, and fits on smaller hardware. DistilBERT, TinyBERT, and a long list of production models are distilled from larger teachers. In the LLM era, distillation is how you bake the behavior of a huge frontier model into a 7B open model you can actually afford to host.
Example
# teacher: huge model, frozen in eval mode
# student: small model being trained
# T: softening temperature; alpha: weight on the distillation term
soft_loss = KL(softmax(teacher_logits / T), softmax(student_logits / T)) * T**2
hard_loss = cross_entropy(student_logits, labels)
loss = alpha * soft_loss + (1 - alpha) * hard_loss
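A runnable sketch of that combined loss for a single example, in pure Python for clarity. Real training would use batched tensor ops (e.g. a framework's KL-divergence and cross-entropy functions); `distillation_loss` and its default T and alpha here are illustrative, not canonical:

```python
import math

def softmax(logits, T=1.0):
    """Logits to probabilities, softened by temperature T."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.7):
    """alpha * KL(teacher || student) * T^2  +  (1 - alpha) * hard-label CE."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between the two softened distributions
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # standard cross-entropy against the one-hot label, at T=1
    ce = -math.log(softmax(student_logits)[label])
    return alpha * kl * T**2 + (1 - alpha) * ce

teacher = [5.0, 4.5, 2.0]   # hypothetical teacher logits for [cat, lynx, dog]
print(distillation_loss([4.0, 3.0, 0.5], teacher, label=0))
print(distillation_loss(teacher, teacher, label=0))  # KL term vanishes
```

The T**2 factor keeps the gradient magnitude of the soft-target term roughly constant as T changes, so alpha stays meaningful when you tune the temperature.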
You’ll hear it when
- Shrinking a model for production serving.
- Training an on-device or edge model.
- Discussing “synthetic data” pipelines from a big model to a small one.
- Comparing distillation vs quantization vs pruning.