Knowledge Distillation

Distillation
MLOps

Training a smaller "student" model to mimic the outputs of a larger "teacher" so you get most of the quality at a fraction of the cost.


In one line

Train a small model to match the outputs (or intermediate representations) of a big model on the same inputs.

What it actually means

You run a teacher model on your training data and record its soft outputs — the full probability distribution, not just the top-1 label. The student is then trained to match that distribution, usually with a temperature-softened cross-entropy loss, sometimes combined with the original hard-label loss. The soft targets carry more signal than one-hot labels: if the teacher says “60% cat, 35% lynx, 5% dog”, the student learns that cats look like lynxes, which is a fact a one-hot label destroys.
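The temperature softening described above can be sketched in a few lines of pure Python (a minimal illustration; the class names and logit values are made up):

```python
import math

def softmax(logits, T=1.0):
    # Divide logits by temperature T before normalizing;
    # higher T flattens the distribution, exposing inter-class similarity.
    z = [l / T for l in logits]
    m = max(z)                              # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

teacher_logits = [5.0, 4.0, 1.0]            # hypothetical logits: cat, lynx, dog
sharp = softmax(teacher_logits, T=1.0)      # peaked: almost all mass on "cat"
soft = softmax(teacher_logits, T=4.0)       # softened: cat/lynx similarity visible
```

At T=1 the runner-up class is nearly invisible; at T=4 the "lynx" probability rises, which is exactly the extra signal the student trains on.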

Why it matters

A distilled model runs faster, costs less to serve, and fits on smaller hardware. DistilBERT, TinyBERT, and many other production models are distilled from larger teachers. In the LLM era, distillation is how you bake the behavior of a huge frontier model into a 7B open model you can actually afford to host.

Example

# teacher: large frozen model (eval mode); student: small model being trained
# F = torch.nn.functional; T = temperature; alpha weights the distillation term
soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                     F.softmax(teacher_logits / T, dim=-1),
                     reduction="batchmean") * T**2   # T**2 keeps gradient scale stable
loss = alpha * soft_loss + (1 - alpha) * F.cross_entropy(student_logits, labels)
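The same loss can be computed numerically for a single example in pure Python (a sketch with made-up logits; `T` and `alpha` values are illustrative defaults, not prescriptions):

```python
import math

def softmax(logits, T=1.0):
    m = max(l / T for l in logits)
    e = [math.exp(l / T - m) for l in logits]
    s = sum(e)
    return [v / s for v in e]

def distill_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.9):
    # KL(teacher || student) on temperature-softened distributions, scaled
    # by T**2, blended with ordinary cross-entropy on the hard label.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * kl * T**2 + (1 - alpha) * ce

loss = distill_loss([2.0, 1.0, 0.1], [5.0, 4.0, 1.0], label=0)
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains.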

You’ll hear it when

  • Shrinking a model for production serving.
  • Training an on-device or edge model.
  • Discussing “synthetic data” pipelines from a big model to a small one.
  • Comparing distillation vs quantization vs pruning.
