KL Divergence
Kullback–Leibler divergence: an asymmetric measure of how much one probability distribution differs from another — zero when they match, larger as they diverge.
In one line
How much extra information you need on average to describe samples from P using a code optimized for Q.
What it actually means
For discrete distributions, KL(P || Q) = sum_x P(x) log(P(x) / Q(x)). It’s zero when P equals Q and positive otherwise. It’s not a distance — it’s asymmetric, KL(P || Q) != KL(Q || P), and doesn’t satisfy the triangle inequality. Minimizing KL(P || Q) with respect to Q makes Q cover the support of P (mode-covering); minimizing KL(Q || P) makes Q lock onto one mode (mode-seeking). Both show up in ML.
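The formula and the asymmetry are easy to verify directly. A minimal sketch (the distributions P and Q here are made up for illustration):

```python
import math

def kl(p, q):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    # Terms with P(x) = 0 contribute nothing; assumes Q(x) > 0 wherever P(x) > 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.4, 0.1]
Q = [0.8, 0.1, 0.1]

kl(P, P)  # 0.0 — zero when the distributions match
kl(P, Q)  # positive
kl(Q, P)  # positive, but different from kl(P, Q): not symmetric
```

Swapping the arguments changes the value, which is exactly why the two minimization directions (mode-covering vs mode-seeking) behave differently.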
Why it matters
Half the ML losses you use are KL in disguise. Cross-entropy loss is KL plus a constant. Variational autoencoders regularize the latent code with a KL term to a prior. RLHF’s PPO objective uses a KL penalty to keep the fine-tuned model near the base model. Knowledge distillation uses KL between teacher and student distributions.
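The "cross-entropy is KL plus a constant" claim follows from the identity H(P, Q) = H(P) + KL(P || Q): since H(P) doesn't depend on the model's Q, minimizing cross-entropy and minimizing KL are the same optimization. A quick numerical check (P and Q are arbitrary example distributions):

```python
import math

P = [0.7, 0.2, 0.1]  # "true" distribution (e.g. labels)
Q = [0.5, 0.3, 0.2]  # model's predicted distribution

entropy = -sum(p * math.log(p) for p in P)                   # H(P): constant w.r.t. the model
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q): the usual loss
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))          # KL(P || Q)

# H(P, Q) = H(P) + KL(P || Q) holds up to floating-point error
assert abs(cross_entropy - (entropy + kl)) < 1e-9
```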
Example
import torch.nn.functional as F

# Distillation loss with temperature T. Note the argument order:
# F.kl_div(input, target) computes KL(target || input), so with the
# student's log-probs as input this is KL(teacher || student).
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),  # input: log-probabilities
    F.softmax(teacher_logits / T, dim=-1),      # target: probabilities
    reduction="batchmean",
) * (T ** 2)  # T^2 rescales gradients to match the hard-label loss scale
You’ll hear it when
- Reading any paper on distillation, VAEs, or RLHF.
- Deriving cross-entropy from first principles in an interview.
- Debating mode-covering vs mode-seeking training.
- Tuning the KL coefficient in PPO or DPO.