KL Divergence

Kullback–Leibler divergence
Math

An asymmetric measure of how much one probability distribution differs from another — zero when they match, larger as they diverge.


In one line

How much extra information you need on average to describe samples from P using a code optimized for Q.

What it actually means

For discrete distributions, KL(P || Q) = sum_x P(x) log(P(x) / Q(x)). It’s zero when P equals Q and positive otherwise. It’s not a distance — it’s asymmetric, KL(P || Q) != KL(Q || P), and doesn’t satisfy the triangle inequality. Minimizing KL(P || Q) with respect to Q makes Q cover the support of P (mode-covering); minimizing KL(Q || P) makes Q lock onto one mode (mode-seeking). Both show up in ML.
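The definition and the asymmetry are easy to check numerically. A minimal sketch in plain Python, with two made-up three-outcome distributions:

```python
import math

# Two made-up discrete distributions over the same three outcomes
P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

def kl(p, q):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

kl(P, P)  # 0.0 — zero when the distributions match
kl(P, Q)  # positive, and not equal to kl(Q, P): KL is asymmetric
kl(Q, P)
```

Swapping the arguments changes the answer because the expectation is taken under the first distribution — that's the source of the mode-covering vs mode-seeking difference.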

Why it matters

Half the ML losses you use are KL in disguise. Cross-entropy loss is KL(P || Q) plus the entropy of P — a constant with respect to the model, so minimizing one minimizes the other. Variational autoencoders regularize the latent code with a KL term to a prior. RLHF’s PPO objective uses a KL penalty to keep the fine-tuned model near the base model. Knowledge distillation uses KL between teacher and student distributions.
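The cross-entropy relationship can be verified directly: H(P, Q) = H(P) + KL(P || Q). A quick check in plain Python, with made-up distributions:

```python
import math

P = [0.7, 0.2, 0.1]   # "true" label distribution (made up)
Q = [0.5, 0.3, 0.2]   # model's predicted distribution (made up)

cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))
entropy = -sum(p * math.log(p) for p in P)
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

# H(P, Q) = H(P) + KL(P || Q): the gap between cross-entropy
# and the entropy of P is exactly the KL divergence
gap = cross_entropy - entropy
```

Since H(P) doesn't depend on the model, training on cross-entropy and training on KL(P || Q) produce the same gradients.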

Example

import torch
import torch.nn.functional as F

T = 2.0  # distillation temperature
student_logits = torch.randn(8, 10)  # placeholder logits
teacher_logits = torch.randn(8, 10)

# KL(teacher || student) with temperature T; F.kl_div expects
# log-probabilities as input and probabilities as target
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)  # T^2 keeps gradient magnitudes comparable across temperatures

You’ll hear it when

  • Reading any paper on distillation, VAEs, or RLHF.
  • Deriving cross-entropy from first principles in an interview.
  • Debating mode-covering vs mode-seeking training.
  • Tuning the KL coefficient in PPO or DPO.
