Perplexity
PPL
In one line
An intrinsic LLM metric — the exponentiated average negative log-likelihood the model assigns to held-out text. Lower is better.
What it actually means
Perplexity measures how surprised a language model is by a test corpus. Mathematically it’s exp(mean(-log p(token | context))). Intuitively, a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options for each token. It’s a clean training-time signal because you can compute it on raw text — no labels, no graders, no taste — but it only compares fairly between models that share a tokenizer.
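The "uniform among 20 options" intuition can be checked directly: if a model assigns probability 1/k to every token, its perplexity is exactly k. A minimal sketch (the `perplexity` helper is illustrative, not from any library):

```python
import math

def perplexity(logprobs):
    # exp of the mean negative log-likelihood per token:
    # PPL = exp(-(1/N) * sum(log p))
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that is uniformly uncertain among k = 20 options at every
# step assigns log p = log(1/20) to each token, so PPL comes out to 20.
k = 20
uniform_logprobs = [math.log(1 / k)] * 100
print(perplexity(uniform_logprobs))  # → 20.0 (up to float rounding)
```

Because the mean is taken per token, the same function works for any corpus length; only the average log-probability matters.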
Why it matters
Perplexity is what you watch during pretraining and what scaling-law papers plot. It tracks model quality reasonably well at scale but stops being a useful product metric: a model with great perplexity can still hallucinate, refuse, or write robotic prose. For anything user-facing, you need task-level evals on top.
Example
PPL = exp(- (1/N) * Σ log p(x_i | x_<i) )
If a model assigns log-probability -2.3 on average per token, PPL = exp(2.3) ≈ 9.97.
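The arithmetic above is a one-liner; a short sketch reproducing it, with a varied list of per-token log-probabilities that happens to average −2.3 (the values are made up for illustration):

```python
import math

# Hypothetical per-token log-probabilities whose mean is -2.3
logprobs = [-1.8, -2.3, -2.8]
avg = sum(logprobs) / len(logprobs)   # -2.3
ppl = math.exp(-avg)                  # exp(2.3)
print(round(ppl, 2))  # → 9.97
```

Note the sign: log-probabilities are negative, so PPL is exp of their negated mean, which is why lower (less negative) average log-probability means lower perplexity.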
You’ll hear it when
- Reading pretraining or scaling-law papers.
- Comparing two checkpoints from the same training run.
- Evaluating a quantized model against its full-precision baseline.
- Diagnosing distribution shift on a domain corpus.
- Explaining why a great PPL number doesn’t translate to a great chatbot.