Perplexity

PPL
Eval


In one line

An intrinsic LLM metric — the exponentiated average negative log-likelihood the model assigns to held-out text. Lower is better.

What it actually means

Perplexity measures how surprised a language model is by a test corpus. Mathematically it’s exp(mean(-log p(token | context))). Intuitively, a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options for each token. It’s a clean training-time signal because you can compute it on raw text — no labels, no graders, no taste — but it only compares fairly between models that share a tokenizer.
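The definition above can be sketched directly: given per-token log-probabilities from a model's forward pass, perplexity is the exponential of the mean negative log-likelihood. A minimal sketch (the function name and inputs are illustrative, not from any particular library):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities.

    logprobs: one log p(token | context) value per token, as a model
    would report them (all <= 0 for proper probabilities).
    """
    n = len(logprobs)
    avg_nll = -sum(logprobs) / n  # mean negative log-likelihood
    return math.exp(avg_nll)

# The "uniform among 20 options" intuition: a model that assigns
# probability 1/20 to every token has perplexity exactly 20.
uniform_20 = [math.log(1 / 20)] * 100
print(round(perplexity(uniform_20), 4))  # 20.0
```

Note that real evaluation code averages over all tokens in the corpus, and that the result depends on the tokenization: a model with a coarser tokenizer packs more information into each token, which is why cross-tokenizer comparisons are unfair.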

Why it matters

Perplexity is what you watch during pretraining and what scaling-law papers plot. It tracks model quality reasonably well at scale but stops being a useful product metric: a model with great perplexity can still hallucinate, refuse, or write robotic prose. For anything user-facing, you need task-level evals on top.

Example

PPL = exp(- (1/N) * Σ log p(x_i | x_<i) )

If a model assigns an average log-probability of -2.3 per token, PPL = exp(2.3) ≈ 9.97.

You’ll hear it when

  • Reading pretraining or scaling-law papers.
  • Comparing two checkpoints from the same training run.
  • Evaluating a quantized model against its full-precision baseline.
  • Diagnosing distribution shift on a domain corpus.
  • Explaining why a great PPL number doesn’t translate to a great chatbot.

Related terms