F1 Score

F1
Eval

In one line

The harmonic mean of precision and recall — a single number that punishes you for being lopsided on either.

What it actually means

F1 = 2 * P * R / (P + R). Because it's the harmonic mean, F1 collapses fast if either precision or recall is low: a model with P = 1.0 and R = 0.05 has F1 ≈ 0.10, not the ≈ 0.53 an arithmetic mean would suggest. F1 weights the two equally. If you care about one more than the other, use the general F_beta = (1 + beta²) * P * R / (beta² * P + R), where beta > 1 favours recall and beta < 1 favours precision.
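A minimal sketch of the formula above (the function name `fbeta` is my own; beta = 1 recovers plain F1):

```python
def fbeta(p, r, beta=1.0):
    """General F_beta: beta > 1 favours recall, beta < 1 favours precision."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(round(fbeta(1.0, 0.05), 3))       # 0.095 — the harmonic mean collapses
print(round(fbeta(0.7, 0.78), 3))       # 0.738 — plain F1
print(round(fbeta(0.7, 0.78, beta=2), 3))  # 0.763 — F2 leans toward recall
```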

Why it matters

F1 is the default summary stat for imbalanced classification, especially in NER, information retrieval, and many medical tasks. It's not always the right choice — averaging F1 across classes (macro, micro, or weighted) can tell very different stories about the same model — but it's a single, comparable number that, unlike accuracy, can't be inflated by always predicting the majority class.
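To see how the averaging choice changes the story, here is a from-scratch sketch on a made-up three-class toy set: macro-F1 averages per-class F1 scores (every class counts equally), while micro-F1 pools the counts (frequent classes dominate).

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Per-class F1 from pooled TP/FP/FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but it was wrong
            fn[t] += 1   # missed an instance of t
    classes = sorted(set(y_true) | set(y_pred))
    f1s = {c: 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes}
    return f1s, tp, fp, fn

# Hypothetical labels: class "a" dominates, "b" and "c" are rare.
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "a", "c", "a"]

f1s, tp, fp, fn = per_class_f1(y_true, y_pred)
macro = sum(f1s.values()) / len(f1s)
TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
micro = 2 * TP / (2 * TP + FP + FN)
print(round(macro, 3), round(micro, 3))  # 0.645 0.7 — the rare classes drag macro down
```

Micro-F1 here equals plain accuracy (7/10 correct), which is typical for single-label multi-class tasks; macro-F1 is lower because the rare classes are handled worse.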

Example

precision = 0.70
recall    = 0.78
F1        = 2 * 0.70 * 0.78 / (0.70 + 0.78) ≈ 0.738
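The same example can be reached directly from counts via the identity F1 = 2·TP / (2·TP + FP + FN); the counts below are hypothetical, chosen to reproduce P = 0.70 and R = 0.78 exactly:

```python
# Hypothetical counts that give P = 0.70 and R = 0.78 exactly.
tp, fp, fn = 273, 117, 77

precision = tp / (tp + fp)   # 273 / 390 = 0.70
recall    = tp / (tp + fn)   # 273 / 350 = 0.78
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.738 — matches 2*TP / (2*TP + FP + FN) = 546/740
```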

You’ll hear it when

  • Reporting classification results in a PR or paper.
  • Comparing models on a leaderboard (GLUE, SuperGLUE, BEIR).
  • Picking macro vs micro F1 for a multi-class task.
  • Defending why accuracy is the wrong number for an imbalanced dataset.
  • Evaluating NER, span extraction, or token classification.

Related terms