F1 Score
In one line
The harmonic mean of precision and recall — a single number that punishes you for being lopsided on either.
What it actually means
F1 = 2 * P * R / (P + R). Because it's the harmonic mean, F1 collapses fast if either precision or recall is low: a model with P = 1.0 and R = 0.05 has F1 ≈ 0.10, not the 0.525 an arithmetic mean would suggest. F1 weights the two equally; if you care about one more than the other, use the general F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta > 1 to favour recall and beta < 1 to favour precision.
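A minimal sketch of these formulas in Python (the function name is illustrative):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F_beta: beta > 1 favours recall, beta < 1 favours precision.
    F1 is the beta = 1 special case, 2PR / (P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The harmonic mean collapses toward the weaker of the two numbers:
print(round(f_beta(1.0, 0.05), 3))            # ~0.095, despite perfect precision
# With P = 0.9, R = 0.5, raising beta pulls the score toward the (lower) recall:
print(round(f_beta(0.9, 0.5, beta=2.0), 3))
```

Note the explicit zero guard: when both precision and recall are zero the formula would divide by zero, and the usual convention is to report F1 = 0.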
Why it matters
F1 is the default summary stat for imbalanced classification, especially in NER, information retrieval, and many medical tasks. It’s not always the right thing — averaging F1 across classes (macro vs micro vs weighted) gives different stories — but it’s a single, comparable number that nobody can game by guessing the majority class.
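A toy illustration of how macro and micro averaging can tell different stories on an imbalanced dataset (the labels and counts below are made up):

```python
def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """Pools TP/FP/FN over all classes; for single-label multi-class
    classification this reduces to plain accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# 9 examples of class "a", 1 of class "b"; the model predicts "a" every time.
y_true = ["a"] * 9 + ["b"]
y_pred = ["a"] * 10
print(round(macro_f1(y_true, y_pred), 3), micro_f1(y_true, y_pred))
```

Micro F1 looks healthy (0.9) because the majority class dominates the pooled counts, while macro F1 (~0.47) exposes the completely missed minority class.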
Example
precision = 0.70
recall = 0.78
F1 = 2 * 0.70 * 0.78 / (0.70 + 0.78) ≈ 0.738
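The same arithmetic starting from confusion-matrix counts rather than ratios (the TP/FP/FN values here are hypothetical, chosen to reproduce the precision and recall above exactly):

```python
tp, fp, fn = 273, 117, 77   # hypothetical counts: 273/390 = 0.70, 273/350 = 0.78

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))  # 0.7 0.78 0.738
```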
You’ll hear it when
- Reporting classification results in a PR or paper.
- Comparing models on a leaderboard (GLUE, SuperGLUE, BEIR).
- Picking macro vs micro F1 for a multi-class task.
- Defending why accuracy is the wrong number for an imbalanced dataset.
- Evaluating NER, span extraction, or token classification.