Statistics for ML Engineers

The statistics you need to trust your eval numbers — hypothesis testing, confidence intervals, and the fact that one run is never enough.

Mathematics intermediate #math #statistics #hypothesis-testing #evaluation
Prereqs: Basic probability

Why you need this

You’ll spend half your ML life running experiments. Without statistics, you can’t tell the difference between “our new model is better” and “we got lucky on one eval run.” This is the difference between shipping a real improvement and shipping noise.

Core ideas

  • Sampling and populations — what your eval set represents, and what it doesn’t.
  • Mean, median, variance, std — don’t just compute, understand which one to report and why.
  • Standard error of the mean — std / sqrt(n). This is the single most-ignored number in ML papers.
  • Confidence intervals — when someone says “our model scores 87.3”, they usually mean “87.3 ± something we didn’t bother computing.”
  • Hypothesis testing — null vs alternative, p-values, and why p < 0.05 is a convention, not a law of physics.
  • Paired vs unpaired tests — use paired when you’re comparing two models on the same eval items (almost always).
  • Multiple testing correction — if you try 20 prompts and one looks 5% better, it’s probably chance.
  • Bootstrap resampling — the lazy statistician’s confidence interval. Works when you can’t do the math.
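The standard error and a normal-approximation confidence interval take only a few lines. A minimal sketch (the scores are hypothetical per-example results, 1 = correct):

```python
import math
import random

# Hypothetical per-example scores from one eval run (1 = correct, 0 = wrong).
random.seed(0)
scores = [1 if random.random() < 0.873 else 0 for _ in range(200)]

n = len(scores)
mean = sum(scores) / n
# Sample variance (with Bessel's correction), then SEM = std / sqrt(n).
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
sem = math.sqrt(var) / math.sqrt(n)

# Normal-approximation 95% confidence interval: mean ± 1.96 * SEM.
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"{mean:.3f} ± {1.96 * sem:.3f}  (95% CI: [{lo:.3f}, {hi:.3f}])")
```

Note how wide the interval is at n = 200: for accuracy-style metrics the SEM shrinks only as 1/sqrt(n), so quadrupling the eval set merely halves the error bar.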

The three mistakes everyone makes

  1. Reporting one number, no variance. “Our model scores 0.847.” On what seed? Across how many runs? What’s the std?
  2. Evaluating on the training set by accident. Data leakage through near-duplicates, time-based contamination, or reusing the val set as test.
  3. Chasing p-values without effect size. A statistically significant 0.001 improvement on a benchmark nobody cares about is not a win.

A concrete example

You compare prompts A and B on 200 examples. A scores 72%, B scores 75%. Is B actually better?

  • Paired bootstrap: resample the 200 examples 10,000 times with replacement, score both prompts on each resample, measure the distribution of (B - A).
  • If the middle 95% of those differences is entirely positive, you’ve got a real signal.
  • If the CI crosses zero, you have noise. Get more eval data or accept you don’t know.
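The paired bootstrap above fits in a dozen lines. A sketch with synthetic data standing in for the two prompts' per-item results (the key point: resample items, so each draw keeps A's and B's scores on the same example paired):

```python
import random

random.seed(42)
n = 200

# Hypothetical paired per-item results: item i was scored under both prompts.
a = [1 if random.random() < 0.72 else 0 for _ in range(n)]
# B agrees with A on most items but flips a few in its favour.
b = [s if random.random() < 0.9 else 1 for s in a]

def paired_bootstrap(a, b, iters=10_000):
    diffs = []
    for _ in range(iters):
        # Resample items with replacement; the pairing is preserved
        # because the same index selects both prompts' scores.
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(b[i] for i in idx) / n - sum(a[i] for i in idx) / n)
    diffs.sort()
    # 95% percentile CI for the mean difference (B - A).
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

lo, hi = paired_bootstrap(a, b)
print(f"95% CI for (B - A): [{lo:.3f}, {hi:.3f}]")
if lo > 0:
    print("B is reliably better")
else:
    print("CI crosses zero: noise, or not enough eval data")
```

Pairing matters because both prompts see the same items: per-item difficulty cancels out of the difference, so the paired CI is much tighter than one computed from two independent samples.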

Where it shows up

  • Benchmarks, leaderboards, paper tables
  • A/B tests in production
  • Model comparison dashboards
  • RAG eval with Ragas
  • Any sentence containing “improved by X%”