Probability for ML

The probability you actually need to reason about ML models — not the textbook full course, just the parts that show up in loss functions, sampling, and evaluation.

Mathematics beginner #math #probability #distributions #bayes
Prereqs: High school algebra

Why you need this

Every ML model you train is a probability distribution in disguise. Cross-entropy loss is the negative log-likelihood of the right class. Temperature sampling in LLMs is literally rescaling a probability distribution before you draw from it. If you don’t understand probability, you misread every eval metric and every model output.
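To make the cross-entropy claim concrete: for a single example, the loss is just the negative log of the probability the model assigned to the true class. A minimal sketch (the probabilities here are made up, not from a real model):

```python
import math

# Model's softmax output over 3 classes (made-up numbers).
probs = [0.1, 0.7, 0.2]
true_class = 1

# Cross-entropy for one example = negative log-likelihood of the right class.
loss = -math.log(probs[true_class])
print(round(loss, 4))  # -log(0.7) ≈ 0.3567
```

The more probability mass the model puts on the correct class, the smaller the loss — that's the whole objective.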

What to actually learn

  • Random variables and distributions — discrete vs continuous, PMF vs PDF. Get comfortable with Bernoulli, Binomial, Normal, and Categorical.
  • Expectation and variance — what they mean, how they compose.
  • Conditional probability and Bayes’ rule — P(A|B) = P(B|A) P(A) / P(B). This is the foundation of Naive Bayes, Bayesian inference, and many probabilistic eval methods.
  • Joint and marginal distributions — what happens when you have two variables and only care about one.
  • Independence and covariance — why “correlation is not causation” is more than a meme.
  • MLE vs MAP — why maximum likelihood estimation is the default training objective and when you’d pick a prior.
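Two of the bullets above — Bayes’ rule and MLE — fit in a few lines of code. A toy sketch with invented numbers (the spam/word probabilities are illustrative, not from any real dataset):

```python
# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word).
# All four input numbers below are made up for illustration.
p_word_given_spam = 0.6
p_spam = 0.2
p_word_given_ham = 0.05
p_ham = 0.8

# Marginalize to get P(word), then apply Bayes' rule.
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75

# MLE for a Bernoulli parameter is just the sample mean: the theta
# that maximizes the likelihood of the observed coin flips.
flips = [1, 1, 0, 1, 0, 1, 1, 0]
theta_mle = sum(flips) / len(flips)
print(theta_mle)  # 0.625
```

Adding a prior on theta and maximizing the posterior instead of the likelihood is exactly the MLE-to-MAP step from the last bullet.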

What to skip (for now)

Measure theory. Sigma algebras. Any chapter that starts with “Let Ω be a probability space.”

A concrete example

A spam classifier outputs 0.92 for an email. That’s not “92% confident it’s spam” in the colloquial sense — it’s the model’s estimate of P(spam | email features). If you want calibrated probabilities (the predicted 0.92 should actually mean spam 92% of the time across all 0.92-scored emails), you need to check calibration separately. Most deep models are over-confident out of the box.
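The calibration check described above can be sketched in a few lines: bucket predictions by score and compare the bucket’s average confidence against its actual positive rate. The predictions here are toy data, not a real model:

```python
# (score, true_label) pairs — made-up toy data for illustration.
preds = [(0.92, 1), (0.92, 1), (0.92, 0), (0.15, 0), (0.15, 0), (0.88, 1)]

def bin_calibration(pairs, lo, hi):
    """Average confidence vs. actual positive rate for scores in [lo, hi)."""
    in_bin = [(p, y) for p, y in pairs if lo <= p < hi]
    if not in_bin:
        return None
    avg_conf = sum(p for p, _ in in_bin) / len(in_bin)
    frac_pos = sum(y for _, y in in_bin) / len(in_bin)
    return avg_conf, frac_pos

conf, acc = bin_calibration(preds, 0.8, 1.0)
print(round(conf, 2), round(acc, 2))  # 0.91 0.75
```

Here the model says ~0.91 on average in the top bin but only 3 of 4 are actually positive — the over-confidence gap you measure with a reliability diagram.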

Where it shows up in your day job

  • Cross-entropy loss in PyTorch
  • Temperature and top-p sampling in LLMs
  • BLEU, ROUGE, and perplexity
  • Bayesian A/B tests
  • Uncertainty quantification in production (e.g., “this RAG answer has low retrieval confidence”)
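Temperature, from the second bullet, is one division before the softmax. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for three tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
# Low temperature concentrates mass on the top token: sharp[0] > flat[0].
```

Top-p sampling then truncates this distribution to the smallest set of tokens whose cumulative probability exceeds p, and renormalizes before drawing.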