Probability for ML
The probability you actually need to reason about ML models — not the full textbook course, just the parts that show up in loss functions, sampling, and evaluation.
Why you need this
Every ML model you train is a probability distribution in disguise. Cross-entropy loss is the negative log-likelihood of the correct class. Temperature sampling in LLMs rescales logits before the softmax — a direct manipulation of a probability distribution. When you don’t understand probability, you misread every eval metric and every model output.
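Both claims above can be made concrete in a few lines. This is a minimal plain-Python sketch (the logits are made-up numbers for illustration), not how a framework like PyTorch implements it internally:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Cross-entropy loss = negative log of the probability assigned to the true class."""
    probs = softmax(logits)
    return -math.log(probs[true_class])

logits = [2.0, 0.5, -1.0]          # hypothetical model outputs for 3 classes
print(cross_entropy(logits, 0))    # small loss: class 0 got the highest logit
print(cross_entropy(logits, 2))    # large loss: class 2 got the lowest logit
print(softmax(logits, temperature=0.5))  # sharper than T=1
print(softmax(logits, temperature=2.0))  # flatter: more diverse sampling
```

The loss is literally "how surprised the model was by the right answer," and temperature only changes how peaked the sampling distribution is, never its ranking.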
What to actually learn
- Random variables and distributions — discrete vs continuous, PMF vs PDF. Get comfortable with Bernoulli, Binomial, Normal, and Categorical.
- Expectation and variance — what they mean, how they compose.
- Conditional probability and Bayes’ rule — P(A|B) = P(B|A) P(A) / P(B). This is the whole foundation of Naive Bayes, Bayesian inference, and most modern eval frameworks.
- Joint and marginal distributions — what happens when you have two variables and only care about one.
- Independence and covariance — why “correlation is not causation” is more than a meme.
- MLE vs MAP — why maximum likelihood estimation is the default training objective and when you’d pick a prior.
What to skip (for now)
Measure theory. Sigma algebras. Any chapter that starts with “Let Ω be a probability space.”
A concrete example
A spam classifier outputs 0.92 for an email. That’s not “92% confident it’s spam” in the colloquial sense — it’s the model’s estimate of P(spam | email features). If you want calibrated probabilities (the predicted 0.92 should actually mean spam 92% of the time across all 0.92-scored emails), you need to check calibration separately. Most deep models are over-confident out of the box.
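Checking calibration is mostly bookkeeping: bin predictions by confidence and compare the mean predicted probability in each bin with the actual positive rate. A minimal sketch with toy data (real libraries like scikit-learn offer `calibration_curve` for this):

```python
from collections import defaultdict

def calibration_bins(preds, labels, n_bins=10):
    """Group predictions into confidence bins; in each bin, compare the mean
    predicted probability with the empirical fraction of positives."""
    bins = defaultdict(list)
    for p, y in zip(preds, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    report = {}
    for b, items in sorted(bins.items()):
        mean_pred = sum(p for p, _ in items) / len(items)
        frac_pos = sum(y for _, y in items) / len(items)
        report[b] = (mean_pred, frac_pos, len(items))
    return report

# toy data: in a calibrated model, the ~0.9 bin should be ~90% positives
preds  = [0.92, 0.95, 0.91, 0.15, 0.10, 0.88]
labels = [1,    1,    0,    0,    0,    1]
for b, (mp, fp, n) in calibration_bins(preds, labels).items():
    print(f"bin {b}: mean pred {mp:.2f}, actual positive rate {fp:.2f}, n={n}")
```

With real data you would plot this as a reliability diagram; a well-calibrated model sits on the diagonal, and an over-confident one sags below it.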
Where it shows up in your day job
- Cross-entropy loss in PyTorch
- Temperature and top-p sampling in LLMs
- BLEU, ROUGE, and perplexity
- Bayesian A/B tests
- Uncertainty quantification in production (e.g., “this RAG answer has low retrieval confidence”)
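One of those day-job metrics, perplexity, is nothing but probability repackaged: the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch with made-up token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens).
    Lower is better; perplexity k means the model is roughly as uncertain as a
    uniform choice among k tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0: uniform over 4 options
print(perplexity([0.9, 0.8, 0.95]))          # near 1: a confident model
```

This is the same negative log-likelihood as cross-entropy loss, just averaged over tokens and exponentiated — which is why "lower perplexity" and "lower loss" are the same claim.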