Logistic Regression
The simplest useful classifier, and a surprisingly strong baseline for most problems. Understand it deeply and you understand half of classical ML.
Why this is the first classifier to learn
Logistic regression is linear regression’s classification cousin. It’s fast, interpretable, scales to huge datasets, and is a strong baseline on tabular data. If your fancy gradient-boosted model only beats logistic regression by 1%, you probably don’t need the fancy model.
The mental model
For a binary classification problem, logistic regression learns a weight vector w and a bias b. For a new input x:
- Compute a linear score: z = w · x + b
- Squash it through the sigmoid: p = 1 / (1 + e^(-z))
- If p > 0.5, predict class 1.
That’s it. The whole model is w and b.
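The recipe above fits in a few lines. A minimal sketch, with made-up toy weights just for illustration:

```python
import math

def predict_proba(w, b, x):
    """Linear score z = w . x + b, then sigmoid squash: p = 1 / (1 + e^(-z))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, threshold=0.5):
    """Predict class 1 if the probability clears the threshold."""
    return 1 if predict_proba(w, b, x) > threshold else 0

# Toy parameters (illustrative, not fitted to anything)
w, b = [2.0, -1.0], 0.5
p = predict_proba(w, b, [1.0, 1.0])  # z = 2.0 - 1.0 + 0.5 = 1.5
```

The entire trained model really is just the list `w` and the scalar `b`.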
What it learns
w is a vector of weights, one per feature. A large positive w[i] means feature i pushes the prediction toward class 1. A large negative one pushes toward class 0. A weight near zero means the feature doesn’t matter.
This is why logistic regression is interpretable: you can literally read off which features matter and in which direction.
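"Reading off" the weights is as simple as sorting them by magnitude. A sketch with hypothetical fitted weights for a spam classifier (the feature names and values are invented for illustration):

```python
# Hypothetical fitted weights (class 1 = spam); values are illustrative only
weights = {"num_links": 1.8, "sender_known": -2.3, "word_count": 0.05}

# Rank features by |weight|: the largest magnitudes matter most.
ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
# sender_known (-2.3) pushes toward class 0, num_links (+1.8) toward class 1,
# and word_count (~0) barely matters.
```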
How it’s trained
Maximum likelihood. The loss is cross-entropy (also called log loss):
L = -Σ [y log(p) + (1-y) log(1-p)]
Minimized with gradient descent. Because the loss surface is convex, gradient descent finds the global optimum — no local minima to worry about. This is a property you lose the moment you add a hidden layer.
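The training loop is short enough to write by hand. A sketch of batch gradient descent on the mean cross-entropy, using NumPy and a tiny synthetic dataset (the learning rate and step count are arbitrary choices, not tuned values):

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Minimize mean cross-entropy by gradient descent. Because the loss is
    convex, this converges toward the global optimum. X: (n, d), y in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n              # dL/dw for the mean log loss
        grad_b = np.mean(p - y)                 # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: the label is determined by the sign of the first feature.
X = np.array([[-2.0, 1.0], [-1.0, -1.0], [1.0, 0.5], [2.0, -0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logreg(X, y)
# w[0] comes out positive: feature 0 pushes predictions toward class 1.
```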
The knobs that matter
- Regularization — L1 encourages sparse weights (feature selection baked in); L2 shrinks all weights smoothly. Most libraries default to L2. In scikit-learn: penalty='l1' or 'l2', and tune C (the inverse regularization strength).
- Class imbalance — if your positive class is 1% of the data, the default fit will predict "no" for everything and claim 99% accuracy. Use class_weight='balanced' or resample.
- Feature scaling — L2 regularization is sensitive to feature scale. Standardize your features before fitting.
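All three knobs in one place, as a scikit-learn sketch (assumes scikit-learn is installed; the data is synthetic, generated only so the snippet runs end to end):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with wildly different feature scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 0.01]
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

clf = make_pipeline(
    StandardScaler(),              # scale first: the L2 penalty is scale-sensitive
    LogisticRegression(
        penalty="l2",              # the default; 'l1' needs solver='liblinear' or 'saga'
        C=1.0,                     # inverse regularization strength: smaller C = stronger penalty
        class_weight="balanced",   # reweight classes for imbalanced data
    ),
)
clf.fit(X, y)
acc = clf.score(X, y)  # training accuracy on this separable toy set
```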
When it wins
- Tabular data with a few thousand to a few million rows
- You need interpretability (medical, finance, hiring — regulated use cases)
- You need fast inference (LR scores a row in microseconds)
- You need a baseline to beat
When it loses
- Non-linear decision boundaries (use trees or DL)
- Strong feature interactions (use gradient boosting)
- Images, audio, text (use DL)
The interview version
“Logistic regression is a linear model that predicts the log-odds of a binary outcome. It uses the sigmoid function to squash the linear score to a probability, and it’s trained by minimizing cross-entropy loss. It’s my default baseline because it’s fast, convex, and interpretable.”
Say that and you’ve cleared the bar on 90% of ML screens.