Logistic Regression

The simplest useful classifier, and a surprisingly strong baseline for most problems. Understand it deeply and you understand half of classical ML.

Classical ML beginner #classical-ml #classification #baseline #interview
Prereqs: Linear algebra, calculus basics

Why this is the first classifier to learn

Logistic regression is linear regression’s classification cousin. It’s fast, interpretable, scales to huge datasets, and is a strong baseline on tabular data. If your fancy gradient-boosted model only beats logistic regression by 1%, you probably don’t need the fancy model.

The mental model

For a binary classification problem, logistic regression learns a weight vector w and a bias b. For a new input x:

  1. Compute a linear score: z = w · x + b
  2. Squash through sigmoid: p = 1 / (1 + e^(-z))
  3. If p > 0.5, predict class 1.

That’s it. The whole model is w and b.
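The three steps above can be sketched in a few lines of NumPy. The weights here are hand-picked for illustration, not fitted:

```python
import numpy as np

def predict(w, b, x):
    """Score one input with a logistic regression defined by (w, b)."""
    z = np.dot(w, x) + b            # 1. linear score
    p = 1.0 / (1.0 + np.exp(-z))    # 2. sigmoid squashes z into (0, 1)
    return 1 if p > 0.5 else 0      # 3. threshold at 0.5

# Toy example: 2 features, hand-picked weights
w = np.array([2.0, -1.0])
b = 0.5
print(predict(w, b, np.array([1.0, 0.0])))  # z = 2.5, p ≈ 0.92 → prints 1
```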

What it learns

w is a vector of weights, one per feature. A large positive w[i] means feature i pushes the prediction toward class 1. A large negative one pushes toward class 0. A weight near zero means the feature doesn’t matter.

This is why logistic regression is interpretable: you can literally read off which features matter and in which direction.
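A quick sketch of that interpretability, using scikit-learn on synthetic data (the feature names "signal" and "noise" are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: feature 0 determines the label, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
for name, weight in zip(["signal", "noise"], model.coef_[0]):
    print(f"{name}: {weight:+.2f}")
```

The "signal" weight comes out large and positive; the "noise" weight sits near zero, exactly the read-off described above.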

How it’s trained

Maximum likelihood. The loss is cross-entropy (also called log loss):

L = -Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]

Minimized with gradient descent. Because the loss surface is convex, gradient descent finds the global optimum — no local minima to worry about. This is a property you lose the moment you add a hidden layer.
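A minimal sketch of that training loop, written from the loss above rather than any library internals. It uses the standard gradient of the log loss, which simplifies to Xᵀ(p − y):

```python
import numpy as np

def log_loss(y, p):
    """Cross-entropy summed over examples; eps guards against log(0)."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, lr=0.1, steps=1000):
    """Batch gradient descent on the convex log-loss surface."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n   # gradient of the loss w.r.t. w
        grad_b = np.sum(p - y) / n   # gradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Because the surface is convex, this plain loop converges to the same optimum regardless of starting point; libraries just use faster solvers.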

The knobs that matter

  • Regularization — L1 encourages sparse weights (feature selection baked in); L2 shrinks all weights smoothly. Most libraries default to L2. In scikit-learn: penalty='l1' or 'l2', and tune C (inverse regularization strength).
  • Class imbalance — if your positive class is 1% of data, the default will predict “no” for everything and claim 99% accuracy. Use class_weight='balanced' or resample.
  • Feature scaling — L2 regularization is sensitive to feature scale. Standardize your features before fitting.
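The scaling and imbalance knobs above compose naturally in a scikit-learn pipeline. A sketch, with toy imbalanced data standing in for a real dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize first so L2 regularization treats all features comparably;
# class_weight='balanced' reweights the loss for a skewed positive class.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced"),
)

# Toy imbalanced data: positives are rare (roughly the top tail of feature 0)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 1.5).astype(int)
clf.fit(X, y)
```

Without class_weight='balanced', a heavily skewed dataset tempts the model toward the all-negative prediction the bullet above warns about.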

When it wins

  • Tabular data with a few thousand to a few million rows
  • You need interpretability (medical, finance, hiring — regulated use cases)
  • You need fast inference (LR scores a row in microseconds)
  • You need a baseline to beat

When it loses

  • Non-linear decision boundaries (use trees or DL)
  • Strong feature interactions (use gradient boosting)
  • Images, audio, text (use DL)

The interview version

“Logistic regression is a linear model that predicts the log-odds of a binary outcome. It uses the sigmoid function to squash the linear score to a probability, and it’s trained by minimizing cross-entropy loss. It’s my default baseline because it’s fast, convex, and interpretable.”

Say that and you’ve cleared the bar on 90% of ML screens.