Gradient Boosting (XGBoost, LightGBM, CatBoost)
The winner on tabular data for the last decade. How it works, which library to pick, and the three mistakes everyone makes tuning it.
Why you keep hearing about this
On tabular data, gradient-boosted decision trees still beat deep learning in almost every head-to-head. Kaggle tabular competitions? XGBoost or LightGBM. Fraud detection in production? Probably a GBM. Your first serious ML model at a real company? Almost certainly this.
The mental model
A single decision tree is weak — it overfits or underfits, and its predictions are chunky.
Boosting trains trees sequentially, where each new tree focuses on the mistakes of the ensemble so far. After 500 trees, you have a very strong learner built from 500 very weak ones.
“Gradient” boosting specifically means each new tree is fit to the negative gradient of the loss with respect to the current predictions. For regression with squared loss, that’s just the residuals. For classification, the pseudo-residuals are less intuitive, but the idea holds.
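The residual-fitting loop can be sketched in a few lines. This is a toy illustration of the idea, not any library's internals, using sklearn's DecisionTreeRegressor as the weak learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy gradient boosting for squared loss: each new tree fits the residuals
# (the negative gradient) of the ensemble built so far.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
pred = np.zeros_like(y)              # start from a constant-zero ensemble
trees = []
for _ in range(100):
    residuals = y - pred             # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

train_mse = np.mean((y - pred) ** 2)  # shrinks as trees are added
```

One hundred shallow trees drive the training error down to roughly the noise floor, which no single depth-3 tree could do.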
The three libraries
| Library | Strength | When to pick |
|---|---|---|
| XGBoost | Battle-tested, ubiquitous, GPU support | Default choice, especially in prod |
| LightGBM | Faster training on big data, leaf-wise growth | Large datasets (>1M rows) |
| CatBoost | Handles categorical features natively, less tuning | Lots of high-cardinality categoricals |
All three have nearly identical sklearn-compatible APIs. Pick one and learn it deeply.
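As a sketch of that interchangeability (assuming the standard package names `xgboost`, `lightgbm`, and `catboost`; only libraries actually installed are picked up):

```python
def candidate_models(**common):
    """Collect whichever GBM libraries are installed, configured identically.

    All three expose a sklearn-style estimator with fit/predict, so they can
    be swapped behind one interface. CatBoost accepts n_estimators,
    learning_rate, and max_depth as aliases for its own parameter names.
    """
    models = {}
    try:
        from xgboost import XGBClassifier
        models["xgboost"] = XGBClassifier(**common)
    except ImportError:
        pass
    try:
        from lightgbm import LGBMClassifier
        models["lightgbm"] = LGBMClassifier(**common)
    except ImportError:
        pass
    try:
        from catboost import CatBoostClassifier
        models["catboost"] = CatBoostClassifier(**common)
    except ImportError:
        pass
    return models

models = candidate_models(n_estimators=500, learning_rate=0.05, max_depth=6)
```

Looping over this dict with the same fit/predict calls is an easy way to benchmark all three on your data before committing to one.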
The hyperparameters that actually matter
- `n_estimators` (number of trees) — the most important knob. Start with 500-1000.
- `learning_rate` (aka `eta`) — how much each tree contributes. Lower = more trees needed but better final result. 0.05-0.1 is a reasonable default.
- `max_depth` — tree depth. 4-8 for most problems. Deeper trees overfit.
- `min_child_weight` / `min_samples_leaf` — minimum samples per leaf. Raise this to combat overfitting.
- `subsample` and `colsample_bytree` — sample rows and columns per tree. Both in [0.7, 1.0] usually.
- `reg_alpha` (L1) / `reg_lambda` (L2) — regularization. Mostly leave at defaults unless you see overfitting.
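Put together, a reasonable starting configuration might look like the dict below. It uses XGBoost's parameter names (LightGBM and CatBoost accept aliases for most of them), and the values simply follow the ranges above, not a tuned result:

```python
# Starting-point hyperparameters, XGBoost naming. Values follow the
# rules of thumb above; tune from here, don't ship this blindly.
params = {
    "n_estimators": 1000,     # high cap; rely on early stopping in practice
    "learning_rate": 0.05,
    "max_depth": 6,
    "min_child_weight": 5,    # raise to combat overfitting
    "subsample": 0.8,         # row sampling per tree
    "colsample_bytree": 0.8,  # column sampling per tree
    "reg_alpha": 0.0,         # L1, off by default
    "reg_lambda": 1.0,        # L2, XGBoost's default
}
```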
The three mistakes everyone makes
1. Tuning n_estimators by grid search.
Wrong. Use early stopping instead: set a high cap (like 5000 trees), pass a validation set, and stop training when validation loss stops improving. Faster than grid search, and it adapts the tree count to whatever learning rate you chose.
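The pattern looks the same in every library. To stay dependency-light this sketch hand-rolls the boosting loop with sklearn trees; with XGBoost or LightGBM you get the equivalent behavior by passing an eval_set (plus an early-stopping rounds setting or callback) to fit:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=2000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

lr, patience, max_trees = 0.1, 20, 5000
pred_tr = np.zeros_like(y_tr)
pred_val = np.zeros_like(y_val)
best_loss, best_round, trees = np.inf, 0, []

for i in range(max_trees):
    # Fit the next tree to the current training residuals.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_tr, y_tr - pred_tr)
    pred_tr += lr * tree.predict(X_tr)
    pred_val += lr * tree.predict(X_val)
    trees.append(tree)

    # Track validation loss; stop after `patience` rounds without improvement.
    val_loss = np.mean((y_val - pred_val) ** 2)
    if val_loss < best_loss - 1e-5:
        best_loss, best_round = val_loss, i
    elif i - best_round >= patience:
        break
```

The loop stops long before the 5000-tree cap, and the stopping round shifts automatically if you change the learning rate, which is exactly why grid-searching `n_estimators` is wasted effort.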
2. Not holding out a true test set. Doing CV for hyperparameter tuning AND reporting CV scores as your final number. Your test set needs to be untouched until the very end, or you’re reporting optimistic numbers.
3. Encoding categoricals with plain label encoding. Integer labels impose an arbitrary ordering that the trees then have to split around. For XGBoost, use one-hot for low-cardinality features and target encoding for high-cardinality ones; LightGBM can also consume categoricals natively via its categorical feature support. For CatBoost, just pass them as categorical features directly — that’s its superpower.
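A minimal target-encoding sketch for the high-cardinality case, hand-rolled with numpy (in practice you would compute the encoding out-of-fold to avoid target leakage, e.g. with a library like category_encoders):

```python
import numpy as np

def target_encode(cats, y, smoothing=10.0):
    """Replace each category with a smoothed mean of the target.

    Smoothing pulls rare categories toward the global mean so a category
    seen twice doesn't get an extreme, noisy encoding.
    """
    global_mean = y.mean()
    encoding = {}
    for c in np.unique(cats):
        mask = cats == c
        n = mask.sum()
        encoding[c] = (y[mask].sum() + smoothing * global_mean) / (n + smoothing)
    return np.array([encoding[c] for c in cats]), encoding

cats = np.array(["a", "a", "a", "b", "b", "c"])
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
encoded, mapping = target_encode(cats, y)
```

The result is a single numeric column per categorical feature, which keeps the tree split space small even when the feature has thousands of levels.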
What it doesn’t handle well
- Text (use embeddings first, then feed to the GBM)
- Images (stick with CNNs)
- Very high-dimensional sparse data (linear models or neural networks do better)
- Data with strong time dependence (works, but careful feature engineering beats exotic models)
The production reality
GBMs are easy to deploy: small models (a few MB), fast inference (microseconds), and every major framework exports to ONNX or a self-contained binary. The ops story is dramatically easier than deep learning.