XGBoost
High-performance gradient-boosted decision tree library. The default strong baseline for tabular data.
Category
Gradient Boosting
Difficulty
Beginner
When to use
Any supervised learning problem on tabular data — especially classification, regression, and ranking.
When not to use
Unstructured data (images, audio, text) where deep learning wins by a wide margin.
Alternatives
LightGBM, CatBoost, scikit-learn GBM
At a glance
| Field | Value |
|---|---|
| Category | Gradient boosting library |
| Difficulty | Beginner → Intermediate |
| When to use | Tabular supervised learning, ranking |
| When not to use | Images, audio, text, graphs |
| Alternatives | LightGBM, CatBoost |
What it is
XGBoost trains an ensemble of shallow decision trees with second-order gradient information, regularization on tree complexity, and a fast histogram-based split finder. It handles missing values natively and is well-behaved on both small and large datasets.
When we reach for it at Ephizen
- First baseline on any tabular problem, before anything fancier.
- Churn, conversion, credit, and fraud-style classification.
- Ranking features by importance or SHAP value for stakeholder-facing explanations.
- Small structured components inside larger systems where latency and interpretability matter.
Getting started
```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    early_stopping_rounds=25,  # requires an eval_set in fit()
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```
Gotchas
- Default `n_estimators` is often too low. Use early stopping on a validation set instead of tuning it by hand.
- Categorical features: pass `enable_categorical=True` and use the pandas category dtype, or one-hot encode.
- LightGBM is usually faster on huge datasets; CatBoost handles categoricals more elegantly. Benchmark all three on your data.
- SHAP values are built in; use them for local explanations rather than relying on `feature_importances_` alone.
Related tools
- PyTorch: The dominant deep learning framework. Dynamic graphs, great debugging, and the de facto standard for research and most production ML.
- scikit-learn: The classical ML library for Python. Consistent API over dozens of algorithms for regression, classification, clustering, and preprocessing.
- TensorFlow: Google's deep learning framework. Still widely deployed in production, especially via TF Serving, TFLite, and TF.js.