# Databricks
Unified data platform built on Apache Spark, Delta Lake, and MLflow. Strong for ML training and feature engineering on big data.
## At a glance
| Field | Value |
|---|---|
| Category | Data Platform / Lakehouse |
| Difficulty | Intermediate |
| When to use | Spark-scale ETL, distributed training, feature engineering on big data |
| When not to use | Small data, BI-first teams, no JVM/Spark skills in-house |
| Alternatives | Snowflake, BigQuery, AWS EMR, Ray on Kubernetes |
## Mental model: workspace → cluster → job
Databricks has three objects you’ll touch daily:
- Workspace — the logical tenant. It holds notebooks, jobs, catalogs, secrets, and users. One workspace per environment is a reasonable default (dev / staging / prod).
- Cluster — a Spark runtime (driver + workers) spun up on VMs. Two flavors: all-purpose clusters (shared, interactive, expensive to leave running) and job clusters (ephemeral, spun up per job, cheaper).
- Job — a scheduled or triggered run of notebooks, Python wheels, or Delta Live Tables pipelines. Jobs are how you put notebooks into production without copying code.
Rule of thumb: notebooks for exploration, job clusters + wheel tasks for anything that runs twice.
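As a sketch of what a scheduled job on an ephemeral job cluster looks like, here is a minimal Jobs API payload. All names, paths, instance types, and the runtime version are illustrative placeholders; the exact values depend on your cloud and Databricks Runtime.

```json
{
  "name": "nightly-feature-refresh",
  "tasks": [
    {
      "task_key": "build_features",
      "notebook_task": { "notebook_path": "/Repos/prod/etl/build_features" },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```

Because `new_cluster` is declared inside the task, the cluster exists only for the duration of the run, which is exactly the cost profile the rule of thumb above argues for.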
## Unity Catalog
Unity Catalog is the governance layer. It gives you a three-level namespace:
`catalog.schema.table`. It’s the right place to enforce row- and column-level
security, lineage, and audit. If you’re starting fresh in 2026, enable Unity
Catalog on day one — migrating later is painful.
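Day-to-day governance is mostly grants against that three-level namespace. A sketch of read access for an analyst group (catalog, schema, and group names here are hypothetical):

```sql
-- Let the group resolve objects inside the catalog and schema
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.marts TO `analysts`;
-- SELECT at the schema level covers current and future tables in it
GRANT SELECT ON SCHEMA main.marts TO `analysts`;
```

Granting at the schema level rather than per table keeps the permission surface small as new tables appear.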
## Delta Lake basics
Delta is the storage format under the lakehouse. It’s Parquet plus a
transaction log (_delta_log/). What you get:
- ACID writes — multiple writers won’t corrupt a table.
- Time travel — `SELECT * FROM t VERSION AS OF 42`.
- Schema evolution — add columns without rewriting.
- MERGE (upsert) — the upsert pattern that plain Parquet can’t do.
- OPTIMIZE / ZORDER — compact small files, cluster on query columns.
Two commands you’ll run weekly:
```sql
OPTIMIZE events ZORDER BY (user_id);
VACUUM events RETAIN 168 HOURS;
```
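The MERGE pattern from the bullet list above looks like this in Delta SQL. Table names are illustrative, and `UPDATE SET *` / `INSERT *` assume the source and target schemas match:

```sql
MERGE INTO events AS t
USING events_updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

This is the idempotent upsert that plain Parquet can’t express: re-running it with the same source leaves the table unchanged.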
## Databricks vs Snowflake — how we pick
| Dimension | Databricks wins when… | Snowflake wins when… |
|---|---|---|
| ML training | You need GPUs, MLflow, Spark ML, PyTorch distributed | You only need in-warehouse UDFs |
| Unstructured data | Images, audio, nested JSON, streaming | Not your use case |
| Pure SQL analytics / BI | Acceptable with SQL Warehouses | Best-in-class concurrency and zero-config |
| Governance / compliance | Unity Catalog is good, setup-heavy | Mature, simpler RBAC |
| Team skillset | Comfortable with Spark/Python | SQL-first analysts |
The short version: Databricks for ML and big data ETL, Snowflake for analytics and BI. Most orgs end up with both.
## MLflow integration
MLflow is built in. Every notebook run is auto-logged with parameters,
metrics, and artifacts. The Model Registry lives in Unity Catalog and
promotes models via aliases (e.g. `@champion`) — the legacy workspace
registry’s None → Staging → Production stages are deprecated under Unity
Catalog. Serving is one click from a registered model to a REST endpoint.
```python
import mlflow

mlflow.set_experiment("/Shared/ephizen/churn")
with mlflow.start_run():
    mlflow.log_param("max_depth", 8)
    model = train(...)
    mlflow.sklearn.log_model(
        model, "model",
        registered_model_name="main.ml.churn",
    )
```
## How Ephizen uses it
- Feature engineering on Spark. Heavy joins and windowed aggregations across our events log live as Delta tables, refreshed by a Databricks Workflow on a job cluster.
- Training. XGBoost and PyTorch jobs read features from Delta, log to MLflow, register the winner to Unity Catalog.
- Serving goes elsewhere — we pull the MLflow artifact into a FastAPI container running on ECS. Databricks Model Serving is convenient but pricier than rolling our own for high-traffic models.
## Cost gotchas
- Leaving an all-purpose cluster running overnight is the single biggest waste. Set auto-termination to 30 minutes.
- Photon is faster but billed at a higher DBU rate — only turn it on for production SQL workloads where the speedup pays for itself.
- Job clusters finish and disappear; prefer them over all-purpose for any scheduled work.
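For the auto-termination point, a minimal all-purpose cluster spec (Clusters API; all values illustrative) sets `autotermination_minutes` explicitly rather than relying on whatever the workspace default happens to be:

```json
{
  "cluster_name": "shared-dev",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 30
}
```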