Databricks

Unified data platform built on Apache Spark, Delta Lake, and MLflow. Strong for ML training and feature engineering on big data.

Category: Data Warehouses
Difficulty: Intermediate
When to use: You're running Spark-scale ETL, training models on terabytes, and want notebook + job + MLflow in one place.
When not to use: Your workload fits in Postgres, your team wants BI-first governance, or you're allergic to JVM config.
Alternatives: Snowflake, BigQuery, AWS EMR, Ray on Kubernetes

At a glance

Field           | Value
----------------|------
Category        | Data Platform / Lakehouse
Difficulty      | Intermediate
When to use     | Spark-scale ETL, distributed training, feature engineering on big data
When not to use | Small data, BI-first teams, no JVM/Spark skills in-house
Alternatives    | Snowflake, BigQuery, AWS EMR, Ray on Kubernetes

Mental model: workspace → cluster → job

Databricks has three objects you’ll touch daily:

  • Workspace — the logical tenant. It holds notebooks, jobs, catalogs, secrets, and users. One workspace per environment is a reasonable default (dev / staging / prod).
  • Cluster — a Spark runtime (driver + workers) spun up on VMs. Two flavors: all-purpose clusters (shared, interactive, expensive to leave running) and job clusters (ephemeral, spun up per job, cheaper).
  • Job — a scheduled or triggered run of notebooks, Python wheels, or Delta Live Tables pipelines. Jobs are how you put notebooks into production without copying code.

Rule of thumb: notebooks for exploration, job clusters + wheel tasks for anything that runs twice.
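The job-cluster pattern above can be sketched as a Jobs API payload. This is a hedged example, not a verbatim spec: the job name, wheel package, node type, and runtime version are all made up, and in practice you'd submit it via the Databricks SDK, CLI, or Terraform.

```python
# Hypothetical Databricks Jobs API 2.1-style payload: a Python wheel task
# on an ephemeral job cluster (names and sizes are illustrative).
job_spec = {
    "name": "nightly-feature-refresh",
    "tasks": [
        {
            "task_key": "build_features",
            "python_wheel_task": {
                "package_name": "ephizen_features",  # hypothetical wheel
                "entry_point": "main",
            },
            # Job cluster: created for this run, torn down when it finishes.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
        "timezone_id": "UTC",
    },
}
```

The key point is `new_cluster` inside the task: that's what makes it a job cluster rather than an all-purpose one left running between runs.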

Unity Catalog

Unity Catalog is the governance layer. It gives you a three-level namespace: catalog.schema.table. It’s the right place to enforce row/column level security, lineage, and audit. If you’re starting fresh in 2026, enable Unity Catalog on day one — migrating later is painful.
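The three-level namespace and a grant look roughly like this; catalog, schema, and group names here are invented for illustration:

```sql
-- Hypothetical layout: one catalog per environment is a common pattern.
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.ml;
CREATE TABLE IF NOT EXISTS prod.ml.churn_features (user_id BIGINT, score DOUBLE);

-- Governance is applied at the catalog / schema / table level.
GRANT SELECT ON TABLE prod.ml.churn_features TO `data-scientists`;
```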

Delta Lake basics

Delta is the storage format under the lakehouse. It’s Parquet plus a transaction log (_delta_log/). What you get:

  • ACID writes — multiple writers won’t corrupt a table.
  • Time travel — SELECT * FROM t VERSION AS OF 42.
  • Schema evolution — add columns without rewriting.
  • MERGE (upsert) — update matched rows and insert new ones in one statement, which plain Parquet can’t do.
  • OPTIMIZE / ZORDER — compact small files, cluster on query columns.

Two commands you’ll run weekly:

OPTIMIZE events ZORDER BY (user_id);  -- compact small files, co-locate rows by user_id
VACUUM events RETAIN 168 HOURS;       -- delete unreferenced files older than 7 days
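The MERGE upsert mentioned above looks like this on a Delta table; the table and join key are illustrative:

```sql
-- Upsert a batch of updates into events (illustrative schema).
MERGE INTO events AS t
USING updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```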

Databricks vs Snowflake — how we pick

Dimension               | Databricks wins when…                                 | Snowflake wins when…
------------------------|-------------------------------------------------------|---------------------
ML training             | You need GPUs, MLflow, Spark ML, distributed PyTorch  | You only need in-warehouse UDFs
Unstructured data       | Images, audio, nested JSON, streaming                 | Not your use case
Pure SQL analytics / BI | Acceptable with SQL Warehouses                        | Best-in-class concurrency and zero-config
Governance / compliance | Unity Catalog is good, setup-heavy                    | Mature, simpler RBAC
Team skillset           | Comfortable with Spark/Python                         | SQL-first analysts

The short version: Databricks for ML and big data ETL, Snowflake for analytics and BI. Most orgs end up with both.

MLflow integration

MLflow is built in. Notebook runs are auto-logged with parameters, metrics, and artifacts on the ML runtimes. The Model Registry lives in Unity Catalog and promotes models via aliases (e.g. @champion); the legacy workspace registry used fixed stages (None → Staging → Production). Serving is one click from a registered model to a REST endpoint.

import mlflow

mlflow.set_experiment("/Shared/ephizen/churn")
with mlflow.start_run():
    mlflow.log_param("max_depth", 8)
    model = train(...)  # your training function
    # Register under the Unity Catalog three-level name catalog.schema.model
    mlflow.sklearn.log_model(
        model, "model",
        registered_model_name="main.ml.churn",
    )

How Ephizen uses it

  • Feature engineering on Spark. Heavy joins and windowed aggregations across our events log live as Delta tables, refreshed by a Databricks Workflow on a job cluster.
  • Training. XGBoost and PyTorch jobs read features from Delta, log to MLflow, register the winner to Unity Catalog.
  • Serving goes elsewhere — we pull the MLflow artifact into a FastAPI container running on ECS. Databricks Model Serving is convenient but pricier than rolling our own for high-traffic models.

Cost gotchas

  • Leaving an all-purpose cluster running overnight is the single biggest waste. Set auto-termination to 30 minutes.
  • Photon is faster but billed at a higher DBU rate — only turn it on for production SQL workloads where the speedup pays for itself.
  • Job clusters finish and disappear; prefer them over all-purpose for any scheduled work.
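The first gotcha is a one-field fix. A sketch of an all-purpose cluster spec with auto-termination and autoscaling, using Clusters API-style field names (values and names here are examples, not a recommendation for your workload):

```python
# Hypothetical all-purpose cluster spec: terminate after 30 idle minutes
# and autoscale instead of pinning a fixed worker count (example values).
cluster_spec = {
    "cluster_name": "shared-exploration",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "autotermination_minutes": 30,  # the single biggest cost saver
}
```

Autoscaling helps too: an idle-but-alive cluster scales down to `min_workers` even before auto-termination kicks in.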

Related tools