# Snowflake
Cloud data warehouse with separated storage and compute. SQL-first, low-ops, and the BI team's favorite.
## At a glance
| Field | Value |
|---|---|
| Category | Data Warehouse |
| Difficulty | Beginner (SQL) → Intermediate (Snowpark / governance) |
| When to use | Analytics, BI, governed reporting, ELT output, embedded analytics |
| When not to use | Heavy ML training, unstructured data, low-latency OLTP |
| Alternatives | Databricks, BigQuery, Redshift, DuckDB |
## Virtual warehouses
A “warehouse” in Snowflake is not the data — it’s the compute. Storage lives in S3/GCS/Azure Blob, managed by Snowflake. Compute is a stateless cluster you spin up on demand.
- Sized T-shirt style: `XS` → `6XL`. Each step up doubles the node count and the credit cost.
- Auto-suspend when idle (default 10 min). Auto-resume on first query.
- Independent warehouses can query the same data concurrently — the ETL job on an `L` warehouse doesn't fight the BI dashboards on an `XS`.
This separation is the Snowflake superpower: you never pay for compute you don’t use, and you never queue behind another team.
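Warehouse sizing and suspend behavior are set in DDL. A minimal sketch, with placeholder warehouse names (note `AUTO_SUSPEND` is in seconds):

```sql
-- Dedicated compute for ETL: suspends after 60 s idle,
-- wakes automatically on the next query, starts suspended.
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Separate, smaller compute for BI; the two never contend.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
```

Because both warehouses read the same storage layer, resizing or dropping one has no effect on the other's queries.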
## Micro-partitions
Snowflake stores tables as immutable ~16 MB compressed columnar micro-partitions. It automatically tracks per-partition min/max on every column. That metadata powers pruning — most queries only scan a tiny fraction of the data.
You don’t create or manage these. You influence pruning by:
- Clustering keys on very large tables that need predictable pruning.
- Filtering on columns whose values correlate with load order (timestamps, auto-incrementing IDs) — their per-partition min/max ranges barely overlap, so the metadata prunes aggressively. Snowflake has no user-managed indexes; this metadata is the whole game.
- Avoiding `SELECT *` on wide tables — columnar scan cost scales with the columns read.
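For very large tables, a clustering key makes pruning predictable. A sketch, with assumed table and column names:

```sql
-- Cluster a large event table on its most common filter column.
ALTER TABLE events CLUSTER BY (event_date);

-- Inspect how well-clustered the table is on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');
```

Clustering triggers background reclustering, which itself consumes credits, so reserve it for tables where pruning demonstrably fails.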
Check pruning with:

```sql
SELECT *
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION())
ORDER BY START_TIME DESC;
```

and look at `PARTITIONS_SCANNED` vs `PARTITIONS_TOTAL`.
## Snowpark
Snowpark lets you run Python, Java, or Scala inside Snowflake warehouses. DataFrame API that looks like PySpark, with UDFs and stored procedures that execute on the warehouse compute. Handy for feature engineering when you don’t want to ship data out to a separate cluster.
Caveats: you’re constrained to what the Snowpark Python runtime ships (no arbitrary C extensions, limited package set), and heavy ML training is still better done elsewhere.
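A common Snowpark-adjacent pattern is registering a Python UDF through SQL so the code runs on warehouse compute next to the data. A hedged sketch — the function name and logic are illustrative, and any packages must come from Snowflake's curated Anaconda channel:

```sql
CREATE OR REPLACE FUNCTION clip_outliers(v FLOAT)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'clip'
AS
$$
def clip(v):
    # Illustrative feature-engineering step: cap extreme z-scores.
    if v is None:
        return None
    return max(-3.0, min(3.0, v))
$$;

-- Then call it like any SQL function:
-- SELECT clip_outliers(zscore) FROM features;
```

The UDF scales with the warehouse automatically; no separate cluster to manage.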
## When Snowflake wins
- Concurrent BI dashboards. 50 analysts hitting Tableau simultaneously? Spin up a dedicated warehouse for BI and the rest of your workloads are unaffected.
- Governance and compliance. Row-access policies, masking policies, object tagging, and Horizon Catalog are mature. Auditors like Snowflake.
- Data sharing. Share a live dataset with a partner across Snowflake accounts without copying it.
- Zero-copy clones. `CREATE TABLE staging CLONE prod;` is instant. Great for dev environments off prod data.
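Sharing is grant-based, not copy-based. A sketch of the provider side, with placeholder database, table, and account names:

```sql
-- Provider account: expose a dataset to a consumer without copying it.
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE analytics TO SHARE partner_share;
GRANT USAGE ON SCHEMA analytics.public TO SHARE partner_share;
GRANT SELECT ON TABLE analytics.public.daily_metrics TO SHARE partner_share;

-- Attach the consumer account (organization.account_name).
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;
```

The consumer mounts the share as a read-only database; they see your updates live, and you pay no storage twice.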
## When Snowflake loses
- Heavy ML training. Snowpark ML exists, but for real DL training on GPUs, Databricks or a dedicated training stack wins.
- Unstructured data. Images, audio, binary blobs: tolerable via directory tables, not pleasant.
- Streaming. Snowpipe and dynamic tables are catching up but a Kafka → Spark → Delta pipeline still beats them for sub-minute latency.
- Cost control under sloppy usage. `SELECT *` with no filters on a big table on a `2XL` warehouse is an expensive mistake. Resource monitors and query tags are mandatory, not optional.
## Cost model in one paragraph
You pay for storage (cheap, per TB-month) and compute (per-second billing on running warehouses, charged in credits). The credit price depends on edition (Standard → Business Critical) and cloud region. A warehouse at rest costs nothing; an `XL` warehouse running continuously is the fastest way to burn budget. Set auto-suspend aggressively, use resource monitors, tag queries by team, and review the top-N most expensive queries weekly.
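Those guardrails can be sketched in DDL — quota, names, and thresholds below are placeholders:

```sql
-- Monthly credit cap: notify at 75%, suspend the warehouse at 100%.
CREATE RESOURCE MONITOR bi_monitor
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = bi_monitor;

-- Tag queries so spend can be attributed per team.
ALTER SESSION SET QUERY_TAG = 'team:bi';
```

`DO SUSPEND` lets running queries finish; `DO SUSPEND_IMMEDIATE` kills them — pick per workload tolerance.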