SQL for ML Engineers

Before you train a model you have to get the data, and the data lives in a warehouse. The SQL you actually need — joins, window functions, CTEs, and not dying on skew.

SQL & Databases beginner #sql #data #warehouse #analytics
Prereqs: Basic logic

Why this matters more than you think

You will spend 40% of your time wrangling training data from a warehouse. If your SQL is weak, you will either waste days waiting on queries, or worse, ship a model on silently wrong data.

The 10 things you must know cold

1. SELECT / FROM / WHERE

The basics. WHERE runs before aggregation, HAVING runs after.

2. JOINs

  • INNER — only matching rows
  • LEFT — all rows from left, matching from right (this is the one you want 80% of the time)
  • FULL OUTER — rare, but real
  • CROSS — cartesian, usually a bug

3. GROUP BY + aggregates

SELECT country, COUNT(*) AS users, AVG(orders) AS avg_orders
FROM customers
GROUP BY country
HAVING COUNT(*) > 100;

4. Window functions

The feature that separates casual from serious SQL users.

SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
FROM events
QUALIFY rn = 1;  -- most recent event per user

Also: RANK, DENSE_RANK, LAG, LEAD, SUM() OVER, AVG() OVER with frame clauses.

5. CTEs (WITH clauses)

Break big queries into named steps. Much more readable than nested subqueries.

WITH active_users AS (
  SELECT user_id FROM events WHERE ts > now() - interval '7 days'
),
purchases AS (
  SELECT user_id, SUM(amount) AS spend FROM orders GROUP BY user_id
)
SELECT a.user_id, p.spend
FROM active_users a
LEFT JOIN purchases p USING (user_id);

6. Date and time

date_trunc('day', ts), ts::date, extract(hour from ts), interval arithmetic. Every warehouse has slightly different syntax — read the docs.

7. NULL handling

NULL = NULL is NULL, not TRUE. Use IS NULL, COALESCE, NULLIF.

8. Deduplication

SELECT DISTINCT user_id FROM events;
-- or, when you want the full row, pick one with a window function:
SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) rn
  FROM events
) WHERE rn = 1;

9. EXPLAIN / query plans

EXPLAIN ANALYZE <query> tells you what the warehouse is actually doing. If the estimate and the reality are wildly different, your stats are stale.

10. Sampling

SELECT * FROM huge_table TABLESAMPLE SYSTEM (1);  -- 1%

Never run a new query on full prod data the first time. Sample, inspect, then scale.

The gotchas specific to ML work

  • Data leakage through time — when building training data, always filter features to “what would have been known at time T”. A JOIN without a time predicate is how you ship a model that memorized the future.
  • Skew — one user with 10M events will blow up your GROUP BY. LIMIT in your CTEs during dev.
  • Floats in WHERE clausesWHERE price = 9.99 often misses rows because of float precision. Use ranges or round.

What to skip

Stored procedures. Triggers. Cursor-based loops. Recursive CTEs unless you hit a tree problem. Most of the SQL-89 stuff.