SQL for ML Engineers
Before you train a model you have to get the data, and the data lives in a warehouse. The SQL you actually need — joins, window functions, CTEs, and not dying on skew.
Why this matters more than you think
You will spend 40% of your time wrangling training data from a warehouse. If your SQL is weak, you will either waste days waiting on queries, or worse, ship a model on silently wrong data.
The 10 things you must know cold
1. SELECT / FROM / WHERE
The basics. WHERE runs before aggregation, HAVING runs after.
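The ordering matters because an aggregate can't appear in WHERE. A quick sketch (table and column names are hypothetical):

```sql
-- WHERE filters individual rows before grouping;
-- HAVING filters groups after aggregation.
-- Writing WHERE COUNT(*) > 10 is an error: the count doesn't exist yet.
SELECT country, COUNT(*) AS signups
FROM signups
WHERE channel = 'paid'      -- row-level filter, runs first
GROUP BY country
HAVING COUNT(*) > 10;       -- group-level filter, runs after aggregation
```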
2. JOINs
- INNER — only matching rows
- LEFT — all rows from left, matching from right (this is the one you want 80% of the time)
- FULL OUTER — rare, but real
- CROSS — cartesian, usually a bug
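Why LEFT is the default when building a feature table — a sketch with hypothetical `users` and `orders` tables:

```sql
-- LEFT JOIN keeps every user, even those with zero orders;
-- an INNER JOIN here would silently drop them from the training set.
SELECT u.user_id,
       COALESCE(o.order_count, 0) AS order_count
FROM users u
LEFT JOIN (
  SELECT user_id, COUNT(*) AS order_count
  FROM orders
  GROUP BY user_id
) o ON o.user_id = u.user_id;
```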
3. GROUP BY + aggregates
SELECT country, COUNT(*) AS users, AVG(orders) AS avg_orders
FROM customers
GROUP BY country
HAVING COUNT(*) > 100;
4. Window functions
The feature that separates casual from serious SQL users.
SELECT
user_id,
event_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
FROM events
QUALIFY rn = 1; -- most recent event per user
Also: RANK, DENSE_RANK, LAG, LEAD, SUM() OVER, AVG() OVER with frame clauses. Note that QUALIFY is a warehouse extension (Snowflake, BigQuery, DuckDB); on Postgres, wrap the query in a subquery and filter on rn in the outer SELECT.
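LAG is the workhorse for time-delta features. A sketch, reusing the events table from above:

```sql
-- Time since the user's previous event — a common behavioral feature.
-- NULL for each user's first event (there is no previous row).
SELECT
  user_id,
  event_time,
  event_time - LAG(event_time) OVER (
    PARTITION BY user_id ORDER BY event_time
  ) AS time_since_last_event
FROM events;
```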
5. CTEs (WITH clauses)
Break big queries into named steps. Much more readable than nested subqueries.
WITH active_users AS (
SELECT user_id FROM events WHERE ts > now() - interval '7 days'
),
purchases AS (
SELECT user_id, SUM(amount) AS spend FROM orders GROUP BY user_id
)
SELECT a.user_id, p.spend
FROM active_users a
LEFT JOIN purchases p USING (user_id);
6. Date and time
date_trunc('day', ts), ts::date, extract(hour from ts), interval arithmetic. Every warehouse has slightly different syntax — read the docs.
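Postgres-flavored examples of the four you'll reach for most (column names hypothetical; syntax shifts per warehouse):

```sql
SELECT
  date_trunc('day', ts)       AS day,            -- truncate to midnight
  ts::date                    AS just_the_date,  -- cast away the time part
  extract(hour FROM ts)       AS hour_of_day,    -- 0-23, useful as a feature
  ts + interval '30 days'     AS expiry          -- interval arithmetic
FROM events;
```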
7. NULL handling
NULL = NULL is NULL, not TRUE. Use IS NULL, COALESCE, NULLIF.
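The practical consequences, sketched (hypothetical `users` table):

```sql
-- WHERE referrer = NULL matches nothing: three-valued logic eats the row.
SELECT * FROM users WHERE referrer IS NULL;        -- correct NULL test
SELECT COALESCE(referrer, 'organic') FROM users;   -- default for NULLs
SELECT NULLIF(plan, '') FROM users;                -- turn '' into NULL
```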
8. Deduplication
SELECT DISTINCT user_id FROM events;
-- or, when you want the full row, pick one with a window function:
SELECT * EXCEPT(rn) FROM (  -- EXCEPT(col) is BigQuery syntax; elsewhere, list the columns
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM events
) WHERE rn = 1;
9. EXPLAIN / query plans
EXPLAIN ANALYZE <query> tells you what the warehouse is actually doing. If the estimate and the reality are wildly different, your stats are stale.
10. Sampling
SELECT * FROM huge_table TABLESAMPLE SYSTEM (1); -- 1%
Never run a new query on full prod data the first time. Sample, inspect, then scale.
The gotchas specific to ML work
- Data leakage through time — when building training data, always filter features to “what would have been known at time T”. A JOIN without a time predicate is how you ship a model that memorized the future.
- Skew — one user with 10M events will blow up your GROUP BY. LIMIT in your CTEs during dev.
- Floats in WHERE clauses — WHERE price = 9.99 often misses rows because of float precision. Use ranges or round.
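A point-in-time join, sketched with hypothetical `labels` and `features` tables — the time predicate on f.as_of_ts is the part people forget. Uses QUALIFY as in §4; on warehouses without it, filter rn = 1 in an outer query:

```sql
-- For each label at time T, take the latest feature row known strictly before T.
SELECT l.user_id, l.label_ts, l.label, f.spend_30d
FROM labels l
LEFT JOIN features f
  ON  f.user_id  = l.user_id
  AND f.as_of_ts < l.label_ts     -- nothing from the future leaks in
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY l.user_id, l.label_ts
  ORDER BY f.as_of_ts DESC
) = 1;
```

Labels with no prior feature row survive as NULLs (thanks to the LEFT JOIN), so you can see the gap instead of silently losing the example.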
What to skip
Stored procedures. Triggers. Cursor-based loops. Recursive CTEs unless you hit a tree problem. Most of the SQL-89 stuff.