AI Engineer
Builds LLM-powered products. Strong at API integration, RAG, agents, and shipping reliable AI features into real software.
Between a backend engineer and a researcher
An AI Engineer sits between a backend engineer and a machine learning researcher. You don't train foundation models — you wire them into real products. The job is to take a vague product ask ("let users search our docs with natural language"), pick the right LLM, design the retrieval pipeline, evaluate it, ship it behind an API, and keep it fast and cheap in production. Expect to spend your day in Python, reading model docs, debugging prompts, tuning retrieval, watching latency dashboards, and arguing about which eval actually measures what users care about.
What you actually do
Build and ship LLM-powered features end to end — from prompt to API to deployed service.
Design RAG pipelines over real company data — chunking, embedding, hybrid search, reranking, evaluation.
Write and debug agents that call tools, handle failures, and stay within cost budgets.
Define evaluation suites for non-deterministic outputs and catch regressions before users do.
Harden prompts and tool interfaces against prompt injection, data leaks, and PII mishandling.
Optimize latency, cost, and token usage across the request path (caching, streaming, batching).
Choose between proprietary APIs (OpenAI, Anthropic) and open-weight models (Llama, Mistral) per use case.
Pair with product and design on what the model should and shouldn't do.
Run on-call for the AI platform — triage hallucinations, throttling, API outages, and spend spikes.
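The latency/cost item above is often solved first with a response cache. A minimal sketch, assuming `call_llm` is a stand-in for a real provider SDK call (OpenAI, Anthropic, etc.):

```python
import hashlib

calls = {"llm": 0}  # counts real model invocations, to show cache hits

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real provider call.
    calls["llm"] += 1
    return f"answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    # Key on a hash of the normalized prompt; identical requests
    # skip the API entirely, saving both latency and tokens.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```

In production you'd bound the cache (LRU, TTL) and decide how aggressive normalization should be — over-normalizing can serve the wrong answer.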
Projects you might own
A RAG-powered assistant over Confluence + Notion + engineering wiki, with source citations and an eval suite tied to a golden question set.
An agent that reads incoming support tickets, classifies urgency, drafts a reply, and opens a Linear issue — with a human approval step before anything goes out.
A sales-call assistant that takes a Zoom transcript, extracts action items, writes a follow-up email in the rep's voice, and logs it to the CRM. Includes a "did the model make this up?" check.
A search overhaul that replaces keyword search with hybrid semantic + BM25 retrieval over the catalog, with reranking and feedback loops from click data.
A GitHub-integrated bot that reads PRs, flags obvious bugs and style issues, and suggests fixes — with a hard rule that it never approves, only comments.
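For the hybrid search project, one common way to merge a BM25 ranking with a semantic ranking is Reciprocal Rank Fusion. A minimal sketch (the doc ids and `k=60` default are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each input list ranks doc ids best-first;
    # a doc's fused score is the sum of 1 / (k + rank) across lists.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between the two retrievers, which is why it's a popular first fusion step before a learned reranker.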
Intro — Prerequisites
The minimum baseline to read AI/ML docs and write real Python.
- Implement cosine similarity from scratch using only Python lists.
- Solve 20 easy LeetCode array and hashmap problems.
- Reproduce a numpy dot-product in pure Python and benchmark it against numpy.
- → Clone a HuggingFace model and call it from a local script. Measure latency.
- → Set up uv, create a venv, pin deps, and publish a toy package to TestPyPI.
- When does "just use Python" stop scaling?
- Why do dot products show up in attention, in embeddings, and in PCA?
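The first exercise above — cosine similarity with only Python lists — can be sketched as:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|), written with plain lists and generator loops
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The same dot product in the numerator is what embedding search, attention scores, and PCA projections all compute — hence the last question above.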
Junior — Working AI Engineer
You can call LLM APIs, build a basic RAG, and ship a FastAPI endpoint.
- Build a RAG pipeline over your company's Notion export.
- Write a FastAPI endpoint that accepts a file upload, embeds it, and stores it in pgvector.
- Measure and reduce latency on a 3-step LLM chain using prompt caching.
- → Swap Chroma for pgvector and measure recall on the same queries.
- → Try 4 chunking strategies and eval with Ragas. Which wins on your data?
- Is RAG really the right answer, or would fine-tuning be cheaper?
- What breaks first when your corpus grows 100x?
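Two of the four chunking strategies from the exercise above can be sketched in a few lines — fixed-size windows with overlap, and paragraph splitting (sizes and overlap here are illustrative defaults, not recommendations):

```python
def fixed_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Sliding character window; the overlap preserves context at boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def paragraph_chunks(text: str) -> list[str]:
    # Split on blank lines; respects the author's own structure.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Which strategy wins is an empirical question — exactly what the Ragas eval in the exercise is for.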
Intermediate — Ships Production AI
Agents, tool use, evals, observability, and cost control under real traffic.
- Build an agent that can book a meeting by calling 3 real tools.
- Add an eval suite with 50 hand-written test cases and track regression.
- Instrument a RAG app with LangSmith and find the slowest span.
- → Compare ReAct vs. Planner-Executor on the same task. Which hallucinates more?
- → Reproduce a known prompt injection against your own agent. Patch it.
- Where does the agent abstraction leak, and how do you contain it?
- What's the right eval when the "correct" answer isn't a string?
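The eval-suite exercise above can start very small. A minimal sketch, assuming each test case carries a callable judge rather than an exact expected string (one answer to the "correct answer isn't a string" question):

```python
def run_eval(model, cases: list[dict]) -> float:
    # Each case: {"input": ..., "expect": callable judging the output}.
    # A callable judge tolerates non-deterministic outputs better than
    # exact string matching.
    passed = sum(1 for c in cases if c["expect"](model(c["input"])))
    return passed / len(cases)

def check_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    # Fail the build if the score drops more than `tolerance` below baseline.
    return score >= baseline - tolerance
```

Wire `run_eval` into CI against your 50 hand-written cases, store the baseline score, and `check_regression` becomes the gate on every PR.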
Senior — Owns AI Platform
Architecture, cost, reliability, team leverage, and knowing what NOT to build.
- Fine-tune a 7B model with LoRA on a domain-specific dataset and measure gains.
- Build a cost-per-request dashboard that attributes spend per feature.
- Design an on-call runbook for the AI platform.
- → Run a shadow-traffic A/B between a proprietary and open-weight model. What's the actual quality gap?
- → Quantize a served model from fp16 to int8. Measure latency, memory, and quality delta.
- What's the smallest model that solves your actual problem?
- When is the right moment to stop using an LLM at all?
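The cost-per-request dashboard above starts with one function: price tokens per request. A minimal sketch — the model names and per-million-token prices here are made up; real numbers vary by provider and change often:

```python
# Hypothetical per-million-token prices (USD); check your provider's pricing.
PRICES = {
    "small": {"in": 0.15, "out": 0.60},
    "large": {"in": 3.00, "out": 15.00},
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    # Input and output tokens are priced separately on most providers.
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
```

Tag each request with a feature id, sum `request_cost` per tag, and you have the attribution the dashboard needs — plus a number to compare against the "smallest model that solves your actual problem."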
Ship the whole thing
Ship an end-to-end AI product — RAG over a real corpus, agent loop with tools, eval suite, and a deployed FastAPI backend.
- Reproducible eval suite with at least 50 hand-written test cases that runs on every PR
- Handles prompt injection attempts without leaking system prompts or private data
- Answers include inline citations to the underlying source chunks
- Cost per request tracked and budgeted, with a hard ceiling that trips an alert
- p95 latency under 3 seconds end-to-end on the golden query set
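The p95 requirement above is checkable with a few lines. A sketch using the nearest-rank percentile method (the 3-second budget comes from the checklist; the method choice is an assumption):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    # Nearest-rank percentile: sort, take the ceil(0.95 * n)-th value.
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def within_budget(latencies_ms: list[float], budget_ms: float = 3000.0) -> bool:
    # Run the golden query set, collect end-to-end latencies, gate on p95.
    return p95(latencies_ms) <= budget_ms
```

Run this over the golden query set in CI alongside the eval suite, so a latency regression fails a PR the same way a quality regression does.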
This roadmap follows the AI Engineer track in four phases, from prerequisites through senior platform ownership. Each chip links to the wiki entry for that topic — and where the entry doesn't exist yet, it points at the backlog so you know what's coming next.