AI Engineer
Builds LLM-powered products. Strong at API integration, RAG, agents, and shipping reliable AI features into real software.
Between a backend engineer and a researcher
An AI Engineer sits between a backend engineer and a machine learning researcher. You don't train foundation models — you wire them into real products. The job is to take a vague product ask ("let users search our docs with natural language"), pick the right LLM, design the retrieval pipeline, evaluate it, ship it behind an API, and keep it fast and cheap in production. Expect to spend your day in Python, reading model docs, debugging prompts, tuning retrieval, watching latency dashboards, and arguing about which eval actually measures what users care about.
What you actually do
Build and ship LLM-powered features end to end — from prompt to API to deployed service.
Design RAG pipelines over real company data — chunking, embedding, hybrid search, reranking, evaluation.
Write and debug agents that call tools, handle failures, and stay within cost budgets.
Define evaluation suites for non-deterministic outputs and catch regressions before users do.
Harden prompts and tool interfaces against prompt injection, data leaks, and PII mishandling.
Optimize latency, cost, and token usage across the request path (caching, streaming, batching).
Choose between proprietary APIs (OpenAI, Anthropic) and open-weight models (Llama, Mistral) per use case.
Pair with product and design on what the model should and shouldn't do.
Run on-call for the AI platform — triage hallucinations, throttling, API outages, and spend spikes.
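The latency/cost item above is often solved first with a response cache. A minimal sketch, assuming `call_llm` is a stand-in for a real provider SDK call (OpenAI, Anthropic, etc.):

```python
import hashlib

calls = {"llm": 0}  # counts real model invocations, to show cache hits

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real provider call.
    calls["llm"] += 1
    return f"answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    # Key on a hash of the normalized prompt; identical requests
    # skip the API entirely, saving both latency and tokens.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```

In production you'd bound the cache (LRU, TTL) and decide how aggressive normalization should be — over-normalizing can serve the wrong answer.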
Projects you might own
A RAG-powered assistant over Confluence + Notion + engineering wiki, with source citations and an eval suite tied to a golden question set.
An agent that reads incoming support tickets, classifies urgency, drafts a reply, and opens a Linear issue — with a human approval step before anything goes out.
A sales-call assistant that takes a Zoom transcript, extracts action items, writes a follow-up email in the rep's voice, and logs it to the CRM. Includes a "did the model make this up?" check.
A search overhaul that replaces keyword search with hybrid semantic + BM25 retrieval over the catalog, with reranking and feedback loops from click data.
A GitHub-integrated bot that reads PRs, flags obvious bugs and style issues, and suggests fixes — with a hard rule that it never approves, only comments.
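For the hybrid search project, one common way to merge a BM25 ranking with a semantic ranking is Reciprocal Rank Fusion. A minimal sketch (the doc ids and `k=60` default are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each input list ranks doc ids best-first;
    # a doc's fused score is the sum of 1 / (k + rank) across lists.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between the two retrievers, which is why it's a popular first fusion step before a learned reranker.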
Intro — Prerequisites
The minimum baseline to read AI/ML docs and write real Python.
- Implement cosine similarity from scratch using only Python lists.
- Solve 20 easy LeetCode array and hashmap problems.
- Reproduce a numpy dot-product in pure Python and benchmark it against numpy.
- → Clone a HuggingFace model and call it from a local script. Measure latency.
- → Set up uv, create a venv, pin deps, and publish a toy package to TestPyPI.
- When does "just use Python" stop scaling?
- Why do dot products show up in attention, in embeddings, and in PCA?
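The first exercise above — cosine similarity with only Python lists — can be sketched as:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|), written with plain lists and generator loops
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The same dot product in the numerator is what embedding search, attention scores, and PCA projections all compute — hence the last question above.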
Junior — Working AI Engineer
You can call LLM APIs, build a basic RAG, and ship a FastAPI endpoint.
- Build a RAG pipeline over your company's Notion export.
- Write a FastAPI endpoint that accepts a file upload, embeds it, and stores it in pgvector.
- Measure and reduce latency on a 3-step LLM chain using prompt caching.
- → Swap Chroma for pgvector and measure recall on the same queries.
- → Try 4 chunking strategies and eval with Ragas. Which wins on your data?
- Is RAG really the right answer, or would fine-tuning be cheaper?
- What breaks first when your corpus grows 100x?
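Two of the four chunking strategies from the exercise above can be sketched in a few lines — fixed-size windows with overlap, and paragraph splitting (sizes and overlap here are illustrative defaults, not recommendations):

```python
def fixed_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Sliding character window; the overlap preserves context at boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def paragraph_chunks(text: str) -> list[str]:
    # Split on blank lines; respects the author's own structure.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Which strategy wins is an empirical question — exactly what the Ragas eval in the exercise is for.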
Intermediate — Ships Production AI
Agents, tool use, evals, observability, and cost control under real traffic.
- Build an agent that can book a meeting by calling 3 real tools.
- Add an eval suite with 50 hand-written test cases and track regression.
- Instrument a RAG app with LangSmith and find the slowest span.
- → Compare ReAct vs. Planner-Executor on the same task. Which hallucinates more?
- → Reproduce a known prompt injection against your own agent. Patch it.
- Where does the agent abstraction leak, and how do you contain it?
- What's the right eval when the "correct" answer isn't a string?
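The eval-suite exercise above can start very small. A minimal sketch, assuming each test case carries a callable judge rather than an exact expected string (one answer to the "correct answer isn't a string" question):

```python
def run_eval(model, cases: list[dict]) -> float:
    # Each case: {"input": ..., "expect": callable judging the output}.
    # A callable judge tolerates non-deterministic outputs better than
    # exact string matching.
    passed = sum(1 for c in cases if c["expect"](model(c["input"])))
    return passed / len(cases)

def check_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    # Fail the build if the score drops more than `tolerance` below baseline.
    return score >= baseline - tolerance
```

Wire `run_eval` into CI against your 50 hand-written cases, store the baseline score, and `check_regression` becomes the gate on every PR.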
Senior — Owns AI Platform
Architecture, cost, reliability, team leverage, and knowing what NOT to build.
- Fine-tune a 7B model with LoRA on a domain-specific dataset and measure gains.
- Build a cost-per-request dashboard that attributes spend per feature.
- Design an on-call runbook for the AI platform.
- → Run a shadow-traffic A/B between a proprietary and open-weight model. What's the actual quality gap?
- → Quantize a served model from fp16 to int8. Measure latency, memory, and quality delta.
- What's the smallest model that solves your actual problem?
- When is the right moment to stop using an LLM at all?
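The cost-per-request dashboard above starts with one function: price tokens per request. A minimal sketch — the model names and per-million-token prices here are made up; real numbers vary by provider and change often:

```python
# Hypothetical per-million-token prices (USD); check your provider's pricing.
PRICES = {
    "small": {"in": 0.15, "out": 0.60},
    "large": {"in": 3.00, "out": 15.00},
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    # Input and output tokens are priced separately on most providers.
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
```

Tag each request with a feature id, sum `request_cost` per tag, and you have the attribution the dashboard needs — plus a number to compare against the "smallest model that solves your actual problem."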
Ship the whole thing
Ship an end-to-end AI product — RAG over a real corpus, agent loop with tools, eval suite, and a deployed FastAPI backend.
- Reproducible eval suite with at least 50 hand-written test cases that runs on every PR
- Handles prompt injection attempts without leaking system prompts or private data
- Answers include inline citations to the underlying source chunks
- Cost per request tracked and budgeted, with a hard ceiling that trips an alert
- p95 latency under 3 seconds end-to-end on the golden query set
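The p95 requirement above is checkable with a few lines. A sketch using the nearest-rank percentile method (the 3-second budget comes from the checklist; the method choice is an assumption):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    # Nearest-rank percentile: sort, take the ceil(0.95 * n)-th value.
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def within_budget(latencies_ms: list[float], budget_ms: float = 3000.0) -> bool:
    # Run the golden query set, collect end-to-end latencies, gate on p95.
    return p95(latencies_ms) <= budget_ms
```

Run this over the golden query set in CI alongside the eval suite, so a latency regression fails a PR the same way a quality regression does.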
This roadmap follows the AI Engineer track in four phases, from prerequisites through senior platform ownership. Each chip links to the wiki entry for that topic — and where the entry doesn't exist yet, it points at the backlog so you know what's coming next.