FastAPI

Async Python web framework built on Starlette and Pydantic. Our default choice for wrapping models as HTTP services.

Category: API & Serving
Difficulty: Beginner
When to use: You need a typed, async Python HTTP service — especially one that serves ML models, proxies LLM calls, or exposes a RAG pipeline.
When not to use: You need server-rendered HTML at scale, SSR, or a batteries-included admin — reach for Django. Pure static sites belong elsewhere.
Alternatives: Flask, Django REST Framework, Litestar, BentoML

At a glance

| Field | Value |
| --- | --- |
| Category | API & Serving |
| Difficulty | Beginner → Intermediate |
| When to use | Typed async Python HTTP services; ML/LLM inference endpoints |
| When not to use | Server-rendered HTML apps, admin panels, pure static sites |
| Alternatives | Flask, Django REST Framework, Litestar, BentoML |

What it is

FastAPI is a Python web framework that sits on top of two libraries:

  • Starlette for the ASGI server plumbing (routing, middleware, WebSockets).
  • Pydantic for request/response validation and serialization.

You write Python type hints. FastAPI turns them into request parsers, response serializers, and an OpenAPI schema that auto-generates docs at /docs. That single property — types are the contract — is why we reach for it first.

The async model in one paragraph

FastAPI runs under an ASGI server (uvicorn, hypercorn). Endpoints declared with async def are awaited on the event loop; endpoints declared with def are offloaded to a thread pool. Use async def when you’re awaiting I/O (HTTP, DB drivers that support async, LLM SDKs). Use plain def when the work is CPU-bound or the library is blocking (e.g. a sync database driver). Mixing the two is fine and common.

Pydantic validation

Request bodies, query params, and responses are declared as Pydantic models. If the caller sends garbage, FastAPI returns a 422 with a precise JSON error before your handler ever runs. This is the main reason we avoid Flask for model-serving work: the validation boilerplate just disappears.

from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    text: str = Field(min_length=1, max_length=8000)
    threshold: float = Field(0.5, ge=0, le=1)

class PredictResponse(BaseModel):
    label: str
    score: float
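To see what that buys you, the same guarantees can be exercised directly against the model, outside any request (assumes Pydantic v2):

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    text: str = Field(min_length=1, max_length=8000)
    threshold: float = Field(0.5, ge=0, le=1)

# Valid: threshold falls back to its default of 0.5.
ok = PredictRequest(text="hello")
assert ok.threshold == 0.5

# Invalid: empty text and out-of-range threshold are both caught
# before any handler code would run; FastAPI turns this into the 422.
try:
    PredictRequest(text="", threshold=2.0)
except ValidationError as err:
    print(err.error_count())  # 2 errors: string_too_short, less_than_equal
```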

Dependency injection

FastAPI’s Depends() is a lightweight DI container. Use it for:

  • Loading the model once at startup and handing it to every request.
  • Auth (API keys, JWTs) — declare Depends(get_current_user) and the handler only runs if the dep resolves.
  • DB sessions with proper cleanup.
from fastapi import Depends, FastAPI

app = FastAPI()

def get_model():
    # Loaded into app.state once at startup, not per request.
    return app.state.model

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest, model=Depends(get_model)):
    score = float(model.score(req.text))
    return PredictResponse(
        label="positive" if score >= req.threshold else "negative",
        score=score,
    )

How Ephizen uses it

Every model we ship runs behind FastAPI. The shape is always the same:

  1. lifespan context loads weights into app.state once (never per-request).
  2. Handlers validate input with Pydantic, hand off to the model, return a typed response.
  3. Heavy CPU work (e.g. an XGBoost predict) runs in a threadpool via a plain def endpoint. Async I/O (embedding API calls, vector DB lookups) runs under async def.
  4. A /healthz endpoint returns 200 once the model is loaded so Kubernetes only routes traffic after warmup.

Running it

uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

Rules of thumb for workers: start with 2 * cores for CPU-bound models; for I/O-bound LLM proxies, one worker juggling many concurrent async tasks is usually enough. Don't over-worker a GPU model: each worker is a separate process, so each loads its own copy of the weights.

Common pitfalls

  • BackgroundTasks disappear on shutdown. FastAPI’s BackgroundTasks runs after the response is sent but inside the same process. If the pod is evicted mid-task, work is lost. For anything durable, push to a queue (Redis Streams, Kafka, SQS) and let a worker consume it.
  • CORS silently blocks browsers. Add CORSMiddleware explicitly. Without it, every call from a browser origin is a preflight failure with no server log. The FastAPI server looks fine from curl — developers lose hours.
  • Large uploads pin a worker. UploadFile spools request bodies to a temp file, but reverse-proxy body-size limits and timeouts will bite you around ~100 MB. Stream chunks straight to object storage (e.g. S3) instead of buffering the whole file in the service.
  • Sync DB drivers under async def. Calling psycopg2 from async def blocks the event loop and kills throughput. Either use def + threadpool or switch to asyncpg.
  • Pydantic v1 vs v2. FastAPI ≥ 0.100 uses Pydantic v2. Validator syntax, .dict() vs .model_dump(), and config classes all changed. Pin versions.

Related tools