# vLLM
High-throughput LLM inference server with PagedAttention, continuous batching, and an OpenAI-compatible API.
- **Category:** API & Serving
- **Difficulty:** Intermediate
- **When to use:** You're self-hosting an open-weights LLM and care about throughput, latency, and GPU utilization.
- **When not to use:** You only need to run a model on a laptop; llama.cpp or Ollama is simpler.
- **Alternatives:** TensorRT-LLM, TGI, llama.cpp, Ollama
## At a glance
| Field | Value |
|---|---|
| Category | LLM inference server |
| Difficulty | Intermediate |
| When to use | Production self-hosted LLM serving on GPUs |
| When not to use | Laptop / edge inference |
| Alternatives | TensorRT-LLM, TGI, llama.cpp, Ollama |
## What it is

vLLM is a high-performance inference server for open LLMs. Its core contributions are PagedAttention (the KV cache is stored in fixed-size pages, so memory fragmentation doesn't kill throughput) and continuous batching (new requests join the running batch between decode steps). It exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so existing OpenAI clients work unchanged.
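The paging idea can be sketched as a toy block table. This is illustrative only, not vLLM's internals; `BlockTable`, the pool, and the sizes are invented for the sketch (though 16 tokens per block matches vLLM's default block size):

```python
# Toy sketch of PagedAttention-style KV paging: each sequence's KV cache
# lives in fixed-size blocks drawn from a shared pool, allocated on demand
# instead of reserving one contiguous max-length slab per sequence.

BLOCK_SIZE = 16  # tokens per KV block (assumption; vLLM defaults to 16)

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        # Grab a new physical block only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def free(self):
        # Return every block to the shared pool when the sequence finishes,
        # leaving no fragmentation behind.
        self.free_blocks.extend(self.blocks)
        self.blocks.clear()

pool = list(range(100))   # 100 physical blocks available server-wide
seq = BlockTable(pool)
for _ in range(40):       # decode 40 tokens
    seq.append_token()
# 40 tokens at 16 per block -> only 3 blocks allocated
```

The payoff is that memory is committed per generated token, so many more sequences fit on one GPU than with contiguous max-length allocation.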
## When we reach for it at Ephizen
- Hosting Llama, Mistral, Qwen, or similar open models behind an API.
- Replacing an OpenAI dependency with a self-hosted model without rewriting client code.
- Any GPU serving job where tokens-per-second and tail latency actually matter.
## Getting started

```shell
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192
```
```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)
```
## Gotchas

- Tensor parallelism across multiple GPUs needs careful `--tensor-parallel-size` tuning; bigger isn't always faster.
- KV cache memory grows linearly with max model length; set `--max-model-len` to what you actually need.
- Quantized models (AWQ, GPTQ, FP8) are supported, but check current coverage for your specific model.
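To size the max-model-len gotcha, here is an illustrative back-of-envelope for KV cache cost per token. The shape numbers are assumptions in the rough Llama-3.2-3B class (28 layers, 8 KV heads, head dim 128); read the real values from your model's `config.json`:

```python
# Back-of-envelope KV cache cost. All shape numbers below are illustrative
# assumptions, not read from any real checkpoint.

num_layers   = 28    # assumed transformer layer count
num_kv_heads = 8     # grouped-query attention: KV heads, not query heads
head_dim     = 128   # assumed per-head dimension
dtype_bytes  = 2     # bfloat16

# K and V are each cached per layer, per KV head, per head-dim element.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token)        # 114688 bytes = 112 KiB per token

# Cost is linear in --max-model-len: an 8192-token budget reserves
# roughly 0.9 GiB of KV cache per full-length sequence.
gib_at_8k = bytes_per_token * 8192 / 2**30
print(round(gib_at_8k, 2))    # 0.88
```

Halving `--max-model-len` roughly doubles how many full-length sequences fit in the same KV cache budget, which is why the default of "whatever the model supports" is often wasteful.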