vLLM

High-throughput LLM inference server with PagedAttention, continuous batching, and an OpenAI-compatible API.

Category: API & Serving
Difficulty: Intermediate
When to use: You're self-hosting an open-weights LLM and care about throughput, latency, and GPU utilization.
When not to use: You only need to run a model on a laptop — llama.cpp or Ollama is simpler.
Alternatives: TensorRT-LLM, TGI, llama.cpp, Ollama

At a glance

Field            Value
Category         LLM inference server
Difficulty       Intermediate
When to use      Production self-hosted LLM serving on GPUs
When not to use  Laptop / edge inference
Alternatives     TensorRT-LLM, TGI, llama.cpp, Ollama

What it is

vLLM is a high-performance inference server for open LLMs. Its core contributions are PagedAttention (KV cache stored in fixed-size pages so memory fragmentation doesn’t kill throughput) and continuous batching (new requests join the running batch between decode steps). It exposes an OpenAI-compatible /v1/chat/completions endpoint so existing OpenAI clients work unchanged.
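
The paging idea can be sketched in a few lines. This is a toy allocator to show the concept, not vLLM's actual implementation; the class and method names are made up, though 16 tokens per block matches vLLM's default block size.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Toy paged KV-cache allocator: each sequence maps logical token
    positions to fixed-size physical blocks, so free memory never
    fragments into unusably small contiguous chunks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:        # previous block full: grab any free one
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]  # physical block holding this token

    def free_seq(self, seq_id):
        # On completion, every block returns to the pool for immediate reuse.
        self.free.extend(self.tables.pop(seq_id))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                    # a 40-token sequence needs 3 blocks
    alloc.append_token("req-1", pos)
print(len(alloc.tables["req-1"]))        # 3
```

Because blocks are fixed-size and need not be contiguous, a finished request's memory can be handed to any waiting request, which is what keeps GPU utilization high under mixed-length workloads.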

When we reach for it at Ephizen

  • Hosting Llama, Mistral, Qwen, or similar open models behind an API.
  • Replacing an OpenAI dependency with a self-hosted model without rewriting client code.
  • Any GPU serving job where tokens-per-second and tail latency actually matter.
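
Continuous batching is the main reason tail latency improves over static batching: the scheduler admits new requests at decode-step granularity instead of waiting for the whole batch to drain. A toy simulation of the idea (a made-up scheduler, not vLLM's; request lengths here stand in for output token counts):

```python
from collections import deque

def continuous_batching(arrivals, max_batch=4):
    """Each step decodes one token for every running request and admits
    queued requests into freed slots. arrivals: {step: [(req_id, n_tokens)]}."""
    waiting, running, finished = deque(), {}, {}
    step = 0
    while waiting or running or any(s >= step for s in arrivals):
        waiting.extend(arrivals.get(step, []))
        while waiting and len(running) < max_batch:  # join between decode steps
            rid, n = waiting.popleft()
            running[rid] = n
        for rid in list(running):                    # one decode step for the batch
            running[rid] -= 1
            if running[rid] == 0:
                finished[rid] = step
                del running[rid]                     # slot frees immediately
        step += 1
    return finished

# A short request arriving mid-flight finishes without waiting for the long ones.
done = continuous_batching({0: [("long-a", 10), ("long-b", 10)], 2: [("short", 2)]})
print(done["short"])  # finishes at step 3, long before the 10-token requests
```

With a static batcher, "short" would queue behind the entire first batch; here it completes six steps earlier, which is exactly the tail-latency win.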

Getting started

# Install and start the OpenAI-compatible server
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192

Then point any OpenAI client at it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)

Gotchas

  • Tensor parallelism across multiple GPUs needs careful --tensor-parallel-size tuning; bigger isn't always faster, since inter-GPU communication overhead grows with the degree.
  • KV-cache memory per sequence scales linearly with --max-model-len, so an oversized limit cuts how many concurrent requests fit in the cache; set it to what you actually need.
  • Quantized models (AWQ, GPTQ, FP8) are supported but check current coverage for your specific model.
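
A back-of-envelope for the max-model-len point. The config numbers below are assumptions for a Llama-3.2-3B-class model (28 layers, 8 KV heads, head dim 128), not values read from vLLM; check your model's config.json before trusting the result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for the K and V tensors; dtype_bytes=2 for bf16/fp16.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Assumed Llama-3.2-3B-class config at the server's --max-model-len 8192:
per_seq = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{per_seq / 2**30:.2f} GiB per max-length sequence")  # 0.88 GiB
```

Roughly 0.9 GiB of KV cache per worst-case sequence: halving --max-model-len roughly doubles how many such sequences fit in whatever memory vLLM has reserved for the cache.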

Related tools