# Ollama
A one-command local runner for open LLMs, built on llama.cpp. Pulls models from a registry and exposes an HTTP API.
- **Category:** API & Serving
- **Difficulty:** Beginner
- **When to use:** you want to run an open LLM on your laptop or a small server with zero setup — demos, prototypes, offline work.
- **When not to use:** you're serving at production scale with many concurrent users — use vLLM or TGI instead.
- **Alternatives:** llama.cpp, LM Studio, vLLM, GPT4All
## At a glance
| Field | Value |
|---|---|
| Category | Local LLM runner |
| Difficulty | Beginner |
| When to use | Laptop / offline / prototype LLM work |
| When not to use | Multi-user production serving |
| Alternatives | llama.cpp, LM Studio, vLLM |
## What it is
Ollama wraps llama.cpp with a model registry, a simple CLI, and an HTTP API. `ollama run llama3.2` pulls the model (quantized GGUF), starts a local server, and drops you into a chat. The API is compatible enough with OpenAI's chat-completions shape that most LangChain/LlamaIndex integrations just work.
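Concretely, Ollama serves the OpenAI-style `/v1/chat/completions` route locally. A minimal sketch with `requests`, assuming the daemon is running and `llama3.2` has been pulled:

```python
import requests

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

try:
    # Same request shape an OpenAI client sends; only the base URL differs.
    r = requests.post("http://localhost:11434/v1/chat/completions",
                      json=payload, timeout=60)
    r.raise_for_status()
    print(r.json()["choices"][0]["message"]["content"])
except requests.exceptions.ConnectionError:
    print("Ollama is not running; start it with `ollama serve`.")
```

Pointing an existing OpenAI SDK client at `base_url="http://localhost:11434/v1"` with any dummy API key works the same way.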
## When we reach for it at Ephizen
- Prototyping an LLM feature without a GPU or an API bill.
- Demos on airplanes and conference floors where Wi-Fi is unreliable.
- Onboarding new engineers — five minutes from install to first token.
- Running a small embedding or reranker model alongside a FastAPI service.
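That last pattern is a few lines against the `/api/embeddings` endpoint. A sketch, assuming an embedding model such as `nomic-embed-text` has been pulled (the model name here is an example, not a requirement):

```python
import requests

def embed(text, model="nomic-embed-text"):
    """Fetch an embedding vector from the local Ollama daemon."""
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": model, "prompt": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a, b):
    """Plain-Python cosine similarity; fine for prototype-scale ranking."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)
```

Inside a FastAPI route, something like `cosine(embed(query), embed(candidate))` is enough to rank a handful of candidates without an external embedding service.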
## Getting started
```shell
brew install ollama     # or your platform's installer
ollama serve &          # run the daemon
ollama pull llama3.2    # ~2 GB quantized
ollama run llama3.2 "Explain attention in one sentence."
```
```python
import requests

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": False,
})
print(r.json()["message"]["content"])
```
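With `"stream": true` (the default), `/api/chat` instead returns newline-delimited JSON chunks, each carrying a slice of `message.content`, ending with a `"done": true` record. A sketch of collecting them (the sample chunks below are synthetic, in the shape the endpoint emits):

```python
import json

def collect_stream(lines):
    """Join assistant text out of Ollama's NDJSON stream chunks."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        if not chunk.get("done"):
            parts.append(chunk["message"]["content"])
    return "".join(parts)

# Synthetic chunks shaped like a streaming /api/chat response:
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # → Hello
```

With `requests`, pass `stream=True` and feed `r.iter_lines()` to the same function to print tokens as they arrive.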
## Gotchas
- Default models are heavily quantized (q4). Quality is fine for chat, less fine for eval work — use a larger quant for serious tests.
- GPU acceleration on Linux needs NVIDIA (CUDA) or AMD (ROCm) drivers; on macOS, Metal is used automatically.
- Concurrency is limited — one model loaded at a time by default. For production throughput, use vLLM.