Ollama

A one-command local runner for open LLMs, built on llama.cpp. Pulls models from a registry and exposes an HTTP API.

Category: API & Serving
Difficulty: Beginner
When to use: You want to run an open LLM on your laptop or a small server with zero setup — demos, prototypes, offline work.
When not to use: You're serving at production scale with many concurrent users — use vLLM or TGI instead.
Alternatives: llama.cpp, LM Studio, vLLM, GPT4All

At a glance

Field            Value
Category         Local LLM runner
Difficulty       Beginner
When to use      Laptop / offline / prototype LLM work
When not to use  Multi-user production serving
Alternatives     llama.cpp, LM Studio, vLLM

What it is

Ollama wraps llama.cpp with a model registry, a simple CLI, and an HTTP API. ollama run llama3.2 pulls the model (quantized GGUF), starts a local server, and drops you into a chat. The API is compatible enough with OpenAI’s chat-completions shape that most LangChain/LlamaIndex integrations just work.
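"Compatible enough" hides one real difference worth knowing: OpenAI's chat-completions body puts sampling parameters (temperature, top_p) at the top level, while Ollama's native /api/chat nests them under "options". A minimal adapter sketch — the helper name is ours, and it only covers those two parameters:

```python
# OpenAI-style request -> Ollama native /api/chat payload.
# "model", "messages", and "stream" carry over unchanged; sampling
# parameters move from the top level into Ollama's "options" object.
def to_ollama_chat(req: dict) -> dict:
    out = {
        "model": req["model"],
        "messages": req["messages"],
        "stream": req.get("stream", False),
    }
    options = {k: req[k] for k in ("temperature", "top_p") if k in req}
    if options:
        out["options"] = options
    return out

native = to_ollama_chat({
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "hi"}],
    "temperature": 0.2,
})
```

For everything this adapter covers, you can skip it entirely and point an OpenAI-compatible client at Ollama's /v1 endpoint instead.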

When we reach for it at Ephizen

  • Prototyping an LLM feature without a GPU or an API bill.
  • Demos on airplanes and conference floors where Wi-Fi is unreliable.
  • Onboarding new engineers — five minutes from install to first token.
  • Running a small embedding or reranker model alongside a FastAPI service.
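The last bullet is the pattern we reach for most. A minimal sketch, assuming the daemon is on its default port (11434) and an embedding model such as nomic-embed-text has been pulled; the session parameter is plain dependency injection so the function can be exercised without a live server:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text: str, model: str = "nomic-embed-text",
          session=requests) -> list[float]:
    """Fetch an embedding vector from a local Ollama daemon.

    `session` defaults to the requests module; pass a stub in tests.
    """
    r = session.post(f"{OLLAMA_URL}/api/embeddings",
                     json={"model": model, "prompt": text}, timeout=30)
    r.raise_for_status()
    return r.json()["embedding"]
```

Inside a FastAPI service this drops in as a dependency; the same shape works for any small model you serve the same way.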

Getting started

brew install ollama             # or your platform's installer
ollama serve &                  # run the daemon
ollama pull llama3.2            # ~2 GB quantized
ollama run llama3.2 "Explain attention in one sentence."

Then call the API from Python:

import requests

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": False,
})
print(r.json()["message"]["content"])
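Streaming is the default for /api/chat: leave "stream" out and the server returns newline-delimited JSON, one chunk per token batch, with the final chunk setting "done": true. A sketch of the client-side loop — the chunk fields follow Ollama's documented streaming format, but the helper itself is ours:

```python
import json

def collect_stream(lines) -> str:
    """Reassemble a full reply from Ollama's NDJSON stream chunks.

    `lines` is any iterable of JSON lines, e.g. the result of
    requests.post(..., stream=True).iter_lines().
    """
    parts = []
    for line in lines:
        if not line:
            continue  # iter_lines() can yield keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In practice you'd print each part as it arrives rather than join at the end; that is the whole point of streaming.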

Gotchas

  • Default models are heavily quantized (q4). Quality is fine for chat, less fine for eval work — use a larger quant for serious tests.
  • GPU acceleration on Linux requires CUDA (NVIDIA) or ROCm (AMD); on a Mac, Metal is used automatically.
  • Concurrency is limited — one model loaded at a time by default. For production throughput, use vLLM.

Related tools