DSPy

A framework for programming (not prompting) LLMs — declare signatures and modules, then let an optimizer compile prompts and few-shot examples for you.

Category: LLM & Agent Frameworks
Difficulty: Advanced
When to use: You have a well-defined task with examples and want the framework to automatically search over prompts, few-shot demos, and even fine-tunes.
When not to use: You have no labeled examples or eval metric — DSPy's superpower is optimization, and without data there's nothing to optimize.
Alternatives: LangChain, raw prompt engineering, TextGrad

At a glance

Field            Value
---------------  -------------------------------------------
Category         LLM programming framework
Difficulty       Advanced
When to use      Tasks with examples and metrics to optimize
When not to use  Ad hoc prompting with no labeled data
Alternatives     LangChain, TextGrad, hand-tuned prompts

What it is

DSPy (from Stanford NLP) treats LLM pipelines like programs. You declare Signature types (input fields → output fields) and compose Modules (Predict, ChainOfThought, ReAct). A compiler then searches — using your training examples and metric — for the best few-shot demonstrations, prompt templates, or fine-tuned weights. The net effect is “stop hand-crafting prompts; let the optimizer handle it”.

When we reach for it at Ephizen

  • Multi-step pipelines where individual prompts are tangled and brittle.
  • Tasks where we have a good eval set and want reproducible improvements over time.
  • Porting a prompt-heavy pipeline to a smaller, cheaper model via prompt search.

Getting started

import dspy

# Point DSPy at an LM once; all modules pick it up from the global settings.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractFacts(dspy.Signature):
    """Extract a list of facts from the text."""
    text: str = dspy.InputField()
    facts: list[str] = dspy.OutputField()

# ChainOfThought adds a reasoning step before producing the output fields.
extract = dspy.ChainOfThought(ExtractFacts)
print(extract(text="The capital of France is Paris.").facts)

Gotchas

  • The mental model differs from LangChain's: you write programs and metrics, not prompt chains. Expect a real ramp-up.
  • The optimizer can burn a lot of tokens during compilation — use a cheap model for optimization passes.
  • DSPy’s best results come from iterating on the metric, not the prompt. If your metric is bad, DSPy will cheerfully optimize the wrong thing.
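Because a DSPy metric is just a Python callable with the shape `(example, pred, trace=None) -> bool | float`, you can iterate on it and unit-test it without any LM calls. A minimal sketch using plain stand-in objects; the `keyword_match` name and logic are illustrative:

```python
from types import SimpleNamespace

def keyword_match(example, pred, trace=None):
    """Return True only if every gold fact appears verbatim in the prediction."""
    return all(fact in pred.facts for fact in example.facts)

# Exercise the metric with plain objects -- no LM, no API key needed.
example = SimpleNamespace(facts=["Paris is the capital of France."])
good = SimpleNamespace(facts=["Paris is the capital of France."])
bad = SimpleNamespace(facts=["Paris is in Germany."])

print(keyword_match(example, good))  # True
print(keyword_match(example, bad))   # False
```

Tightening the metric like this (exact match vs. substring vs. semantic similarity) usually moves results more than any prompt tweak.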

Related tools