DSPy (originally derived from "Demonstrate–Search–Predict") is a research framework from Omar Khattab and the Stanford NLP group. Its thesis: stop writing prompts; write programs and let an optimiser compile the prompts.
The shift in mental model
| Traditional | DSPy |
|---|---|
| Engineer hand-tunes a prompt string | Engineer declares a signature: `question -> answer` |
| Few-shot examples chosen manually | Optimiser searches and selects examples |
| Pipeline tweaked by trial and error | Optimiser optimises end-to-end metric |
Core abstractions
- Signature: a typed I/O contract, e.g. `question: str -> answer: str`.
- Module: a unit of computation (`Predict`, `ChainOfThought`, `ReAct`, `ProgramOfThought`).
- Program: a composition of modules.
- Optimiser (teleprompter): compiles a program by searching over prompts and few-shot selections, or by fine-tuning weights, to maximise a metric.
Example

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often 1-5 words")

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()  # register sub-modules with dspy.Module
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Compile against a small training set + metric
optimiser = dspy.BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled = optimiser.compile(RAG(), trainset=trainset)
```
Notice there is no prompt string anywhere. The compiler synthesises and tunes prompts automatically.
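The metric handed to the optimiser is just a function that scores a prediction against a gold example. A minimal sketch, assuming a simplified stand-in for `dspy.evaluate.answer_exact_match` (the normalisation below is illustrative, not DSPy's exact implementation):

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as mismatches.
    return " ".join(text.lower().split())

def answer_exact_match(example: dict, prediction: dict) -> bool:
    # The example carries the gold answer; the prediction carries the
    # model's answer. The optimiser maximises the mean of this score.
    return normalize(example["answer"]) == normalize(prediction["answer"])

print(answer_exact_match({"answer": "Paris"}, {"answer": " paris "}))  # True
```

Because the metric is an ordinary function, swapping exact match for F1 or an LM-judged score requires no change to the program itself.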
Optimisers
- BootstrapFewShot, generates and filters demonstrations.
- MIPRO (Multi-prompt Instruction PRoposal Optimiser), joint optimisation over instructions and few-shot.
- BootstrapFinetune, actually updates model weights via knowledge distillation.
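The bootstrap idea can be sketched without DSPy: run a teacher program over the training set, keep only the traces the metric accepts, and attach the survivors as few-shot demonstrations. The `teacher` and `metric` callables here are hypothetical stand-ins, not DSPy's API:

```python
def bootstrap_few_shot(teacher, metric, trainset, max_demos=4):
    """Collect metric-passing (question, answer) traces as demonstrations."""
    demos = []
    for example in trainset:
        prediction = teacher(example["question"])  # run the teacher program
        if metric(example, prediction):            # keep only correct traces
            demos.append({"question": example["question"],
                          "answer": prediction["answer"]})
        if len(demos) >= max_demos:
            break
    return demos

# Stub teacher: answers correctly only for questions it "knows".
known = {"Capital of France?": "Paris"}
teacher = lambda q: {"answer": known.get(q, "unknown")}
metric = lambda ex, pred: ex["answer"] == pred["answer"]
trainset = [{"question": "Capital of France?", "answer": "Paris"},
            {"question": "Capital of Peru?", "answer": "Lima"}]

demos = bootstrap_few_shot(teacher, metric, trainset)
# Only the metric-passing trace survives as a demonstration.
```

The real BootstrapFewShot does this over full execution traces (including intermediate reasoning), but the filter-by-metric loop is the essence.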
Why this matters
Hand-tuned prompts are brittle: change the model and your prompt must be re-engineered. DSPy treats prompt-engineering the way PyTorch treats hyper-parameter search, as an optimisation problem. The same program runs on GPT-4, Claude, or Llama-3 with the optimiser handling the model-specific phrasing.
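The portability point can be illustrated with stubs: the program depends only on an abstract LM interface, so only the configured backend changes (the `lm` callables below are hypothetical stand-ins for real model clients, not DSPy's API):

```python
def qa_program(lm, question):
    # The program references no model-specific prompt string;
    # in DSPy, the compiler supplies that per backend.
    return lm(f"question: {question}")

gpt4 = lambda prompt: "Paris"    # stand-in for a GPT-4 client
llama = lambda prompt: "Paris"   # stand-in for a Llama-3 client

# The same program runs unchanged on either backend.
assert qa_program(gpt4, "Capital of France?") == qa_program(llama, "Capital of France?")
```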
Empirical results
On HotpotQA, GSM8K, and other benchmarks DSPy-compiled programs reliably match or beat hand-tuned prompts, with 1–3 hours of optimiser time replacing weeks of human prompt engineering.
Modern relevance
DSPy is the most academically influential framework: it has popularised prompt-as-program thinking and inspired the optimiser ecosystem (TextGrad, OPRO, Adalflow). For production it has a steeper learning curve than LangChain but pays off when an evaluation metric is well-defined.
Citation
Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
Related terms: LangChain, LlamaIndex, Chain-of-Thought, ReAct, Retrieval-Augmented Generation
Discussed in:
- Chapter 15: Modern AI