Glossary

DSPy

DSPy ("Demonstrate–Search–Predict") is a research framework from Omar Khattab and the Stanford NLP group. Its thesis: stop writing prompts; write programs and let an optimiser compile the prompts.

The shift in mental model

Traditional                               DSPy
Engineer hand-tunes prompt string         Engineer declares signature question -> answer
Few-shot examples chosen manually         Optimiser searches and selects examples
Pipeline tweaked by trial and error       Optimiser optimises the end-to-end metric

Core abstractions

  1. Signature, a typed I/O contract: question: str -> answer: str.
  2. Module, a unit of computation (Predict, ChainOfThought, ReAct, ProgramOfThought).
  3. Program, composition of modules.
  4. Optimiser (Teleprompter), compiles a program: searches over prompts, few-shot selections, or fine-tunes weights to maximise a metric.
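As a toy illustration of the signature idea (a sketch, not DSPy's actual parser), a spec string like "context, question -> answer" can be split into typed input and output field names:

```python
def parse_signature(spec: str):
    """Parse a minimal 'in1, in2 -> out1' signature string into
    input and output field names (toy sketch, not DSPy internals)."""
    inputs, outputs = spec.split("->")
    return (
        [field.strip() for field in inputs.split(",")],
        [field.strip() for field in outputs.split(",")],
    )

ins, outs = parse_signature("context, question -> answer")
# ins == ["context", "question"], outs == ["answer"]
```

DSPy's real signatures carry more than names (docstrings, field descriptions, types), but the contract-first shape is the same.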

Example

import dspy

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer   = dspy.OutputField(desc="often 1-5 words")

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Compile against a small training set + metric
optimiser = dspy.BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled = optimiser.compile(RAG(), trainset=trainset)

Notice there is no prompt string anywhere. The compiler synthesises and tunes prompts automatically.
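The metric passed to the optimiser above, answer_exact_match, is essentially a normalised string comparison. A simplified stand-in (a sketch, not DSPy's implementation) looks like:

```python
import string

def exact_match(prediction: str, gold: str) -> bool:
    """Simplified exact-match metric: lowercase, drop punctuation,
    strip whitespace, then compare (not DSPy's actual code)."""
    def normalise(text: str) -> str:
        return text.lower().translate(
            str.maketrans("", "", string.punctuation)
        ).strip()
    return normalise(prediction) == normalise(gold)

exact_match("Paris.", "paris")  # True
```

Any Python callable with this (prediction, gold) -> score shape can serve as the metric, which is what lets the compiler optimise arbitrary pipelines end-to-end.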

Optimisers

  • BootstrapFewShot, generates and filters demonstrations.
  • MIPRO (Multi-prompt Instruction PRoposal Optimiser), joint optimisation over instructions and few-shot.
  • BootstrapFinetune, goes beyond prompting and updates model weights, distilling bootstrapped demonstrations into a fine-tuned model.
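In spirit, BootstrapFewShot runs the program over the training set and keeps only the traces the metric accepts, using them as few-shot demonstrations. A self-contained sketch with a stand-in "program" (a lookup table playing the role of the LM call; none of these names are DSPy's):

```python
def bootstrap_few_shot(program, trainset, metric, max_demos=3):
    """Toy sketch of demonstration bootstrapping (not DSPy's code):
    run the program on training inputs, keep input/output pairs
    the metric accepts, and return them as demonstrations."""
    demos = []
    for question, gold in trainset:
        prediction = program(question)
        if metric(prediction, gold):
            demos.append((question, prediction))
        if len(demos) >= max_demos:
            break
    return demos

# Stand-in "program": a dict lookup in place of an LM call.
fake_lm = {"2+2?": "4", "capital of France?": "Paris", "colour of sky?": "green"}
trainset = [("2+2?", "4"), ("capital of France?", "Paris"), ("colour of sky?", "blue")]
demos = bootstrap_few_shot(fake_lm.get, trainset, lambda p, g: p == g)
# demos == [("2+2?", "4"), ("capital of France?", "Paris")]
```

The real optimiser also records intermediate traces (e.g. chain-of-thought rationales) so that every module in the pipeline receives demonstrations, not just the final answer step.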

Why this matters

Hand-tuned prompts are brittle: change the model and the prompt must be re-engineered. DSPy treats prompt engineering the way PyTorch treats weight tuning: as an optimisation problem rather than manual craft. The same program runs on GPT-4, Claude, or Llama 3, with the optimiser handling the model-specific phrasing.

Empirical results

On HotpotQA, GSM8K, and other benchmarks DSPy-compiled programs reliably match or beat hand-tuned prompts, with 1–3 hours of optimiser time replacing weeks of human prompt engineering.

Modern relevance

DSPy is among the most academically influential prompting frameworks: it popularised prompt-as-program thinking and inspired an ecosystem of optimisers (TextGrad, OPRO, AdalFlow). For production use it has a steeper learning curve than LangChain, but it pays off when a well-defined evaluation metric is available.

Citation

Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.

Related terms: LangChain, LlamaIndex, Chain-of-Thought, ReAct, Retrieval-Augmented Generation
