Toolformer was introduced by Schick et al. (Meta AI, 2023) in "Toolformer: Language Models Can Teach Themselves to Use Tools". It is the seminal answer to the question: how do you train tool use into a model when you don't have a labelled dataset of correct tool calls?
The self-supervised recipe
1. Take a vanilla LLM and a corpus $C = \{x_1, x_2, \dots\}$.
2. For each text position, prompt the LLM with a few-shot example asking "insert a useful API call here".
3. Sample candidate annotations. For example, the sentence "Pittsburgh is also known as the Steel City" might be augmented with [QA(What is Pittsburgh known for?) → Steel City].
4. Execute each API call to get its response.
5. Apply the filtering criterion: keep an annotation only if it lowers the weighted cross-entropy loss of the subsequent tokens versus the unaugmented baseline. Formally, retain call $c$ if
$$L_i(c) - L_i(\varepsilon) < -\tau$$
where $L_i(c)$ is the loss with the call inserted, $L_i(\varepsilon)$ the loss without it, and $\tau$ a margin.
6. Fine-tune the model on the surviving annotated corpus.
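The filtering step (5) can be sketched in a few lines. The sketch below is illustrative only: `toy_loss`, `keep_call`, and the word-overlap scoring are invented stand-ins for the model's actual weighted cross-entropy, chosen so the example runs self-contained; the retention test itself matches the criterion $L_i(c) - L_i(\varepsilon) < -\tau$.

```python
import re

def toy_loss(prefix: str, continuation: str) -> float:
    # Toy stand-in for the model's cross-entropy on the continuation:
    # one unit of "loss" per continuation word not already in the prefix.
    # (A real implementation would score tokens with the LM itself.)
    seen = set(re.findall(r"[a-z]+", prefix.lower()))
    return sum(1.0 for w in re.findall(r"[a-z]+", continuation.lower())
               if w not in seen)

def keep_call(call_text: str, context: str, continuation: str,
              tau: float = 1.0) -> bool:
    """Retain the API call iff inserting it lowers the loss on the
    subsequent tokens by more than the margin tau: L(c) - L(eps) < -tau."""
    loss_with = toy_loss(context + " " + call_text, continuation)
    loss_without = toy_loss(context, continuation)
    return loss_with - loss_without < -tau

context = "Pittsburgh is also known as"
good = "[QA(What is Pittsburgh known for?) → Steel City]"
bad = "[QA(What is the capital of Pittsburgh?) → Harrisburg]"
print(keep_call(good, context, "the Steel City"))  # useful call survives
print(keep_call(bad, context, "the Steel City"))   # unhelpful call filtered
```

The margin $\tau$ is what makes the filter conservative: a call must help by a clear amount, not merely break even, before it earns a place in the fine-tuning corpus.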
The brilliant insight is that a tool call is useful only if it makes the surrounding text easier to predict. A wrong call ([QA(What is the capital of Pittsburgh?)]) hurts perplexity and is filtered. A right call ([Calculator(400 / 1400) → 0.286]) helps.
Tools tested
- QA system
- Calculator
- Wikipedia search
- Machine translation
- Calendar
Results
A 6.7B Toolformer-trained GPT-J beat GPT-3 (175B) on several benchmarks (LAMA, math word problems, multilingual QA) despite being roughly 26× smaller, purely because it knew when to outsource to a tool.
Modern relevance
Toolformer is the conceptual predecessor of every modern function-calling fine-tune. Today's pipelines from OpenAI, Anthropic, and Google use vastly larger and richer synthetic datasets, but the core trick (letting the model annotate its own corpus and filtering by downstream usefulness) remains influential. It also prefigures process reward models and self-play training in reasoning models.
Limitations
- Each call is independent: there is no chaining of tool calls, unlike ReAct.
- Tools are baked into model weights at training time; modern function calling is dynamic at inference time.
Citation
Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761.
Related terms: Tool Use, Function Calling, ReAct
Discussed in:
- Chapter 15: Modern AI, Tool Use