15.13 Tools, function calling and agents

A frontier language model on its own is impressive but inert. It knows nothing past its training cut-off, cannot read the file you have open, cannot check the weather, cannot run a piece of code to verify its own arithmetic, and cannot send an email. Connect that same model to a small set of tools (a web search endpoint, a Python sandbox, a file system, a calendar API, a browser) and the situation changes dramatically. The model can now take actions in the world. Wrap a loop around it, so that the result of each action becomes the input to the next decision, and the model becomes an agent: a system that can pursue goals over multiple steps without a human in the loop on every turn.

The previous section (§15.11) covered retrieval-augmented generation, in which the model is grounded by reading documents pulled from an index. Agents extend that idea from reading to acting. RAG answers the question "what does the model need to know to reply well?" by fetching text. Agents answer the much larger question "what does the model need to do to complete the task?" by issuing commands. The same model weights underlie both; the difference is the surrounding harness. Devin, Cursor's agent mode, Claude with computer use, OpenAI's Operator, GitHub Copilot Workspace and the open-source OpenHands and Aider are all flavours of the same idea: a strong base model, a small inventory of tools, a loop, and a budget.

What an agent is

The simplest useful definition of an LLM agent has four ingredients: a model, a set of tools, some form of memory, and a control loop. The model's job is to look at the current state (what the user asked for, what has happened so far, what the latest tool returned) and decide what to do next. The decision is either to call a tool with specified arguments, or to declare the task complete and return a final answer. The runtime executes the chosen tool, captures its output, appends that output to the conversation, and asks the model to decide again. The loop terminates when the model produces a final answer, when a step budget is exhausted, or when an error policy fires.

That description sounds trivial, and the wrapping code often is: a few dozen lines of Python suffice for a working prototype. The intelligence sits in three places. First, in the base model: a model that cannot reason its way through a multi-step plan will not become competent simply because you wrap a loop around it. Second, in the tool inventory: tools that are well-named, well-documented and orthogonal compose far better than tools that overlap or hide ambiguous failure modes. Third, in the memory architecture: the conversation history alone is rarely enough for tasks of any length, and a serious agent uses some combination of summarisation, scratch-pad notes, vector recall and structured state.

The loop also needs a budget. Without a cap on steps, tokens, monetary spend or side effects, an agent that fails to make progress will burn resources indefinitely, and an agent that has gone off the rails can cause real damage. Practical agent code therefore looks less like a clean recursion and more like an event loop with timeouts, retries, and explicit halt conditions. The budget is also where safety meets engineering: the same mechanism that stops a runaway loop is the mechanism that stops an agent from deleting too many files.
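The loop and its budget fit in a few dozen lines, as the text says. Here is one way it might look, with the model call replaced by a scripted stub (and a single hypothetical `add` tool) so the sketch runs end to end; a real system would call a provider API where `decide` sits.

```python
# Minimal agent loop with an explicit step budget.
# `decide` stands in for a model call; here it is a scripted stub.

def decide(history):
    # A real system would call the model here. The stub calls a tool
    # once, then finishes using the tool's result.
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    result = [m for m in history if m["role"] == "tool"][-1]["content"]
    return {"type": "final", "content": f"The sum is {result}."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_message, max_steps=10):
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):          # the budget: a hard cap on steps
        action = decide(history)
        if action["type"] == "final":   # model declares the task complete
            return action["content"]
        result = TOOLS[action["name"]](**action["args"])
        history.append({"role": "tool", "content": result})
    return "Step budget exhausted."     # explicit halt condition

print(run_agent("What is 2 + 3?"))  # The sum is 5.
```

A production loop adds token and spend caps, timeouts and retries around the tool execution, but the shape is the same.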

Tool use

The mechanics of tool use are now standardised across providers. The host application supplies the model with a list of tool definitions, each consisting of a name, a natural-language description of what the tool does and when to use it, and a JSON Schema describing its arguments and return value. At each turn the model may either reply in plain text or emit a structured tool call: a JSON object naming the tool and supplying arguments. The runtime parses the call, runs the underlying function, and returns the result as a tool-response message. The model continues, conditioning on that result, and the cycle repeats.
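The shapes involved are simple. The following sketch shows a hypothetical `get_weather` tool definition in the common provider pattern (field names are illustrative, not any one vendor's exact API) and a runtime that parses and dispatches a structured call:

```python
import json

# A tool definition: a name, a description the model reads, and a
# JSON Schema for the arguments. (Hypothetical tool; the field names
# follow the common provider pattern, not one vendor's API exactly.)
weather_tool = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call_json, implementations):
    """Parse a structured tool call emitted by the model and run it."""
    call = json.loads(tool_call_json)
    func = implementations[call["name"]]
    return func(**call["arguments"])

impls = {"get_weather": lambda city: f"18 degrees C in {city}"}

# What the model emits at inference time:
call = '{"name": "get_weather", "arguments": {"city": "Auckland"}}'
print(dispatch(call, impls))  # 18 degrees C in Auckland
```

The result string would then be appended to the conversation as a tool-response message for the model's next turn.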

OpenAI's function calling, Anthropic's tool use API and Google's function-calling endpoint are all variations on this pattern. Anthropic's Model Context Protocol, released in late 2024, generalises it: MCP defines a JSON-RPC interface so that tools can live in separate processes, possibly on separate machines, and any MCP-aware client (an IDE, a chat application, an operating system shell) can discover and invoke them. By April 2026 MCP has become the de facto plug-in standard for the model ecosystem; major IDEs, productivity suites, version control systems, databases and even some operating systems expose MCP servers, and a single agent can compose tools across all of them.
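MCP's wire format is JSON-RPC 2.0. A sketch of the kind of request a client sends a tool server follows; the method name `tools/call` matches the protocol's convention, but treat the exact fields as illustrative rather than normative, and the `search_files` tool is hypothetical.

```python
import json

# A JSON-RPC 2.0 request of the kind an MCP client sends to a tool
# server. Fields here are illustrative, not a normative MCP message.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_files",             # hypothetical tool on the server
        "arguments": {"pattern": "*.md"},
    },
}

wire = json.dumps(request)
print(wire)
```

Because the envelope is plain JSON-RPC, the server can live in another process or on another machine, and any MCP-aware client can talk to it without bespoke glue.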

Training a model to use tools well is a non-trivial fine-tuning task in its own right. The recipe by 2026 is roughly: synthesise large numbers of (instruction, tool-call, tool-result, follow-up) traces; supervised fine-tune on those traces; then apply RL with verifiable rewards in which the verifier checks that calls were syntactically valid, that arguments matched the schema, and that the final answer used the returned data correctly. Frontier models trained this way now emit valid tool calls in well over 99 per cent of cases for clearly described tools, and they will defer or ask for clarification when the user's request is ambiguous rather than guessing.
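The verifier in that RL stage can be sketched simply. The checks below are deliberately shallow (a real verifier validates the full JSON Schema and the final answer's use of the data), and the `search_web` schema is a made-up example:

```python
import json

# A sketch of the kind of verifier used in RL with verifiable rewards:
# it checks that an emitted tool call parses, names a known tool, and
# supplies the required arguments.
SCHEMAS = {"search_web": {"required": ["query"]}}

def verify_call(raw):
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0                      # not valid JSON
    if call.get("name") not in SCHEMAS:
        return 0.0                      # unknown tool
    schema = SCHEMAS[call["name"]]
    if not all(k in call.get("arguments", {}) for k in schema["required"]):
        return 0.0                      # missing required argument
    return 1.0                          # reward: a well-formed call

print(verify_call('{"name": "search_web", "arguments": {"query": "x"}}'))  # 1.0
print(verify_call('{"name": "search_web", "arguments": {}}'))              # 0.0
```

The reward is programmatic rather than learned, which is what makes it "verifiable": no reward model is needed for this part of the signal.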

Two finer points matter in practice. The first is structured output: many tools require strict JSON, valid SQL or schema-conformant data. Constrained decoding, in which a finite-state machine over the tokeniser forbids any token that would break the grammar, guarantees validity but can damage quality if the model's natural distribution drifts from the constraint. Modern systems combine schema-conditioned training with a constrained decoder used as a safety net rather than a primary mechanism. The second is tool selection: when the inventory grows to dozens of tools, the model can become confused about which to invoke. Hierarchical tool menus and on-demand tool retrieval, using embedding search over tool descriptions to surface only the relevant ones, are now standard for large agentic systems.
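On-demand tool retrieval can be illustrated with a toy similarity measure. The sketch below uses bag-of-words overlap in place of real embedding similarity; a production system would embed the tool descriptions with a sentence encoder and rank by cosine similarity, and all four tools here are invented for the example:

```python
# On-demand tool retrieval, sketched with word overlap standing in for
# embedding similarity. Tool names and descriptions are hypothetical.
TOOL_DESCRIPTIONS = {
    "search_web": "search the web for current information",
    "run_python": "execute python code in a sandbox",
    "send_email": "send an email message to a recipient",
    "read_file": "read the contents of a file on disk",
}

def retrieve_tools(query, k=2):
    q = set(query.lower().split())
    scored = [
        (len(q & set(desc.lower().split())), name)
        for name, desc in TOOL_DESCRIPTIONS.items()
    ]
    scored.sort(reverse=True)           # highest overlap first
    return [name for _, name in scored[:k]]

print(retrieve_tools("search for information on the web"))
```

Only the retrieved subset is placed in the model's context, which keeps the tool menu short even when the full inventory runs to hundreds of entries.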

ReAct framework

The ReAct framework, introduced by Yao and colleagues in 2022, gave the field a clean prompting pattern for agentic behaviour and a vocabulary that has stuck. ReAct stands for Reasoning + Acting. At each step the model produces a Thought, which is a brief natural-language plan or reflection; an Action, which is a tool invocation; and then receives an Observation, which is the tool's result. The loop continues until the model produces a final answer instead of an action.

A worked example clarifies the pattern. Asked when Apollo 11 launched, the model writes "Thought: I need the launch date; I'll search the web." It then emits "Action: search_web('Apollo 11 launch date')". The runtime returns "Observation: 16 July 1969." The model writes a new thought, "I have the date", and produces the final answer. The thoughts make the plan legible to a human reader; the actions ground the model in the external world; the observations correct the model when its prior beliefs are wrong.
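The Apollo 11 trace above can be replayed by a skeletal ReAct loop. The model is again a scripted stub, the regex that extracts actions is a simplification of what real harnesses do, and `search_web` returns a canned result:

```python
import re

# A ReAct loop reduced to its skeleton, replaying the Apollo 11 example.
# `model` is a scripted stub standing in for a real completion call.
def model(transcript):
    if "Observation:" not in transcript:
        return ("Thought: I need the launch date; I'll search the web.\n"
                "Action: search_web('Apollo 11 launch date')")
    return "Thought: I have the date.\nFinal Answer: 16 July 1969"

def search_web(query):
    return "16 July 1969"               # stubbed tool result

def react(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        action = re.search(r"Action: (\w+)\('(.+)'\)", step)
        observation = {"search_web": search_web}[action.group(1)](action.group(2))
        transcript += f"Observation: {observation}\n"
    return None

print(react("When did Apollo 11 launch?"))  # 16 July 1969
```

The transcript grows one Thought/Action/Observation triple per step, which is exactly the trace shape that modern models are post-trained on.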

ReAct improved both task accuracy and interpretability over pure chain-of-thought prompting and over pure act-only baselines, particularly on multi-hop question answering and on the ALFWorld embodied benchmark. The pattern has since been absorbed into training: most modern frontier models are post-trained on ReAct-shaped traces so that the alternation of thinking and acting becomes a natural mode rather than a prompted one. Extended-thinking models such as Claude with extended thinking, OpenAI's o-series and DeepSeek-R1 generalise this further by allowing arbitrarily long reasoning between actions, with the agent harness deciding when to interrupt thinking with a tool call.

SWE-Bench: code agents

If there is one benchmark that has tracked the rise of agents in public, it is SWE-Bench. Introduced by Jimenez and colleagues in 2023, SWE-Bench draws real GitHub issues from large open-source Python repositories and asks the agent to produce a patch that resolves the issue and passes the project's hidden test suite. The agent must read the codebase, locate the relevant files, understand the bug, write a fix, and verify it, all the things a junior engineer does on their first ticket.

The progression of scores on SWE-Bench Verified, the human-curated subset, has been stark. In early 2024 the best public systems scored under five per cent. By the middle of 2024 a few crossed twenty per cent. By the end of 2024 the leaders were near fifty per cent. By April 2026, agentic systems built on Claude Sonnet 4 and Opus 4 with extended thinking and computer use, on DeepSeek-R2, and on GPT-5-class models exceed seventy-five per cent pass-at-1. Devin from Cognition, Cursor's agent, Claude Code, OpenHands and Aider sit in the leading cluster, with the gaps between them often within noise from one release to the next.

Three architectural patterns have converged. Conversational coding tools such as Cursor, Claude Code and Copilot Workspace put the developer in the loop on every meaningful step, mediating tool use through chat. Autonomous agents such as Devin and OpenHands run for hours with reduced supervision and present finished work for asynchronous review. Reviewer agents such as CodeRabbit and GitHub Copilot Code Review read pull requests and post comments, slotting into existing developer workflow rather than replacing it. The thick scaffolding of 2023, LangChain-style planner-executor frameworks with elaborate prompt graphs, has largely lost ground to thin scaffolding around a strong reasoning model. The model is smart enough that the scaffolding mostly gets in the way.

Computer use

The natural extension of tool use is to give the model not a curated inventory of APIs but a screen, a mouse and a keyboard. Anthropic's computer use API, released in October 2024, and OpenAI's Operator, released in January 2025, both implement this idea. The model is given a screenshot and decides what to do; it emits an action ("click at (320, 412)", "type 'invoice'", "scroll down"); the host executes it on a virtual desktop, captures a new screenshot, and the loop continues.
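In outline the loop is the same screenshot-in, action-out cycle every turn. Everything in the sketch below is a stub (strings stand in for screen captures, and no real desktop is driven); it only shows the control flow:

```python
# The computer-use loop in outline: screenshot in, action out, executed
# on a virtual desktop. All components here are stubs.
def model(screenshot):
    # Decide an action from the screen. The stub clicks once, then stops.
    if screenshot == "blank_desktop":
        return {"type": "click", "x": 320, "y": 412}
    return {"type": "done"}

def execute(action):
    # Stand-in for driving the virtual desktop; returns the new screen.
    return "dialog_open"

screenshot = "blank_desktop"
actions = []
while True:
    action = model(screenshot)
    if action["type"] == "done":
        break
    actions.append(action)
    screenshot = execute(action)

print(actions)  # [{'type': 'click', 'x': 320, 'y': 412}]
```

The real difficulty is entirely inside `model`: mapping pixels to intent, and intent to coordinates, reliably enough that the loop does not drift.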

Computer use is far more general than function calling because it lets the model use software for which no API exists, including legacy desktop applications and websites that resist scraping. It is also far more error-prone. Screenshots are a high-bandwidth, weakly structured signal; the action space is enormous; and small misjudgements in pixel coordinates produce wrong clicks. By April 2026 computer use is reliable enough for some workflows (data entry, form filling, simple research, repetitive browser automation) but not yet for high-stakes ones such as financial trading, clinical decision-making or unsupervised system administration. Even so, it is the clearest path to general-purpose desktop automation, and progress between releases has been rapid.

Long-horizon limitations

The hardest open problem for agents in 2026 is the long-horizon task. A surgeon-grade diagnostician, a competent project manager, a careful systems administrator: each of these humans threads tens or hundreds of dependent decisions over hours or days, recovering from small errors as they go. Agents struggle here, and the arithmetic is unforgiving. If each step succeeds independently with probability 0.9, then ten steps in a row succeed with probability 0.9^10, which is about 0.35; twenty steps drop to 0.12. Any task whose success requires every intermediate step to be right will collapse rapidly as the chain lengthens.
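The arithmetic in the paragraph above is worth checking directly:

```python
# Success probability of an n-step task where every step must succeed
# independently with per-step probability p.
p = 0.9
for n in (10, 20):
    print(n, round(p ** n, 2))
# 10 0.35
# 20 0.12
```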

In practice the picture is a little better than that worst case, because steps are not independent: a competent agent notices a mistake, backtracks, and tries again. It is also a little worse, because some mistakes are silent and propagate. Real long-horizon failures cluster around four issues: drift, in which the agent forgets the original objective and follows a side-quest; hallucinated success, in which the model declares the task done despite incomplete state; cumulative context dilution, in which earlier evidence is summarised away or pushed out of the window; and tool errors that the agent does not recognise as errors.

Mitigations are an active research area. Memory architectures beyond the context window, such as vector stores of past actions and outcomes or structured external state, help with dilution. Planner-executor splits, in which one model writes a plan and another executes it, help with drift. Self-critique steps, in which the agent is asked after each action whether it actually worked, catch hallucinated success. Tool-error wrappers that translate stack traces into actionable feedback help with silent failures. None of these is a silver bullet, and frontier work in 2026 increasingly treats long-horizon competence as the central capability question rather than a downstream engineering detail.
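The tool-error wrapper, the most mechanical of these mitigations, can be sketched directly. The idea is to convert a raw exception into a short structured message the model can act on, rather than letting a stack trace reach the transcript or a failure pass silently:

```python
# A tool-error wrapper: failures are converted into a short, structured
# message the model can act on, instead of a raw (or silent) exception.
def wrap_tool(name, func):
    def wrapped(**kwargs):
        try:
            return {"ok": True, "result": func(**kwargs)}
        except Exception as e:
            return {
                "ok": False,
                "error": (f"{name} failed: {type(e).__name__}: {e}. "
                          "Check the arguments and try again."),
            }
    return wrapped

read_file = wrap_tool("read_file", lambda path: open(path).read())
print(read_file(path="/no/such/file")["ok"])  # False
```

Because every tool result now carries an explicit `ok` flag, the agent (or a self-critique step) can test for failure instead of having to infer it from free text.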

What you should take away

  1. An agent is a model plus tools plus memory plus a loop, with a budget. Every other detail is engineering around those four ingredients.
  2. Function calling is the standard tool-use protocol. A JSON schema, a structured call, an executed result fed back into the conversation. MCP has emerged as the cross-vendor interface.
  3. ReAct made interleaved reasoning and acting the default pattern. Modern frontier models are now trained on ReAct-shaped traces, with extended thinking generalising the approach.
  4. SWE-Bench tracks real agent progress. Scores moved from under five per cent in early 2024 to over seventy-five per cent by April 2026, driven by stronger base models and thinner scaffolding.
  5. Long-horizon reliability is the open problem. Per-step success compounds badly; mitigations exist but no general solution does, and this is where most frontier agent research now sits.
