Glossary

Browser-Use Agents

Browser-use agents are a class of LLM agent whose action space is the set of operations a human performs on a web page: click, type, scroll, navigate, take a screenshot, extract the DOM. Unlike API-only agents, they can operate on websites that have no API, which is most of the web.

Architectures

Three competing approaches have emerged:

  1. DOM-based: the agent receives a parsed accessibility tree with element IDs (for example, [12] Submit button) and emits actions like click(12). The open-source browser-use library, AgentE, and WebVoyager use this approach. It is token-efficient and reliable on standard pages.

  2. Screenshot-based: the agent receives a screenshot and emits coordinate clicks (x, y). OpenAI's Operator (Jan 2025) and the browser mode of Anthropic's Claude Computer Use work this way. It handles canvas-rendered pages and sites whose DOM is uninformative, but requires strong visual grounding.

  3. Hybrid: the agent receives both the DOM and a screenshot. Most production systems use this approach, including the multimodal browser-use configuration.
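The difference between the DOM-based and screenshot-based action formats can be sketched in a few lines of Python. This is only an illustration: the Element class, serialize, and parse_click are hypothetical names invented here, not the API of browser-use or any other library, and real systems carry far more information per element.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """One interactive node from a (hypothetical) parsed accessibility tree."""
    role: str
    name: str


def serialize(elements: list[Element]) -> str:
    """Render elements as '[id] name role' lines, the text a DOM-based agent reads."""
    return "\n".join(f"[{i}] {el.name} {el.role}" for i, el in enumerate(elements))


def parse_click(action: str, elements: list[Element]):
    """Interpret a click emitted by the model.

    click(12)       -> DOM-based: resolve the ID to an element.
    click(640, 360) -> screenshot-based: pass pixel coordinates through.
    """
    text = action.strip()
    match = re.fullmatch(r"click\((\d+)\)", text)
    if match:
        index = int(match.group(1))
        if index >= len(elements):
            raise ValueError(f"unknown element id {index}")
        return ("element", elements[index])
    match = re.fullmatch(r"click\((\d+),\s*(\d+)\)", text)
    if match:
        return ("point", (int(match.group(1)), int(match.group(2))))
    raise ValueError(f"unrecognised action: {action!r}")


if __name__ == "__main__":
    page = [Element("textbox", "Search"), Element("button", "Submit")]
    print(serialize(page))
    print(parse_click("click(1)", page))
    print(parse_click("click(640, 360)", page))
```

A hybrid agent, under these assumptions, would simply accept both forms and fall back to coordinates when no element ID matches what it wants to click.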

Toolset

A typical browser agent exposes ~10 atomic actions:

click(element_id_or_xy)
type(text, into=element_id)
scroll(direction, amount)
navigate(url)
back()
forward()
extract_text()
screenshot()
wait(seconds)
finish(answer)

The agent loop is a ReAct cycle over these tools, often with a screenshot observation after every action.
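A minimal sketch of such a loop follows, assuming a stand-in browser and a scripted policy in place of a real LLM call. FakeBrowser, scripted_policy, and run_agent are hypothetical names used only for illustration, and only two of the tools listed above are wired up.

```python
class FakeBrowser:
    """Stand-in for a real browser driver; it tracks a URL instead of rendering pages."""

    def __init__(self):
        self.url = "about:blank"

    def navigate(self, url):
        self.url = url
        return f"loaded {url}"

    def extract_text(self):
        return f"text of {self.url}"


def scripted_policy(observation, history):
    """Placeholder for the LLM: returns (thought, tool, args) from a fixed script."""
    script = [
        ("Open the page first.", "navigate", ("https://example.com",)),
        ("Read what is on it.", "extract_text", ()),
    ]
    if len(history) < len(script):
        return script[len(history)]
    return ("I have the text.", "finish", (observation,))


def run_agent(policy, browser, max_steps=10):
    """ReAct-style loop: think, act with one tool, observe, repeat until finish."""
    tools = {"navigate": browser.navigate, "extract_text": browser.extract_text}
    history = []
    observation = "start"
    for _ in range(max_steps):
        thought, tool, args = policy(observation, history)
        if tool == "finish":
            return args[0], history
        observation = tools[tool](*args)  # the result becomes the next observation
        history.append((thought, tool, args, observation))
    return None, history  # step budget exhausted without an answer


if __name__ == "__main__":
    answer, steps = run_agent(scripted_policy, FakeBrowser())
    print(answer)
    print(len(steps), "tool calls")
```

The max_steps cap is a design choice of this sketch: without one, a policy that never emits finish would loop forever.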

Benchmarks

  • WebArena (Zhou et al. 2023): 812 real-world tasks across 6 self-hosted sites (including Reddit, GitLab, and e-commerce). The state-of-the-art success rate rose from 14% (2023) to ~58% (2025).
  • WebVoyager (He et al. 2024): 643 tasks on live websites.
  • OSWorld: a broader, OS-level benchmark that overlaps with computer use.

Production deployments

  • OpenAI Operator (Jan 2025): a consumer-facing browser agent for Pro subscribers. It uses a GPT-4o derivative trained for computer use.
  • Claude Computer Use (Oct 2024): generic computer control, including the browser; see claude-computer-use.
  • Adept ACT-1 (precursor, 2022): an early demonstration of browser-driving from natural language.

Challenges

  • Login walls and captchas: agents cannot solve captchas without an explicit handover to a human.
  • Rate limits and bot detection: Cloudflare and hCaptcha increasingly fingerprint headless Chromium.
  • Long-horizon coherence: errors accumulate quickly over a 50-step task with 10 page loads.
  • Cost: every step is a screenshot plus a multimodal LLM call, at $0.05–0.20 per step.
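The last two points can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability; the 98% and 95% per-step figures are hypothetical, not measured values. The per-step cost range is the one quoted above.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    return per_step_success ** steps


def task_cost_range(steps: int, low: float = 0.05, high: float = 0.20):
    """Total cost bounds in dollars for a task, given a per-step cost range."""
    return steps * low, steps * high


if __name__ == "__main__":
    for p in (0.98, 0.95):  # hypothetical per-step success rates
        print(f"per-step {p:.0%}: 50-step success {task_success_probability(p, 50):.1%}")
    low, high = task_cost_range(50)
    print(f"50-step cost: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, a 98% per-step success rate gives only about a 36% chance of finishing a 50-step task without error, and the same task costs between $2.50 and $10.00.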

Relationship

Browser-use is the specialisation of tool use to a single high-leverage tool, the browser. It overlaps heavily with computer use, which generalises to the entire desktop.

Related terms: Computer-Use Agents, Claude 3.5 Sonnet Computer Use, Tool Use, ReAct

This site is currently in Beta. Contact: Chris Paton

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).