Glossary

Computer-Use Agents

Computer-use agents are the most general form of tool-using AI: their action space is the entire desktop. They see what the user sees (screenshots) and act with the same atomic primitives as a human (mouse moves, clicks, key presses). Anything a person can do in front of a screen, they can in principle do.

The breakthrough: Claude 3.5 Computer Use

Anthropic launched the "computer use" beta for Claude 3.5 Sonnet on 22 October 2024. The model was post-trained on screenshots paired with cursor coordinates and keyboard events, so it can carry out prompts like "Open Excel, sum column B, and email the result to Jane" by:

  1. Taking a screenshot.
  2. Identifying the Excel icon's (x, y) pixel location.
  3. Emitting mouse_move(123, 456); click().
  4. Taking another screenshot to verify.
  5. Iterating.
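
The steps above amount to an observe, decide, act loop. The following is a minimal sketch of that loop, not the Anthropic SDK: the Desktop interface, the Policy function, FakeDesktop, and scriptedPolicy are all hypothetical names invented for illustration, and the "screenshot" is a plain string so the loop can run without a real screen or model.

```typescript
// Hypothetical types for illustration only; not part of any real SDK.
type Action =
  | { kind: "mouse_move"; x: number; y: number }
  | { kind: "click" }
  | { kind: "done" };

interface Desktop {
  screenshot(): string;        // stand-in for a real image
  apply(action: Action): void; // performs one primitive
}

type Policy = (screenshot: string, goal: string) => Action;

function runAgent(desktop: Desktop, policy: Policy, goal: string, maxSteps = 20): Action[] {
  const trace: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = desktop.screenshot();  // steps 1 and 4: observe
    const action = policy(shot, goal);  // step 2: decide where to act
    if (action.kind === "done") break;  // the policy judges the goal reached
    desktop.apply(action);              // step 3: act
    trace.push(action);
  }
  return trace;
}

// Toy desktop: "Excel" opens when a click lands exactly on (123, 456).
class FakeDesktop implements Desktop {
  private cursor: [number, number] = [0, 0];
  private excelOpen = false;

  screenshot(): string {
    return this.excelOpen ? "excel-open" : `cursor@${this.cursor[0]},${this.cursor[1]}`;
  }

  apply(action: Action): void {
    if (action.kind === "mouse_move") {
      this.cursor = [action.x, action.y];
    } else if (action.kind === "click" && this.cursor[0] === 123 && this.cursor[1] === 456) {
      this.excelOpen = true;
    }
  }
}

// Scripted stand-in for the model: move to the icon, click, then stop.
const scriptedPolicy: Policy = (shot) => {
  if (shot === "excel-open") return { kind: "done" };
  if (shot === "cursor@123,456") return { kind: "click" };
  return { kind: "mouse_move", x: 123, y: 456 };
};

const demoTrace = runAgent(new FakeDesktop(), scriptedPolicy, "Open Excel");
console.log(demoTrace.map((a) => a.kind).join(" -> ")); // mouse_move -> click
```

The maxSteps cap is a design choice in this sketch: without it, a policy that never returns "done" would loop forever.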

See Claude 3.5 Sonnet Computer Use.

Action API

The October 2024 version of the Anthropic computer-use tool exposes the following actions:

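type ComputerAction =
  | { action: "key"; text: string }                              // key or key combination, e.g. "ctrl+c"
  | { action: "type"; text: string }                             // literal text to type
  | { action: "mouse_move"; coordinate: [number, number] }       // [x, y] in screen pixels
  | { action: "left_click" }                                     // clicks at the current cursor position
  | { action: "left_click_drag"; coordinate: [number, number] }  // drag from the cursor to [x, y]
  | { action: "right_click" }
  | { action: "middle_click" }
  | { action: "double_click" }
  | { action: "screenshot" }                                     // returns an image of the screen
  | { action: "cursor_position" }                                // returns the current [x, y]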
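
One way to see how such actions reach a desktop is to translate each one into arguments for xdotool, the tool used by the reference implementation mentioned under Sandbox. The sketch below is an assumption about how that mapping could look, not the reference implementation's code: toXdotoolArgs is a hypothetical helper, the exact flags a real deployment passes may differ, and the screenshot action is left out because xdotool does not capture images. The type is repeated so the block is self-contained.

```typescript
type Coordinate = [number, number];

type ComputerAction =
  | { action: "key"; text: string }
  | { action: "type"; text: string }
  | { action: "mouse_move"; coordinate: Coordinate }
  | { action: "left_click" }
  | { action: "left_click_drag"; coordinate: Coordinate }
  | { action: "right_click" }
  | { action: "middle_click" }
  | { action: "double_click" }
  | { action: "screenshot" }
  | { action: "cursor_position" };

const xy = (c: Coordinate): string[] => [String(c[0]), String(c[1])];

// Hypothetical helper: returns xdotool arguments for one action,
// or null when the action is not something xdotool performs.
function toXdotoolArgs(a: ComputerAction): string[] | null {
  switch (a.action) {
    case "key":
      return ["key", a.text];                 // e.g. xdotool key ctrl+c
    case "type":
      return ["type", "--", a.text];          // "--" stops option parsing
    case "mouse_move":
      return ["mousemove", ...xy(a.coordinate)];
    case "left_click":
      return ["click", "1"];                  // X11 button 1 = left
    case "middle_click":
      return ["click", "2"];
    case "right_click":
      return ["click", "3"];
    case "double_click":
      return ["click", "--repeat", "2", "1"];
    case "left_click_drag":
      // press, move to the target, release (xdotool chains commands)
      return ["mousedown", "1", "mousemove", ...xy(a.coordinate), "mouseup", "1"];
    case "cursor_position":
      return ["getmouselocation"];
    case "screenshot":
      return null;                            // needs a separate capture tool
  }
}

console.log(toXdotoolArgs({ action: "key", text: "ctrl+c" }));
console.log(toXdotoolArgs({ action: "mouse_move", coordinate: [123, 456] }));
```

Returning an argument array rather than a single command string avoids shell quoting problems when the typed text contains spaces or quotes.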

Sandbox

Production deployments run the agent inside an isolated environment (a container or virtual machine) for safety:

  • Anthropic reference implementation: Docker container with Xvfb + xdotool.
  • OpenAI Operator: a remote browser running in a cloud sandbox.
  • Devin: full Linux VM with terminal, browser, and IDE.

In these setups the agent never touches the user's real machine; any output (e.g. a generated file) must be exported from the sandbox explicitly.
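
As an illustration of explicit export, a host-side wrapper might copy out only files that sit inside one designated directory of the sandbox. The sketch below assumes that design; the function names and the /sandbox/out path are hypothetical, and the check works on path strings only, so it does not account for symlinks.

```typescript
// Resolve "." and ".." in a POSIX-style path without touching the filesystem.
function normalizeSegments(path: string): string[] {
  const out: string[] = [];
  for (const seg of path.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return out;
}

// True only when `candidate` is strictly inside `exportDir`.
function isInsideExportDir(candidate: string, exportDir: string): boolean {
  const dir = normalizeSegments(exportDir);
  const file = normalizeSegments(candidate);
  return file.length > dir.length && dir.every((seg, i) => file[i] === seg);
}

// Hypothetical export directory, for illustration only.
const EXPORT_DIR = "/sandbox/out";

console.log(isInsideExportDir("/sandbox/out/report.xlsx", EXPORT_DIR));      // true
console.log(isInsideExportDir("/sandbox/out/../secrets/key", EXPORT_DIR));   // false
```

Comparing whole path segments rather than string prefixes keeps a sibling directory such as /sandbox/outside from passing the check.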

Benchmarks

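  • OSWorld (Xie et al. 2024): 369 real desktop tasks, mostly on Ubuntu, in an environment that also supports Windows and macOS. Claude 3.5 Sonnet scored 14.9% in Oct 2024; the reported state of the art reached about 38% in early 2025 (OpenAI's CUA model).
  • WindowsAgentArena: Windows-specific desktop tasks.
  • VisualWebArena: visually grounded web tasks; overlaps with browser-use benchmarks.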
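
To put the OSWorld figure in concrete terms, the arithmetic below converts between a success rate and a task count. It assumes every task is scored pass/fail and weighted equally, which is a simplification of how the benchmark may aggregate scores; both function names are made up for this example.

```typescript
// Success rate as a percentage, assuming equal-weight pass/fail tasks.
function successRate(passed: number, total: number): number {
  if (total <= 0) throw new RangeError("total must be positive");
  return (passed / total) * 100;
}

// Approximate number of tasks solved at a given percentage.
function tasksSolved(ratePercent: number, total: number): number {
  return Math.round((ratePercent / 100) * total);
}

const OSWORLD_TASKS = 369;

console.log(tasksSolved(14.9, OSWORLD_TASKS));            // 55
console.log(successRate(55, OSWORLD_TASKS).toFixed(1));   // "14.9"
```

Under these assumptions, a 14.9% score corresponds to roughly 55 of the 369 tasks.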

Productisations (2024–2025)

System               Date      Notes
Devin (Cognition)    Mar 2024  First mainstream demo; full SWE workflow
Claude Computer Use  Oct 2024  First frontier-model API
OpenAI Operator      Jan 2025  Consumer, browser-focused
Claude 4.5 + 4.7     2025      Significantly better grounding, used by Claude Code

Limitations

  • Visual grounding errors: a click that lands 5 px to the left of a button misses it, and that single miss can fail the whole task.
  • Latency: each step needs a screenshot plus a multimodal forward pass, so a single click takes 5–15 s.
  • Security: prompt injection from a malicious web page can hijack the agent (the "agent hijacking" class of attack).
  • Lack of determinism: UI changes break recorded traces, so agents must re-plan from the current screen instead of replaying a script.
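
The first two limitations can be made concrete with a little arithmetic. The sketch below uses a hypothetical button rectangle and a hypothetical 20-step task; only the 5 px offset and the 5–15 s per-step range come from the text above.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// True when the click lands inside the target's bounding box.
function hits(target: Rect, click: [number, number]): boolean {
  const [cx, cy] = click;
  return (
    cx >= target.x && cx < target.x + target.width &&
    cy >= target.y && cy < target.y + target.height
  );
}

// Best-case and worst-case wall-clock time for a task, in seconds.
function latencyRangeSeconds(
  steps: number,
  perStep: [number, number] = [5, 15],
): [number, number] {
  return [steps * perStep[0], steps * perStep[1]];
}

// Hypothetical 80 x 24 px button whose left edge is at x = 100.
const okButton: Rect = { x: 100, y: 200, width: 80, height: 24 };

console.log(hits(okButton, [102, 210])); // true: just inside the left edge
console.log(hits(okButton, [97, 210]));  // false: 5 px further left misses
console.log(latencyRangeSeconds(20));    // [100, 300]
```

At 5–15 s per step, even this modest 20-step task takes between about 1.7 and 5 minutes.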

Relationship

Computer use subsumes browser use and adds desktop apps, terminals, and IDEs. It is the substrate beneath autonomous SWE agents such as Devin, OpenHands, and OpenAI Codex (2025).

Related terms: Browser-Use Agents, Claude 3.5 Sonnet Computer Use, Tool Use, Devin / AI Software Engineer, OpenHands

This site is currently in Beta. Contact: Chris Paton

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).