Computer-Use Agents, Glossary, Textbook of AI

Computer-use agents are the most general form of tool-using AI: their action space is the entire desktop. They see what the user sees (screenshots) and act with the same atomic primitives as a human (mouse moves, clicks, key presses). Anything a person can do in front of a screen, they can in principle do.

The breakthrough: Claude 3.5 Computer Use

Anthropic launched Claude 3.5 Sonnet "computer use" beta on 22 October 2024. The model was post-trained on screenshots paired with cursor coordinates and keyboard events, so it can answer prompts like "Open Excel, sum column B, and email the result to Jane" by:

Taking a screenshot.
Identifying the Excel icon's (x, y) pixel location.
Emitting mouse_move(123, 456); click().
Taking another screenshot to verify.
Iterating.

See claude-computer-use.

Action API

The Anthropic computer-use tool exposes:

type ComputerAction =
  | { action: "key"; text: string }              # e.g. "ctrl+c"
  | { action: "type"; text: string }
  | { action: "mouse_move"; coordinate: [int,int] }
  | { action: "left_click" }
  | { action: "left_click_drag"; coordinate: [int,int] }
  | { action: "right_click" }
  | { action: "middle_click" }
  | { action: "double_click" }
  | { action: "screenshot" }
  | { action: "cursor_position" }

Sandbox

Production deployments run inside a virtual machine for safety:

Anthropic reference impl, Docker container with Xvfb + xdotool.
OpenAI Operator, virtual Chromebook in cloud sandbox.
Devin, full Linux VM with terminal, browser, IDE.

The agent never touches the user's real machine; output (e.g. a generated file) is exfiltrated explicitly.

Benchmarks

OSWorld (Xie et al. 2024), 369 real desktop tasks on Ubuntu, Windows, macOS. Claude 3.5 hit 14.9% in Oct 2024; SOTA reached ~38% by mid-2025.
WindowsAgentArena, Windows-specific.
VisualWebArena, overlap with browser-use.

Productisations (2024–2025)

System	Date	Notes
Devin (Cognition)	Mar 2024	First mainstream demo; full SWE workflow
Claude Computer Use	Oct 2024	First frontier-model API
OpenAI Operator	Jan 2025	Consumer browser-focused
Claude 4.5 + 4.7	2025	Significantly better grounding, used by Claude Code

Limitations

Visual grounding errors, clicking 5 px to the left of a button is a complete task failure.
Latency, each step is a screenshot + multimodal forward pass; 5–15 s per click.
Security, prompt injection from a malicious web page can hijack the agent (the "agent hijacking" class of attack).
Determinism, UI changes break recorded traces; agents must re-plan.

Relationship

Computer-use subsumes browser-use and adds desktop apps, terminals, IDEs. It is the substrate beneath autonomous SWE agents like Devin, OpenHands, and OpenAI Codex (2025).

Discussed in:

Chapter 15: Modern AI, Modern AI

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).