Computer-use agents are the most general form of tool-using AI: their action space is the entire desktop. They see what the user sees (screenshots) and act with the same atomic primitives as a human (mouse moves, clicks, key presses). Anything a person can do in front of a screen, they can in principle do.
The breakthrough: Claude 3.5 Computer Use
Anthropic launched the "computer use" beta for Claude 3.5 Sonnet on 22 October 2024. The model was post-trained on screenshots paired with cursor coordinates and keyboard events, so it can handle prompts like "Open Excel, sum column B, and email the result to Jane" by:
- Taking a screenshot.
- Identifying the Excel icon's `(x, y)` pixel location.
- Emitting `mouse_move(123, 456); click()`.
- Taking another screenshot to verify.
- Iterating.
See claude-computer-use.
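The screenshot, act, verify cycle described above can be sketched as a small loop. This is a minimal illustration, not Anthropic's implementation: the `Desktop` and `Policy` interfaces, the `"done"` stop signal, and `runTask` are all hypothetical names, and the loop is kept synchronous for brevity where a real harness would be asynchronous.

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type Action =
  | { action: "screenshot" }
  | { action: "mouse_move"; coordinate: [number, number] }
  | { action: "left_click" }
  | { action: "done"; summary: string }; // assumed stop signal

interface Desktop {
  screenshot(): string;          // returns an encoded image, e.g. a base64 PNG
  perform(action: Action): void; // executes one mouse or keyboard primitive
}

interface Policy {
  // Given the task and the latest screenshot, propose the next action.
  next(task: string, screenshot: string): Action;
}

function runTask(task: string, desktop: Desktop, policy: Policy, maxSteps = 20): string {
  for (let step = 0; step < maxSteps; step++) {
    const shot = desktop.screenshot();      // observe the current screen
    const action = policy.next(task, shot); // decide on the next primitive
    if (action.action === "done") return action.summary;
    if (action.action !== "screenshot") desktop.perform(action); // act
    // The next iteration takes a fresh screenshot, which verifies the effect.
  }
  throw new Error(`gave up after ${maxSteps} steps`);
}
```

The step cap is a design choice in this sketch: without it, a policy that never signals completion would loop indefinitely.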
Action API
The Anthropic computer-use tool exposes:
```typescript
// Discriminated union of the actions the tool accepts.
type ComputerAction =
  | { action: "key"; text: string } // key combination, e.g. "ctrl+c"
  | { action: "type"; text: string } // literal text to type
  | { action: "mouse_move"; coordinate: [number, number] } // [x, y] in pixels
  | { action: "left_click" }
  | { action: "left_click_drag"; coordinate: [number, number] } // drag to [x, y]
  | { action: "right_click" }
  | { action: "middle_click" }
  | { action: "double_click" }
  | { action: "screenshot" }
  | { action: "cursor_position" };
```
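One way to see how these actions reach a real desktop is to map each one to command-line invocations. The sketch below assumes an X11 environment with `xdotool` installed and uses `scrot` with a made-up output path for screenshots. The function name `toCommands` is hypothetical, and this is not the reference implementation's code.

```typescript
type ComputerAction =
  | { action: "key"; text: string }
  | { action: "type"; text: string }
  | { action: "mouse_move"; coordinate: [number, number] }
  | { action: "left_click" }
  | { action: "left_click_drag"; coordinate: [number, number] }
  | { action: "right_click" }
  | { action: "middle_click" }
  | { action: "double_click" }
  | { action: "screenshot" }
  | { action: "cursor_position" };

// Returns one argv array per command, in the order they should run.
function toCommands(a: ComputerAction): string[][] {
  switch (a.action) {
    case "key":
      return [["xdotool", "key", a.text]];
    case "type":
      // "--" stops option parsing so text starting with "-" is typed literally.
      return [["xdotool", "type", "--", a.text]];
    case "mouse_move":
      return [["xdotool", "mousemove", String(a.coordinate[0]), String(a.coordinate[1])]];
    case "left_click":
      return [["xdotool", "click", "1"]];
    case "middle_click":
      return [["xdotool", "click", "2"]];
    case "right_click":
      return [["xdotool", "click", "3"]];
    case "double_click":
      return [["xdotool", "click", "--repeat", "2", "1"]];
    case "left_click_drag":
      // Press at the current position, move to the target, then release.
      return [
        ["xdotool", "mousedown", "1"],
        ["xdotool", "mousemove", String(a.coordinate[0]), String(a.coordinate[1])],
        ["xdotool", "mouseup", "1"],
      ];
    case "cursor_position":
      return [["xdotool", "getmouselocation"]];
    case "screenshot":
      // Assumption: scrot is available; the output path is illustrative only.
      return [["scrot", "/tmp/screenshot.png"]];
  }
}
```

Returning argv arrays rather than shell strings avoids quoting problems when the typed text contains spaces or shell metacharacters.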
Sandbox
Production deployments run inside a virtual machine for safety:
- Anthropic reference implementation: Docker container with Xvfb + xdotool.
- OpenAI Operator: virtual Chromebook in a cloud sandbox.
- Devin: full Linux VM with terminal, browser, and IDE.
The agent never touches the user's real machine; any output (e.g. a generated file) must be copied out of the sandbox explicitly.
Benchmarks
- OSWorld (Xie et al. 2024): 369 real desktop tasks on Ubuntu, Windows, and macOS. Claude 3.5 scored 14.9% in Oct 2024; the state of the art reached ~38% by mid-2025.
- WindowsAgentArena: Windows-specific.
- VisualWebArena: overlaps with browser-use benchmarks.
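To put the OSWorld percentages in concrete terms, the snippet below converts a success rate into an approximate task count. It assumes every task is scored as a plain pass or fail, which is a simplification, and `approxTasksSolved` is a hypothetical helper.

```typescript
const OSWORLD_TASKS = 369;

// Approximate number of tasks solved, assuming pass/fail scoring per task.
function approxTasksSolved(successRate: number, totalTasks: number = OSWORLD_TASKS): number {
  return Math.round(successRate * totalTasks);
}

console.log(approxTasksSolved(0.149)); // 55
console.log(approxTasksSolved(0.38));  // 140
```

Under that assumption, the jump from 14.9% to ~38% corresponds to roughly 85 additional tasks out of 369.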
Productisations (2024–2025)
| System | Date | Notes |
|---|---|---|
| Devin (Cognition) | Mar 2024 | First mainstream demo; full SWE workflow |
| Claude Computer Use | Oct 2024 | First frontier-model API |
| OpenAI Operator | Jan 2025 | Consumer browser-focused |
| Claude 4.5 + 4.7 | 2025 | Significantly better grounding, used by Claude Code |
Limitations
- Visual grounding errors: a click that lands 5 px to the left of a button can fail the entire task.
- Latency: each step requires a screenshot plus a multimodal forward pass, so a single click takes 5–15 s.
- Security: prompt injection from a malicious web page can hijack the agent (the "agent hijacking" class of attacks).
- Determinism: UI changes break recorded traces, so agents must re-plan.
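The first two limitations can be made concrete with a little arithmetic. The sketch below hit-tests a click against a button's bounding box and estimates the wall-clock range for a multi-step task using the 5–15 s per-step figure above. The button geometry, the 30-step task length, and both function names are illustrative assumptions.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// True if the point (px, py) falls inside the rectangle.
function hits(rect: Rect, px: number, py: number): boolean {
  return px >= rect.x && px < rect.x + rect.width &&
         py >= rect.y && py < rect.y + rect.height;
}

// Best-case and worst-case total latency in seconds for a task of `steps` actions.
function latencyRangeSeconds(steps: number, minPerStep = 5, maxPerStep = 15): [number, number] {
  return [steps * minPerStep, steps * maxPerStep];
}

const button: Rect = { x: 100, y: 200, width: 80, height: 24 }; // made-up geometry

console.log(hits(button, 100, 210)); // true: on the button's left edge
console.log(hits(button, 95, 210));  // false: 5 px to the left misses entirely
console.log(latencyRangeSeconds(30)); // [ 150, 450 ]
```

A 30-step task therefore takes between 2.5 and 7.5 minutes at these per-step latencies, before counting any retries caused by missed clicks.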
Relationship
Computer-use subsumes browser-use and adds desktop apps, terminals, and IDEs. It is the substrate beneath autonomous SWE agents such as Devin, OpenHands, and OpenAI Codex (2025).
Related terms: Browser-Use Agents, Claude 3.5 Sonnet Computer Use, Tool Use, Devin / AI Software Engineer, OpenHands
Discussed in:
- Chapter 15: Modern AI