Claude Vision is the image-input capability shipped with Anthropic's Claude 3 model family in March 2024 and carried through Claude 3.5 Sonnet (June 2024), Claude 3.5 Haiku, and Claude 4 (2025). It accepts arbitrary images alongside text in the same prompt, supporting up to 20 images per message and resolutions up to roughly 1568 pixels on the long edge before automatic downscaling.
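Anthropic has not published the exact resampling it applies, but the long-edge limit implies an aspect-preserving downscale. A minimal sketch (the helper name `fit_long_edge` is illustrative, not part of any API):

```python
def fit_long_edge(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Dimensions an oversized image would be scaled to if its long edge
    exceeds max_edge, preserving aspect ratio (sketch, not Anthropic's code)."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # already within the limit; no resizing
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

print(fit_long_edge(3136, 1568))  # (1568, 784)
print(fit_long_edge(800, 600))    # (800, 600) -- untouched
```

Pre-resizing images client-side to this bound avoids paying upload bandwidth for pixels the service would discard anyway.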
Capabilities. Claude Vision performs the standard VLM tasks (captioning, VQA, OCR, chart and table reading) and additionally emphasises:
- Document understanding. Extracts structured data from forms, invoices, receipts, and financial filings, including handwritten annotations.
- Image-grounded reasoning. Solves geometry problems from sketches, debugs UI mockups, interprets architectural floor plans.
- Code from screenshots. Generates HTML, React, or SwiftUI matching a screenshot or hand-drawn sketch, a capability that motivated the Claude 3.5 Artifacts feature.
- Safety-aware refusal. Declines to identify private individuals in photos or to read biometric identifiers, behaving more conservatively than GPT-4V.
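Document understanding in practice means interleaving image and text content blocks in one user message. A sketch of the request body in the Messages API format (the model string matches a real release; the prompt text and function name are illustrative):

```python
import base64


def build_invoice_request(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Build a Messages API request body asking Claude to extract
    structured data from an invoice image (illustrative sketch)."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            # Content blocks interleave images and text in a single turn.
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text",
                 "text": "Extract vendor, date, line items, and total as JSON."},
            ],
        }],
    }


payload = build_invoice_request(b"\x89PNG...")  # placeholder bytes, not a real PNG
print(payload["messages"][0]["content"][0]["type"])  # image
```

Up to 20 such image blocks can share one message, so multi-page documents fit in a single request.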
Architectural notes. Anthropic has not published Claude's architecture. Public statements indicate Claude Vision uses a patch-token approach similar to LLaVA: a vision encoder produces patch embeddings, which are projected into the language model's token stream. The Claude 3 series shares one architecture across Opus / Sonnet / Haiku, scaling parameter count and training compute.
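The LLaVA-style pattern can be sketched in a few lines of NumPy. All sizes below are hypothetical, chosen only to make the shapes concrete; Claude's actual dimensions are unpublished:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 image patches with 1024-d encoder outputs,
# projected into a 4096-d language-model embedding space.
n_patches, d_vision, d_model = 256, 1024, 4096

patch_embeddings = rng.standard_normal((n_patches, d_vision))  # vision encoder output
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02       # learned linear projection

image_tokens = patch_embeddings @ W_proj          # (256, 4096): image as "soft tokens"
text_tokens = rng.standard_normal((12, d_model))  # embedded text prompt

# The LM then consumes one interleaved sequence of image and text tokens.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (268, 4096)
```

Once projected, image tokens are indistinguishable to the transformer from text tokens, which is why the same model handles 20 images or none.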
Foundation for Computer Use. In October 2024, Anthropic released Claude Computer Use (claude-3-5-sonnet-20241022), the first publicly available frontier model trained to act on screenshots: given a desktop screenshot and a goal, the model emits structured tool calls (mouse_move, left_click, type, key) that drive a real or virtual machine. This required Claude Vision to interpret arbitrary GUIs, including pixel-precise coordinates, an elevation of vision from passive understanding to action. Claude 4 extended this as the Computer Use API with improved reliability and safety controls.
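The agent loop on the client side reduces to routing each model-emitted tool_use block to an action handler. A sketch under the assumption of the action names above; the handlers here just log, where a real agent would move a cursor (the `executed` log and `dispatch` helper are illustrative):

```python
from typing import Any, Callable

# Hypothetical log of executed actions; a real agent would drive a VM instead.
executed: list[str] = []

handlers: dict[str, Callable[..., None]] = {
    "mouse_move": lambda x, y: executed.append(f"move to ({x}, {y})"),
    "left_click": lambda: executed.append("click"),
    "type": lambda text: executed.append(f"type {text!r}"),
    "key": lambda key: executed.append(f"press {key}"),
}


def dispatch(tool_use: dict[str, Any]) -> None:
    """Route one model-emitted tool_use block to the matching handler."""
    kwargs = dict(tool_use["input"])     # copy so the block stays intact
    action = kwargs.pop("action")
    handlers[action](**kwargs)


# Simplified tool_use blocks in the shape the Messages API returns.
for block in [
    {"name": "computer", "input": {"action": "mouse_move", "x": 420, "y": 88}},
    {"name": "computer", "input": {"action": "left_click"}},
    {"name": "computer", "input": {"action": "type", "text": "hello"}},
]:
    dispatch(block)

print(executed)  # ["move to (420, 88)", "click", "type 'hello'"]
```

After each action the agent screenshots the result and sends it back, closing the perceive-act loop that makes pixel-precise vision necessary.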
Comparison with peers. On standard multimodal benchmarks (MMMU, MathVista, ChartQA), Claude 3.5 Sonnet and Claude 4 Opus trade blows with GPT-4o and Gemini 2.5 Pro; specific rankings shift across releases. In computer-use evaluations (OSWorld, WebArena), Claude Computer Use was the first frontier model to post non-trivial success rates, and remains a reference point.
Related terms: Claude 3.5 Sonnet Computer Use, Vision-Language Model, GPT-4V and GPT-4o Vision, Gemini Multimodal, Constitutional AI, Claude
Discussed in:
- Chapter 11: CNNs, Vision-Language Models