Glossary

Claude Vision

Claude Vision is the image-input capability shipped with Anthropic's Claude 3 model family in March 2024 and carried through Claude 3.5 Sonnet (June 2024), Claude 3.5 Haiku, and Claude 4 (2025). It accepts arbitrary images alongside text in the same prompt, supporting up to 20 images per message and resolutions up to roughly $1568$ pixels on the long edge before automatic downscaling.

Capabilities. Claude Vision performs the standard VLM tasks, captioning, VQA, OCR, chart and table reading, and additionally emphasises:

  • Document understanding. Extracts structured data from forms, invoices, receipts, and financial filings, including handwritten annotations.
  • Image-grounded reasoning. Solves geometry problems from sketches, debugs UI mockups, interprets architectural floor plans.
  • Code from screenshots. Generates HTML, React, or SwiftUI matching a screenshot or hand-drawn sketch, a capability that motivated the Claude 3.5 Artifacts feature.
  • Safety-aware refusal. Declines to identify private individuals from photos and to read biometric identifiers, more conservatively than GPT-4V.

Architectural notes. Anthropic has not published Claude's architecture. Public statements indicate Claude Vision uses a patch-token approach similar to LLaVA: a vision encoder produces patches, which are projected into the language-model token stream. The Claude 3 series shares one architecture across Opus / Sonnet / Haiku, scaling parameter count and training compute.

Foundation for Computer Use. In October 2024, Anthropic released Claude Computer Use (claude-3-5-sonnet-20241022), the first publicly available frontier model trained to act on screenshots: given a desktop screenshot and a goal, the model emits structured tool calls (mouse_move, left_click, type_text, key) that drive a real or virtual machine. This required Claude Vision to interpret arbitrary GUIs, including pixel-precise coordinates, an elevation of vision from passive understanding to action. Claude 4 extended this to Computer Use API with improved reliability and safety controls.

Comparison with peers. On standard multimodal benchmarks (MMMU, MathVista, ChartQA), Claude 3.5 Sonnet and Claude 4 Opus trade blows with GPT-4o and Gemini 2.5 Pro; specific rankings shift across releases. In computer-use evaluations (OSWorld, WebArena), Claude Computer Use was the first frontier model to post non-trivial success rates, and remains a reference point.

Related terms: Claude 3.5 Sonnet Computer Use, Vision-Language Model, GPT-4V and GPT-4o Vision, Gemini Multimodal, Constitutional AI, Claude

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).