GPT-4V ("GPT-4 with vision") is the multimodal upgrade to GPT-4, released by OpenAI in September 2023, which added image input to a model that previously accepted only text. GPT-4o ("o" for "omni"), released in May 2024, is a successor trained end-to-end with image, audio, and text as native modalities; it is roughly half the cost and twice the speed of GPT-4 Turbo with vision.
Capabilities. GPT-4V/4o set the de facto bar for production multimodal capability:
- Document and chart understanding. Read PDFs, extract tables, interpret financial charts, parse handwriting.
- Screenshot reasoning. Read web pages, mobile UIs, and IDE screenshots, then describe or act on them. This is the foundation of OpenAI's Operator agent and competing systems like Claude Computer Use.
- Visual grounding. Identify objects, count instances (with known limitations), describe scenes, identify text in images including handwriting and non-Latin scripts.
- Visual reasoning. Answer multi-step questions about diagrams, including physics problems and circuit diagrams.
GPT-4o multimodality. GPT-4o accepts image, text, and audio input and can emit text and audio output, with average end-to-end audio response latency around 320 ms. Audio is reportedly tokenised with a learned codec and those tokens are placed directly in the same context as text and image tokens; this lets the model preserve prosody, laughter, and singing in ways that pipelined ASR + LLM + TTS systems cannot. Image input is tiled and processed at multiple resolutions.
Architecture (inferred). OpenAI has not published GPT-4o's architecture, but external evidence and the Chameleon paper from Meta suggest a single transformer trained on interleaved image, audio, and text tokens, with modality-specific tokenisers (a ViT-style patch encoder for images, a learned audio codec, a BPE tokeniser for text). This is structurally similar to Gemini's native-multimodal design.
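The inferred single-transformer design can be illustrated with toy modality-specific tokenisers that map into disjoint ranges of one shared vocabulary, so a single context interleaves all three modalities. Everything below (offsets, tokeniser stand-ins) is invented for illustration; GPT-4o's actual tokenisers are unpublished:

```python
# Toy sketch of a shared-vocabulary, interleaved-modality context.
# Offsets and tokenisers are illustrative stand-ins, not OpenAI's.

TEXT_OFFSET, IMAGE_OFFSET, AUDIO_OFFSET = 0, 100_000, 150_000

def text_tokens(s: str) -> list[int]:
    # Stand-in for a BPE tokeniser: one token per character.
    return [TEXT_OFFSET + ord(c) for c in s]

def image_tokens(patch_codes: list[int]) -> list[int]:
    # Stand-in for a ViT-style patch encoder emitting discrete codes.
    return [IMAGE_OFFSET + p for p in patch_codes]

def audio_tokens(codec_codes: list[int]) -> list[int]:
    # Stand-in for a learned audio codec.
    return [AUDIO_OFFSET + c for c in codec_codes]

# One flat token sequence: the transformer attends across
# modality boundaries exactly as it does within text.
context = (text_tokens("Describe: ")
           + image_tokens([3, 17, 42])
           + audio_tokens([7, 7, 1]))
```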
API and pricing. Images are billed as a base cost plus a per-tile cost: in high-detail mode, an 85-token base plus 170 tokens per $512 \times 512$ tile; in low-detail mode, a flat 85 tokens per image. Images are downscaled to fit within approximately $2048 \times 2048$ before tiling.
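The tiling scheme above can be sketched as a token-cost estimator following OpenAI's published scaling rules (fit within a $2048 \times 2048$ square, downscale the shortest side to 768 px, then count $512 \times 512$ tiles); exact figures may vary by model version:

```python
import math

def image_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image under the tiling rules."""
    if detail == "low":
        return 85  # flat cost, no tiling
    # 1. Downscale to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2. Downscale so the shortest side is at most 768 px.
    if min(w, h) > 768:
        r = 768 / min(w, h)
        w, h = w * r, h * r
    # 3. Each 512 x 512 tile costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

# e.g. a 1024 x 1024 image in high detail is scaled to 768 x 768,
# covered by 4 tiles: 85 + 4 * 170 = 765 tokens.
```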
Limitations. Hallucinated objects, weak fine-grained counting, occasional refusals on benign images (e.g. those containing faces or license plates), and degraded performance on dense small text remain. GPT-4V cannot generate images directly; image generation is delegated to DALL-E 3 via tool use.
Related terms: Vision-Language Model, Claude 3.5 Sonnet Computer Use, Gemini 2.x, ChatGPT
Discussed in:
- Chapter 11: CNNs, Vision-Language Models