Glossary

Vision-Language Model

A vision-language model (VLM) is a neural network that takes images (and optionally video) together with text as input and produces text (and optionally images) as output. VLMs unify computer vision and natural language processing by mapping pixels and tokens into a shared representation that a transformer can attend over. They have replaced earlier task-specific captioning, VQA, and OCR systems, and underpin modern multimodal assistants such as GPT-4V, Claude Vision, and Gemini.

Architectural families. VLMs cluster into three architectural patterns:

  1. Cross-attention bridge (Flamingo, IDEFICS). A pretrained language model is frozen, and gated cross-attention layers are inserted between its blocks. The cross-attention attends from the LM hidden states (queries) to a small set of image tokens (keys/values) produced by a vision encoder; a tanh gate initialised at zero lets vision blend in gradually without disturbing the pretrained weights. This preserves the LM's capabilities and adds vision as a side channel (a minimal sketch follows the list). BLIP-2 is a close relative: its Q-Former cross-attends to the vision encoder, but the resulting query tokens are fed to the frozen LM as input rather than through layers inserted into the LM itself.

  2. Patch-as-token projection (LLaVA, MiniGPT-4, Qwen-VL). A vision encoder (typically a vision transformer trained with CLIP) produces a sequence of patch embeddings; a small MLP or linear projection maps each patch into the LM's token-embedding space; these "image tokens" are concatenated with the text tokens and fed through a standard decoder-only transformer. The LM is usually fine-tuned end-to-end (LLaVA, Qwen-VL), although MiniGPT-4 keeps it frozen and trains only the projection. This is the dominant open-source recipe (sketched after the list).

  3. End-to-end native multimodal (Chameleon; GPT-4o and Gemini are described by their developers as natively multimodal, though their internals are unpublished). The model is pretrained from scratch on interleaved image and text tokens (and, for GPT-4o, audio). In Chameleon there is no separate vision encoder bolted onto an LM: images are quantised into discrete tokens by a learned VQ-VAE-style tokeniser and processed by the same transformer that handles text. This approach gives the strongest cross-modal generation but requires far more compute.
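
A minimal PyTorch sketch of the gated cross-attention bridge in pattern 1, assuming illustrative shapes; `GatedCrossAttentionBlock` and its hyperparameters are hypothetical stand-ins, not taken from any released Flamingo or IDEFICS code.

```python
# Hypothetical sketch of a Flamingo-style gated cross-attention block.
# The frozen LM's hidden states act as queries; a short sequence of
# image tokens supplies keys/values, and tanh gates initialised at
# zero let vision blend in gradually during training.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Gates start at zero, so at initialisation the block is an
        # identity map and the pretrained LM is undisturbed.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, lm_hidden, image_tokens):
        # lm_hidden:    (batch, text_len, d_model) from the frozen LM
        # image_tokens: (batch, img_len,  d_model) from the vision side
        attn_out, _ = self.attn(lm_hidden, image_tokens, image_tokens)
        x = lm_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x
```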
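
And a minimal sketch of the patch-as-token recipe in pattern 2, again with hypothetical module names and illustrative dimensions (a CLIP ViT-L/14 produces 1024-dimensional patch embeddings; 7B-class LMs typically embed tokens in 4096 dimensions):

```python
# Hypothetical sketch of the LLaVA-style patch-as-token recipe. The
# vision encoder and LM stand in for real pretrained models; the only
# newly initialised module is the projection from patch embeddings to
# the LM's token-embedding width.
import torch
import torch.nn as nn

class PatchProjector(nn.Module):
    """Map vision-encoder patch embeddings into the LM embedding space."""
    def __init__(self, d_vision: int, d_lm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_lm), nn.GELU(), nn.Linear(d_lm, d_lm),
        )

    def forward(self, patches):          # (batch, n_patches, d_vision)
        return self.proj(patches)        # (batch, n_patches, d_lm)

def build_multimodal_prefix(patch_embeds, text_embeds, projector):
    """Concatenate projected image tokens with embedded text tokens;
    the result feeds a standard decoder-only transformer."""
    image_tokens = projector(patch_embeds)
    return torch.cat([image_tokens, text_embeds], dim=1)

# Usage with illustrative dimensions:
projector = PatchProjector(d_vision=1024, d_lm=4096)
patches = torch.randn(1, 576, 1024)   # 24x24 patch grid from a ViT
text = torch.randn(1, 32, 4096)       # embedded text prompt
prefix = build_multimodal_prefix(patches, text, projector)
print(prefix.shape)                   # torch.Size([1, 608, 4096])
```

Part of why this recipe dominates open-source work is visible in the sketch: the only parameters trained from scratch are in the small projector, so both the vision encoder and the LM can start from strong pretrained checkpoints.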

Mathematical view. Let $I$ be an image and $x_{1:T}$ a text prefix. The VLM defines

$$p(x_{T+1:T+L} \mid I, x_{1:T}) = \prod_{t=T+1}^{T+L} p(x_t \mid x_{<t}, I; \theta)$$

where the conditioning on $I$ is implemented via the chosen architectural bridge. Training combines language modelling loss on text-only corpora, image–caption contrastive or generative loss, and instruction-following supervised fine-tuning on multimodal datasets such as LLaVA-Instruct.
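
As a concrete reading of this factorisation, the sketch below scores a continuation under a hypothetical pattern-2 model: `lm`, `embed`, and all shapes are illustrative assumptions, not a real API.

```python
# Hypothetical sketch: image tokens are prepended to the text prefix,
# and the continuation is scored token by token via the chain rule.
# `lm` stands in for any decoder-only model returning next-token
# logits of shape (batch, seq_len, vocab); `embed` is its token
# embedding. Neither is a real library call.
import torch
import torch.nn.functional as F

@torch.no_grad()
def continuation_log_prob(lm, embed, image_tokens, prefix_ids,
                          continuation_ids):
    """log p(x_{T+1:T+L} | I, x_{1:T}) for a pattern-2 model."""
    ids = torch.cat([prefix_ids, continuation_ids], dim=1)
    inputs = torch.cat([image_tokens, embed(ids)], dim=1)
    logits = lm(inputs)                  # (1, n_img + T + L, vocab)
    n_img, T = image_tokens.shape[1], prefix_ids.shape[1]
    total = 0.0
    for t in range(continuation_ids.shape[1]):
        # The position that predicts continuation token t is the one
        # just before it in the combined sequence.
        pos = n_img + T + t - 1
        log_p = F.log_softmax(logits[0, pos], dim=-1)
        total += log_p[continuation_ids[0, t]].item()
    return total
```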

Capabilities and benchmarks. Modern VLMs handle visual question answering (VQAv2, GQA), document understanding (DocVQA, ChartQA), scene-text reading without an external OCR pipeline (TextVQA), referring-expression grounding (RefCOCO), and screenshot-grounded GUI control. Frontier benchmarks include MMMU (college-level multimodal reasoning) and MathVista (visual mathematics).

Limitations. VLMs hallucinate objects that are not present, struggle with precise counting and spatial relations, and remain brittle on adversarial or out-of-distribution images. Input resolution is typically capped (roughly 224 to 1344 pixels on a side), so reading fine-grained text in images requires tiling the image into crops or a specialised high-resolution encoder (a tiling sketch follows).
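
A minimal sketch of that tiling workaround, using only Pillow; the tile size and the global-view-plus-crops layout are illustrative assumptions in the spirit of LLaVA-NeXT-style approaches, not any specific system's preprocessing.

```python
# Hypothetical tiling sketch: split a high-resolution image into
# encoder-sized crops plus one downscaled global view. Each view is
# encoded separately and the resulting tokens are concatenated.
from PIL import Image

def tile_image(img: Image.Image, tile: int = 336):
    """Return a global thumbnail plus a grid of tile-sized crops."""
    views = [img.resize((tile, tile))]          # low-res global view
    w, h = img.size
    cols = max(1, (w + tile - 1) // tile)
    rows = max(1, (h + tile - 1) // tile)
    # Resize so the grid divides evenly, then crop each cell.
    grid = img.resize((cols * tile, rows * tile))
    for r in range(rows):
        for c in range(cols):
            box = (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            views.append(grid.crop(box))
    return views
```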

Related terms: CLIP, Vision Transformer, Transformer, LLaVA, Flamingo, GPT-4V and GPT-4o Vision, Gemini 2.x, Multimodal Model
