LLaVA (Large Language and Vision Assistant) is an open-source vision-language model released by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee in 2023, with subsequent improved versions (LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision). It established the patch-as-token recipe that the open-source community adopted as the default: a frozen CLIP image encoder, a small projection layer, and a chat-tuned language model fine-tuned on multimodal instructions.
Architecture. Given an input image $I$:
- The CLIP ViT-L/14 encoder produces patch features $Z_v \in \mathbb{R}^{N \times d_v}$, where $N = (336/14)^2 = 576$ patches at $336 \times 336$ input resolution.
- A projection $W$ (a single linear layer in the original LLaVA, a two-layer MLP from LLaVA-1.5 onward) maps each patch to the language model's token-embedding dimension: $$H_v = W \cdot Z_v \in \mathbb{R}^{N \times d_{\text{lm}}}.$$
- The image tokens $H_v$ are concatenated with text-token embeddings $H_q$ and passed to Vicuna, a chat-tuned Llama variant.
- Generation is standard autoregressive next-token prediction over the combined sequence (see the sketch below).
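A minimal PyTorch sketch of this pipeline follows; the dimensions, class name, and placeholder calls (`clip_encoder`, `lm`, `embed_tokens`) are illustrative assumptions, not the actual LLaVA code:

```python
import torch
import torch.nn as nn

class LlavaProjector(nn.Module):
    """Hypothetical projector mapping CLIP patch features into the LM embedding
    space (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5)."""
    def __init__(self, d_vision: int = 1024, d_lm: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, N=576, d_vision) from CLIP ViT-L/14 at 336 px
        return self.mlp(patch_features)  # -> (batch, 576, d_lm)

# Illustrative forward pass; clip_encoder, lm, and embed_tokens are placeholders:
# z_v = clip_encoder(images)                       # (B, 576, 1024) patch features Z_v
# h_v = LlavaProjector()(z_v)                      # (B, 576, 4096) image tokens H_v
# h_q = lm.embed_tokens(text_ids)                  # (B, T, 4096) text embeddings H_q
# inputs_embeds = torch.cat([h_v, h_q], dim=1)     # image tokens prepended to the prompt
# logits = lm(inputs_embeds=inputs_embeds).logits  # standard next-token prediction
```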
Training recipe. LLaVA introduced a two-stage protocol that became canonical:
Stage 1: feature alignment. Freeze both CLIP and the language model; train only $W$ on $\sim$558k image–caption pairs (filtered subset of CC3M/LAION). The MLP learns to map CLIP features into the LM's embedding manifold.
Stage 2: instruction tuning. Unfreeze the LM (the projection continues to train as well; only CLIP stays frozen); train on the LLaVA-Instruct-150K dataset, generated by prompting GPT-4 (text-only) with COCO captions and bounding-box metadata to produce conversation, detailed-description, and complex-reasoning examples. This synthetic dataset was itself a methodological contribution: it showed that instruction-tuning data could be bootstrapped from a stronger text model rather than collected from humans.
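A hedged sketch of what each stage makes trainable; the helper and module names below are assumptions for illustration, not LLaVA's training code:

```python
import torch.nn as nn

def configure_stage(clip_encoder: nn.Module, projector: nn.Module,
                    lm: nn.Module, stage: int) -> None:
    """Stage 1 trains only the projection; stage 2 also unfreezes the LM.
    The vision encoder stays frozen throughout (illustrative helper)."""
    for p in clip_encoder.parameters():
        p.requires_grad = False           # CLIP frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True            # projection W trained in both stages
    for p in lm.parameters():
        p.requires_grad = (stage == 2)    # language model unfrozen only in stage 2

# configure_stage(clip_encoder, projector, lm, stage=1)  # alignment on ~558k caption pairs
# configure_stage(clip_encoder, projector, lm, stage=2)  # instruction tuning on LLaVA-Instruct-150K
```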
Variants.
- LLaVA-1.5 (Liu et al. 2023b): two-layer MLP instead of single linear, higher resolution, academic-VQA mixture; closes much of the gap to proprietary VLMs at $\sim$13B parameters.
- LLaVA-NeXT (2024): dynamic high-resolution tiling (up to $672 \times 672$; see the sketch after this list), stronger reasoning, multi-image support.
- LLaVA-OneVision (2024): unified handling of single image, multi-image, and video.
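The dynamic-resolution idea can be illustrated with a small tiling helper: the image is cut into fixed-size crops that are each encoded by CLIP, alongside a downscaled global view. The grid logic below is a simplified assumption, not LLaVA-NeXT's exact implementation:

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 336) -> list:
    """Illustrative dynamic-resolution tiling in the spirit of LLaVA-NeXT:
    tile x tile crops plus a low-resolution global view (simplified grid selection)."""
    grid_w = max(1, round(img.width / tile))
    grid_h = max(1, round(img.height / tile))
    resized = img.resize((grid_w * tile, grid_h * tile))
    crops = [
        resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
        for y in range(grid_h) for x in range(grid_w)
    ]
    global_view = img.resize((tile, tile))  # low-resolution overview of the whole image
    return [global_view] + crops            # each crop is encoded by CLIP separately
```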
Significance. LLaVA's training cost (about $300 of GPU time for the original 7B model) demonstrated that competitive multimodal capability could be added to an existing LM with very little new compute, sparking a flood of derivative work (Qwen-VL, MiniCPM-V, InternVL, Phi-Vision, Idefics2). Its architectural simplicity (essentially "stick the image tokens at the front of the prompt") remains the default for open multimodal research.
Related terms: CLIP, Vision Transformer, Vision-Language Model, Transformer, Flamingo
Discussed in:
- Chapter 11: CNNs, Vision-Language Models