PaLI ("Pathways Language and Image") is the multilingual vision-language model family from Google Research. The original PaLI was introduced by Xi Chen et al. in 2022; PaLI-X (2023) scaled to 55B parameters; PaLI-3 (Chen et al., 2023) reduced to 5B parameters while improving accuracy by changing the vision pretraining recipe.
Architecture. Unlike LLaVA-style decoder-only VLMs, PaLI uses an encoder-decoder transformer:
Vision encoder. A vision transformer (ViT) of varying scale (the 4B-parameter ViT-e in PaLI-17B; ViT-22B in PaLI-X; a SigLIP-pretrained ViT-G/14 in PaLI-3). Image patches are encoded into dense features.
Multilingual text encoder. Text tokens are embedded and passed into an mT5-style encoder together with the image features; self-attention over the joint sequence mixes the two modalities.
Decoder. An autoregressive decoder generates the output text, with cross-attention to the joint image-text encoding.
The image tokens and text tokens are concatenated at the encoder input, so the architecture is closer to a sequence-to-sequence machine-translation model than to a chat LM.
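To make the data flow concrete, here is a minimal PyTorch sketch of this encoder-decoder layout. Everything here is illustrative: the dimensions, the `nn.Transformer` standing in for the mT5 backbone, and the linear `patch_proj` standing in for a full ViT are assumptions, not the released PaLI implementation.

```python
# Sketch of a PaLI-style forward pass: a ViT encodes the image into patch
# tokens, those tokens are concatenated with text-token embeddings at the
# encoder input, and an autoregressive decoder cross-attends to the result.
import torch
import torch.nn as nn

class PaLISketch(nn.Module):
    def __init__(self, vocab=32_000, d=512):
        super().__init__()
        self.patch_proj = nn.Linear(768, d)   # stand-in for a full ViT backbone
        self.text_embed = nn.Embedding(vocab, d)
        self.encdec = nn.Transformer(
            d_model=d, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, patch_feats, input_ids, target_ids):
        img_tok = self.patch_proj(patch_feats)   # (B, P, d) visual tokens
        txt_tok = self.text_embed(input_ids)     # (B, T, d) prompt tokens
        # Key design choice: concatenate the modalities at the *encoder*
        # input, so self-attention mixes image and text; the decoder then
        # cross-attends to the joint encoding.
        src = torch.cat([img_tok, txt_tok], dim=1)
        tgt = self.text_embed(target_ids)
        causal = self.encdec.generate_square_subsequent_mask(tgt.size(1))
        h = self.encdec(src, tgt, tgt_mask=causal)
        return self.lm_head(h)                   # (B, T_out, vocab) logits

B = 2
logits = PaLISketch()(torch.randn(B, 196, 768),          # 14x14 patch features
                      torch.randint(0, 32_000, (B, 12)),  # question tokens
                      torch.randint(0, 32_000, (B, 8)))   # answer tokens
```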
Training data. PaLI's distinctive ingredient is WebLI, a 10-billion-image multilingual web dataset with alt-text spanning 109 languages, plus translated and re-captioned subsets. This multilingual coverage goes far beyond the largely English-only training data of contemporary VLMs.
PaLI-3 contribution. PaLI-3 showed that contrastively pretrained vision encoders (SigLIP) outperform classification-pretrained ones (JFT) on VLM downstream tasks, even though the classification-pretrained encoders win on ImageNet linear probes. This finding reinforced the choice of CLIP-style contrastive encoders in subsequent VLMs. At 5B parameters, PaLI-3 matched or beat the 55B-parameter PaLI-X on captioning, VQA, and document-understanding benchmarks.
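For contrast with the classification objective, below is a minimal sketch of the SigLIP-style pairwise sigmoid loss used to pretrain PaLI-3's vision encoder. The batch size and embedding width are arbitrary, and the temperature/bias values follow the SigLIP paper's reported initialisation; this is an illustration of the loss, not the production training code.

```python
# SigLIP-style loss: unlike the softmax CLIP loss, every image-text pair is
# an independent binary classification. Matched pairs (the diagonal) are
# positives; all other pairings in the batch are negatives.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    # img_emb, txt_emb: (N, d) L2-normalised embeddings for N matched pairs
    logits = t * img_emb @ txt_emb.T + b        # (N, N) pairwise scores
    labels = 2 * torch.eye(len(img_emb)) - 1    # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit), summed over all N^2 pairs, per image
    return -F.logsigmoid(labels * logits).sum() / len(img_emb)

N, d = 8, 256
img = F.normalize(torch.randn(N, d), dim=-1)
txt = F.normalize(torch.randn(N, d), dim=-1)
loss = siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0))
```

Because the sigmoid treats each pair independently, the loss does not require a global softmax over the batch, which is part of why it scales well to large batches.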
Mathematical sketch. The training objective is the standard sequence-to-sequence cross-entropy:
$$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, I, x; \theta)$$
where $I$ is the image, $x$ is the input text (e.g. a question), and $y$ is the target text (e.g. an answer in any of 100+ languages).
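The same objective in code: token-level cross-entropy over the target sequence, given decoder logits from a model like the sketch above. The shapes and the -100 padding label are illustrative assumptions.

```python
# Sequence-to-sequence cross-entropy: mean of -log p(y_t | y_<t, I, x; theta)
# over non-padding target tokens.
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 32_000
logits = torch.randn(B, T, V)              # decoder outputs, one row per y_t
targets = torch.randint(0, V, (B, T))      # gold tokens y_1..y_T
targets[1, 5:] = -100                      # mask padding in shorter targets
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                       ignore_index=-100)  # averaged over unmasked tokens
```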
Legacy. PaLI's SigLIP-style contrastive vision encoder and multilingual training corpus directly informed Gemini's vision component. The encoder-decoder architecture has fallen out of favour for chat applications but remains competitive for translation, captioning, and OCR-heavy benchmarks.
Related terms: Vision-Language Model, Vision Transformer, CLIP, Gemini Multimodal
Discussed in:
- Chapter 11: CNNs, Vision-Language Models