Glossary

PaLI and PaLI-3

PaLI ("Pathways Language and Image") is the multilingual vision-language model family from Google Research. The original PaLI was introduced by Xi Chen et al. in 2022; PaLI-X (2023) scaled to 55B parameters; PaLI-3 (Chen et al., 2023) reduced to 5B parameters while improving accuracy by changing the vision pretraining recipe.

Architecture. Unlike LLaVA-style decoder-only VLMs, PaLI uses an encoder-decoder transformer:

  1. Vision encoder. A vision transformer (ViT) of varying scale (ViT-e/14 in PaLI-17B; ViT-22B in PaLI-X; a SigLIP-pretrained ViT-G/14 in PaLI-3). Image patches are encoded into dense features.

  2. Multilingual text encoder. The visual features are projected into the text embedding space and concatenated with the embedded text tokens; the combined sequence is processed by an mT5-style encoder.

  3. Decoder. An autoregressive decoder generates output text, with cross-attention to the joint image-text encoder states.

Because image and text tokens share a single encoder sequence, the architecture is closer to a sequence-to-sequence machine-translation model than to a chat LM.
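A minimal sketch of this data flow follows, using PyTorch stand-ins for the ViT and mT5 components; all module names, sizes, and shapes here are illustrative assumptions rather than the released implementation:

```python
# Illustrative PaLI-style encoder-decoder data flow (not the released code).
import torch
import torch.nn as nn

class PaLIStyleSeq2Seq(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # Stand-in for the ViT: maps flattened 14x14 RGB patches to dense features.
        self.vit = nn.Sequential(
            nn.Linear(3 * 14 * 14, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Project visual features into the text-token embedding space.
        self.visual_proj = nn.Linear(d_model, d_model)
        # Stand-in for the mT5-style encoder-decoder and output head.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, input_ids, target_ids):
        visual_tokens = self.visual_proj(self.vit(patches))    # (B, P, D)
        text_tokens = self.text_embed(input_ids)                # (B, S, D)
        # Visual and text tokens share one sequence at the encoder input.
        encoder_input = torch.cat([visual_tokens, text_tokens], dim=1)
        decoder_input = self.text_embed(target_ids)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(target_ids.size(1))
        # The decoder cross-attends to the joint image-text encoder states.
        hidden = self.seq2seq(encoder_input, decoder_input, tgt_mask=causal_mask)
        return self.lm_head(hidden)                              # (B, T, vocab)

model = PaLIStyleSeq2Seq()
patches = torch.randn(2, 196, 3 * 14 * 14)       # 2 images, 196 flattened patches each
input_ids = torch.randint(0, 32000, (2, 16))     # e.g. a tokenised question
target_ids = torch.randint(0, 32000, (2, 8))     # e.g. the tokenised answer
logits = model(patches, input_ids, target_ids)   # (2, 8, 32000)
```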

Training data. PaLI's distinctive feature is WebLI, a 10-billion-image multilingual web dataset spanning 109 languages, plus translated and re-captioned subsets. This breadth of multilingual coverage goes far beyond the largely English-only training data of contemporary VLMs.

PaLI-3 contribution. PaLI-3 showed that contrastively pretrained vision encoders (SigLIP) outperform classification-pretrained ones (JFT) on VLM downstream tasks, even though classification-pretrained encoders win on ImageNet linear probes. This finding reinforced the choice of CLIP-style contrastive encoders in subsequent VLMs. At 5B parameters, PaLI-3 matched or beat the 55B-parameter PaLI-X on captioning, VQA, and document-understanding benchmarks.
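For context, the SigLIP recipe pretrains the vision encoder with a pairwise sigmoid loss rather than the usual softmax contrastive loss: every image-text pair in the batch is treated as an independent binary classification problem. A minimal sketch with illustrative batch and embedding sizes (the temperature `t` and bias `b` are learnable in the original recipe):

```python
# Sketch of a SigLIP-style pairwise sigmoid contrastive loss (illustrative sizes).
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (B, D) L2-normalised embeddings of paired images and texts."""
    logits = t * img_emb @ txt_emb.T + b             # (B, B) pairwise similarities
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 for matched pairs, -1 otherwise
    # Each image-text pair contributes an independent binary term.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

img = F.normalize(torch.randn(8, 256), dim=-1)       # 8 image embeddings
txt = F.normalize(torch.randn(8, 256), dim=-1)       # 8 paired text embeddings
loss = siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0))
```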

Mathematical sketch. The training objective is the standard sequence-to-sequence cross-entropy:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, I, x; \theta)$$

where $I$ is the image, $x$ is the input text (e.g. a question), and $y$ is the target text (e.g. an answer in any of 100+ languages).
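In code, assuming decoder logits of shape (batch, target length, vocabulary) as in the architecture sketch above, this reduces to token-level cross-entropy with padding positions masked out; a minimal sketch:

```python
# Token-level cross-entropy for the seq2seq objective (padding masked out).
import torch
import torch.nn.functional as F

def seq2seq_loss(logits, target_ids, pad_id=0):
    """logits: (B, T, vocab) decoder outputs; target_ids: (B, T) gold tokens."""
    total = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten(),
                            ignore_index=pad_id, reduction="sum")
    # Sum of -log p(y_t | y_<t, I, x) over non-padding target positions.
    return total / (target_ids != pad_id).sum()
```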

Legacy. PaLI's SigLIP-style contrastive vision encoder and multilingual training corpus informed Gemini's vision component. The encoder-decoder architecture has fallen out of favour for chat applications but remains competitive for translation, captioning, and OCR-heavy benchmarks.

Related terms: Vision-Language Model, Vision Transformer, CLIP, Gemini Multimodal
