Glossary

Multimodal Model

A Multimodal Model integrates two or more data modalities—vision, language, audio, video, and others—into a single system. Unlike unimodal models that handle only text or only images, multimodal models can describe images in words, answer questions about visual content, generate images from text, or reason about relationships between textual and visual information.

The foundational work in vision-language alignment is CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021). CLIP trains an image encoder and a text encoder jointly on 400 million image-text pairs using a contrastive objective: maximise similarity between matching pairs while minimising it for non-matching pairs. The resulting shared embedding space enables zero-shot image classification (by comparing an image's embedding to text embeddings of candidate class descriptions) and cross-modal retrieval, and serves as the foundation for many text-to-image systems.
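The contrastive objective and zero-shot classification described above can be sketched in a few lines. This is a minimal illustration with NumPy, not CLIP's actual implementation: the function names, batch size, and temperature value are illustrative, and real CLIP computes this loss over learned encoder outputs rather than raw arrays.

```python
import numpy as np

def normalize(x):
    # L2-normalise rows so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of pairs.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matching pair.
    """
    img, txt = normalize(image_emb), normalize(text_emb)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # the diagonal holds matching pairs

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions: image->text and text->image
    return (xent(logits) + xent(logits.T)) / 2

def zero_shot_classify(image_emb, class_text_embs):
    # Pick the class whose text description is nearest in the shared space
    sims = normalize(image_emb) @ normalize(class_text_embs).T
    return int(np.argmax(sims))
```

As a sanity check, a batch where image and text embeddings are perfectly aligned should score a lower loss than the same batch with the pairing shuffled, since the contrastive objective rewards matching pairs sitting on the diagonal.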

Modern Large Multimodal Models (LMMs) like GPT-4V, Gemini, LLaVA, and Claude extend the LLM paradigm to accept images and other modalities alongside text. A typical architecture uses a pretrained vision encoder (often a CLIP ViT) to convert images into visual tokens, a projection layer mapping them into the language model's embedding space, and an LLM processing the interleaved sequence. Multimodal generation includes text-to-image (DALL·E 3, Stable Diffusion XL, Imagen), text-to-video (Sora), and text-to-audio. The trend toward "any-to-any" models that accept and produce any combination of modalities represents a natural endpoint of multimodal research.
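The visual-token pipeline above (vision encoder → projection layer → interleaved sequence) can be sketched as follows. All dimensions and the random stand-in encoder are assumptions for illustration; in a real LMM such as LLaVA, the encoder is a pretrained CLIP ViT and the projection is a learned linear layer or small MLP.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions: a ViT emitting 256-dim patch features, projected
# into a language model's 512-dim token embedding space.
VIT_DIM, LLM_DIM, NUM_PATCHES = 256, 512, 16

def vision_encoder(image):
    # Stand-in for a pretrained vision encoder: image -> (num_patches, vit_dim).
    # A real model would run ViT attention over image patches here.
    return rng.normal(size=(NUM_PATCHES, VIT_DIM))

# Learned projection mapping visual features into the LLM embedding space
W_proj = rng.normal(size=(VIT_DIM, LLM_DIM)) * 0.02

def embed_multimodal(image, text_token_embs):
    visual_tokens = vision_encoder(image) @ W_proj   # (16, 512) visual tokens
    # Concatenate visual and text tokens into one sequence; the LLM then
    # attends over both modalities with its ordinary self-attention.
    return np.concatenate([visual_tokens, text_token_embs], axis=0)

text_tokens = rng.normal(size=(10, LLM_DIM))   # 10 embedded text tokens
sequence = embed_multimodal(None, text_tokens)  # shape (26, 512)
```

The key design point is that after projection, image patches are just ordinary tokens to the language model, which is why a pretrained LLM can be adapted to vision with relatively little multimodal training data.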

Related terms: CLIP, Large Language Model, Vision Transformer, Diffusion Model

Also defined in: Textbook of AI