15.15 Multimodal models
The frontier of artificial intelligence in 2026 is no longer purely textual. The systems we now use day to day (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5, and their open-weight counterparts) accept images, audio, video, and text as input, and several of them produce more than one of these as output. A user can paste a screenshot of a chart, a snippet of a research paper, and a recorded voice memo into the same conversation, and the model will reason across all three. This is a significant practical capability, but architecturally it is less of a revolution than it may appear. The transformer is a sequence model: if you can turn a modality into a sequence of vectors, you can shove it through the same blocks of attention and feed-forward layers that handle words. The hard work has shifted from inventing new architectures to engineering good tokenisers for each modality, gathering aligned data, and training without one modality drowning out the others.
In §15.14 we saw how retrieval-augmented generation grafts an external knowledge store onto a language model. Multimodal modelling is in some sense the dual problem: rather than expanding the context the model can draw on, we expand the input space it can perceive. The two trends combine naturally (a multimodal RAG system can index PDF pages as images, audio transcripts, and frame-level video captions all at once), and they share the property that the underlying transformer is essentially unchanged. What follows is a tour of the families that matter: vision-language models, natively multimodal frontier systems, the tokenisation tricks that make images and audio look like text, and the very young field of video generation.
It is worth keeping a running mental model of the engineering challenge. For text alone, a frontier lab spends the bulk of its budget on data curation, distributed training infrastructure, and post-training alignment. For multimodal training, every one of those line items multiplies. Image-text alignment requires hundreds of millions to billions of curated pairs, with dedup, safety filtering, and careful balancing across languages and domains. Audio data has to be transcribed, often by an earlier model, before it can be paired with text. Video data is the worst of both worlds: enormous in raw size, expensive to caption, and frequently encumbered by copyright. The compute cost of a forward pass also grows: a single high-resolution image consumes the same context budget as several pages of text, so multimodal models are routinely run with longer context windows and more aggressive KV-cache optimisations than their text-only siblings.
Vision-language models
The first generation of practical vision-language systems descended from CLIP, introduced by Radford and colleagues at OpenAI in 2021. CLIP trains two encoders in parallel (a vision transformer for images, a text transformer for captions) and aligns them through a contrastive objective: the cosine similarity between an image and its true caption should be higher than the similarity between that image and any other caption in the batch. After training on roughly four hundred million image-text pairs scraped from the web, CLIP exhibited the property that made it foundational: zero-shot classification. To classify an image among a thousand categories, you embed each category name as a sentence ("a photo of a goldfinch", "a photo of a Bengal tiger") and pick whichever embedding lands closest to the image's. No fine-tuning, no labelled examples. CLIP also gave generative models, particularly Stable Diffusion, a way to condition on text without training a bespoke caption encoder.
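To make the objective concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss and the zero-shot scoring rule. The embeddings are random stand-ins for encoder outputs, and the temperature value and function names are illustrative rather than CLIP's exact configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors; row i of each is a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) matrix of cosine similarities, scaled by a temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own caption, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding lies closest to the image embedding.

    image_emb: (D,) embedding of one image.
    class_text_embs: (K, D) embeddings of prompts like "a photo of a goldfinch".
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return int((class_text_embs @ image_emb).argmax())

# Toy usage with random embeddings standing in for encoder outputs.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
print(zero_shot_classify(torch.randn(512), torch.randn(1000, 512)))
```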
CLIP's representation space remains the workhorse of open-source multimodal AI. The next step was to plug it into a language model so the system could not only classify but also describe, reason about, and answer questions about images. LLaVA (Liu et al., 2023) was the canonical recipe: take a pre-trained CLIP vision encoder, take a pre-trained LLaMA language model, train a small linear projection between the two on a synthetic instruction-tuning dataset of image-question-answer triples, and fine-tune end-to-end. The result was a system that could read a chart and explain it, count objects in a photograph, and roughly transcribe documents. It was wonky compared with GPT-4V (hallucinations were frequent and OCR was crude), but it showed that the projection-layer architecture was sufficient. Its descendants (LLaVA-1.5, LLaVA-NeXT, Llama-3 vision, Qwen-VL, InternVL, Pixtral) refined the recipe with bigger encoders, higher input resolution, multi-image inputs, and far more training data, until by 2025 the open-source frontier was within a few points of GPT-4V on standard visual question-answering benchmarks. Document and chart understanding remains the most reliable application: layout-aware VLMs power the screenshot-driven "computer use" agents that Anthropic and OpenAI shipped in late 2024 and 2025.
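The projection-layer idea is small enough to sketch directly. The module below is illustrative rather than LLaVA's actual code; the dimensions assume a CLIP ViT-L-style encoder (1,024-dimensional patch features) and a 4,096-dimensional LLM embedding space.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map frozen vision-encoder patch features into the LLM's embedding space.

    LLaVA used a single linear layer; LLaVA-1.5 switched to a two-layer MLP.
    """
    def __init__(self, d_vision=1024, d_llm=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, patch_features):        # (B, n_patches, d_vision)
        return self.proj(patch_features)      # (B, n_patches, d_llm)

# The projected patches are simply spliced into the text-token embedding
# sequence before it enters the language model's transformer blocks.
projector = VisionToLLMProjector()
image_tokens = projector(torch.randn(1, 576, 1024))        # e.g. 24x24 patches
text_embeds = torch.randn(1, 32, 4096)                      # embedded prompt tokens
llm_input = torch.cat([image_tokens, text_embeds], dim=1)   # (1, 608, 4096)
print(llm_input.shape)
```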
The training pipeline for a VLM is itself a microcosm of modern AI. Stage one is contrastive pre-training of the vision encoder, either lifted from CLIP or trained afresh on a curated corpus of image-text pairs. Stage two is alignment: the projection layer is trained alone, with the LLM frozen, on a relatively small set of carefully curated captions, so that image embeddings settle into a region of the language model's input space that the existing weights can interpret. Stage three is instruction-tuning on diverse visual tasks (counting, OCR, chart reading, mathematical figures, screenshots, medical images), usually with the LLM unfrozen and fine-tuned alongside the projection. Stage four, when budgets allow, is preference optimisation with human or synthetic feedback to discourage hallucinated objects and confabulated text. Each stage has its own data scaling law and its own failure mode, and the recipes of the leading labs differ mainly in what they pour in at stages two and three.
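In code, stages two and three differ mainly in which parameters receive gradients. A schematic sketch, with toy modules standing in for the real vision encoder, projector, and language model; actual freezing schedules vary by lab and budget.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vision_encoder, projector, llm):
    """Illustrative freezing schedule for the alignment and instruction stages."""
    if stage == 2:        # alignment: train only the projection layer
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif stage == 3:      # instruction-tuning: unfreeze the LLM as well
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)

# Toy modules standing in for the real components.
vision_encoder, projector, llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
configure_stage(2, vision_encoder, projector, llm)
print(any(p.requires_grad for p in llm.parameters()))   # False during stage two
```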
Native multimodal models
The successor architecture, pioneered with Gemini in 2023 and pushed further by GPT-4o in 2024, abandons the bolt-on encoder. Native multimodal models are pre-trained from the start on interleaved sequences of text tokens, image patches, and audio frames, with a single transformer learning to attend across all of them. The economic argument is shared compute: one set of weights, one loss, one forward pass, instead of a duplicated encoder for each modality. The scientific argument is grounding: a model that has seen an image of a giraffe alongside a paragraph describing one should learn richer associations than two separate models stitched together. Empirically, the native systems are stronger on tasks that genuinely require fusion (describing the emotion of a piece of music, transcribing a handwritten chemistry equation, narrating a chess game frame by frame), and they handle real-time conversational audio with sub-300-millisecond latency, which the cascaded ASR-LLM-TTS pipelines simply cannot match. The cost is engineering complexity: training mixes have to be balanced carefully or one modality dominates, and the tokenisers for audio and images are themselves substantial subsystems. By April 2026 the leading frontier systems (GPT-4o successors, Claude 3.5 Sonnet, Gemini 1.5 and 2.0) are all native to varying degrees, and "vision capability" is no longer a differentiator the way it was in 2023.
A subtle consequence of native training is that emergent behaviours appear at the boundaries between modalities. Gemini-class models can solve mathematical problems written by hand on a whiteboard at near-textual accuracy, because handwritten symbols and printed equations occupy nearby regions of the joint embedding space. GPT-4o's voice mode can sing, whisper, and switch accents on demand, because the audio token stream learned during pre-training was conditioned on text describing exactly those properties. The danger is that mixed training can also blur capabilities: an image-heavy training mix can blunt the model's coding performance, and an audio-heavy mix can degrade reading comprehension. Frontier labs spend significant compute on ablation studies just to find a mixture that does not regress any one capability while improving the others, and the recipes are closely guarded.
Image patches as tokens
The trick that made vision tractable inside a transformer is conceptually trivial. The vision transformer (Dosovitskiy et al., 2020) divides an input image into a grid of fixed-size patches (16 by 16 pixels is canonical, though modern frontier models use larger or variable-resolution schemes), flattens each patch into a vector, and applies a single linear projection to map it into the model's embedding dimension. A learned positional embedding is added so the transformer knows where each patch sits in the grid, and the resulting sequence is fed to the same self-attention blocks the language model uses. A 224-by-224 image at 16-pixel patches becomes 196 tokens; a 1024-by-1024 image becomes about four thousand. This is why high-resolution image input is expensive: a single screenshot can cost the same as a long paragraph of text. Modern systems use various tricks to compensate (adaptive patch sizes, image pyramids, perceiver-style cross-attention bottlenecks that compress many patches into a small fixed budget), but the basic ViT recipe remains the standard front end. Cross-modal training works because, after the projection, image patches and text tokens are just vectors in the same space; attention does not know or care which is which.
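The whole front end fits in a few lines. The sketch below uses the canonical 224-pixel image and 16-pixel patches; a production ViT would add a class token or pooling and be trained end to end, but the token arithmetic is the same.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to d_model."""
    def __init__(self, patch=16, channels=3, d_model=768, image_size=224):
        super().__init__()
        self.n_patches = (image_size // patch) ** 2          # 224/16 = 14 -> 196
        # A strided convolution is the standard way to flatten-and-project patches.
        self.proj = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, d_model))

    def forward(self, images):                # (B, 3, 224, 224)
        x = self.proj(images)                 # (B, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, d_model): a token sequence
        return x + self.pos                   # add learned positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                           # torch.Size([2, 196, 768])
```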
Audio
Audio is harder to handle than vision because it carries information at multiple time-scales simultaneously: phonemes at tens of milliseconds, words at hundreds, prosody and emotion across seconds, and discourse-level structure across minutes. Audio entered the transformer era through Whisper, released by OpenAI in late 2022. Whisper converts raw waveform into an 80-channel log-mel spectrogram, a two-dimensional representation of energy across frequency bands over time, and feeds a downsampled version of that spectrogram, again chunked into patch-like tokens, to an encoder-decoder transformer. The model was trained on 680,000 hours of weakly supervised speech scraped from the web, covering 99 languages, and it remains the de facto open-source automatic speech recognition system. For generation the picture is more varied. AudioLM, MusicGen, and their successors generate audio autoregressively in a discrete token space provided by a neural codec, typically SoundStream or EnCodec, which compresses 24 kHz audio into about 75 frames per second, each frame represented by discrete tokens from several quantiser levels. The codec is a small autoencoder trained with a vector-quantisation bottleneck; once you have it, audio generation is structurally identical to text generation. Speech-to-speech models such as those underlying GPT-4o's voice mode use the same idea but jointly model text and audio tokens, which lets them preserve emotion, speaker identity, and prosody that a cascaded pipeline would discard. Voice cloning from five-second samples is now routine; the abuse risk has driven most providers to watermark synthetic audio and to require consent flows before cloning a target voice. The arms race between detectors and generators is unresolved, and several jurisdictions have tightened election-related rules in response to high-profile incidents in 2024 and 2025.
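The spectrogram front end is simple to reproduce. The sketch below uses torchaudio with parameters that approximate, but do not exactly match, Whisper's preprocessing (16 kHz audio, 25 ms windows, 10 ms hops, 80 mel bins).

```python
import torch
import torchaudio

# Whisper-style front end: 16 kHz mono audio, 400-sample FFT, 160-sample hop, 80 mels.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=80
)

waveform = torch.randn(1, 16_000 * 5)           # 5 seconds of (fake) audio
mel = mel_transform(waveform)                   # (1, 80, ~500): energy per band per frame
log_mel = torch.log10(mel.clamp(min=1e-10))     # compress the dynamic range
print(log_mel.shape)                            # roughly 100 frames per second
```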
Video
Video generation is the youngest of the multimodal frontiers, and the least mature. Sora, announced by OpenAI in February 2024, set the template: a diffusion transformer operating not on individual frames but on three-dimensional space-time patches extracted from a learned video tokeniser. The model is trained to denoise these patches conditioned on a text prompt; the denoised patches are then decoded back to pixels. Veo and Veo 2 from Google DeepMind followed in 2024, integrated with Gemini and competitive on most quality benchmarks. DeepMind's Genie 2 and Genie 3 took a different angle, training on game footage to produce interactive video in which a user's keypresses steer the next frame, effectively a learned game engine. Open-source efforts (HunyuanVideo, Mochi, LTX-Video, Wan) trail the closed labs by perhaps six to twelve months on subject coherence and prompt adherence. Two technical problems remain unresolved. Physical plausibility: Sora-class models still produce cats with five legs in fast motion and ropes that pass through themselves, because the diffusion process has no built-in physics. Long-horizon consistency: subjects, lighting and background drift over the course of a minute, because the model has no persistent state across the chunks of video it generates. The compute cost is dramatic (minutes of GPU time per minute of video), which keeps these systems out of real-time interactive use for now.
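What "space-time patches" means in tensor terms is easy to show. The sketch below chops a toy video tensor into three-dimensional patches and flattens each into one token; the patch sizes and latent shape are illustrative assumptions, not Sora's actual configuration.

```python
import torch

def spacetime_patchify(video, t_patch=4, s_patch=16):
    """Chop a video tensor into 3-D patches and flatten each into one token.

    video: (T, C, H, W); T, H and W must be divisible by the patch sizes.
    Returns: (n_tokens, t_patch * s_patch * s_patch * C)
    """
    T, C, H, W = video.shape
    x = video.reshape(T // t_patch, t_patch,
                      C,
                      H // s_patch, s_patch,
                      W // s_patch, s_patch)
    # Group the three patch-index axes together, then flatten each patch.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)            # (nT, nH, nW, t, h, w, C)
    return x.reshape(-1, t_patch * s_patch * s_patch * C)

tokens = spacetime_patchify(torch.randn(16, 4, 64, 64))
print(tokens.shape)    # 4 * 4 * 4 = 64 tokens, each of dimension 4*16*16*4 = 4096
```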
Beyond pure generation, video models are increasingly being treated as approximate world models: simulators trained on enough footage that they capture some of the dynamics of the physical world, useful for planning by robotic agents or as test environments for embodied policies. The line between a generative video model, an interactive game engine, and a learned physics simulator has begun to blur, and several research groups are betting that this convergence is where the next plateau of capability will arrive. Whether the resulting "world" is faithful enough for a robot to plan in, rather than merely visually pleasing, is the question that will be settled empirically in the next few years.
What you should take away
- Modern frontier models are multimodal almost by default. Vision is universal, audio is converging, and video is the active frontier.
- The architecture is mostly unchanged. Patches and frames become tokens, tokens go through a transformer, and the model neither knows nor cares which modality each token came from.
- CLIP's contrastive image-text alignment underlies most open-source visual reasoning, including the conditioning of diffusion models such as Stable Diffusion.
- Whisper is the standard speech-recognition workhorse; audio generation runs through neural codecs (SoundStream, EnCodec) that turn waveforms into discrete tokens.
- Sora and its peers show that high-quality video generation from text is possible, but physical plausibility, long-horizon consistency, and generation cost remain open problems that will define the next two years.