Gemini is Google DeepMind's flagship model family, first released in December 2023 (Gemini 1.0 in Ultra, Pro, and Nano sizes) and extended through Gemini 1.5 (2024, which pushed the context window to 1M and later 2M tokens), Gemini 2.0 (late 2024, with native tool use), and Gemini 2.5 (2025, with thinking modes). Unlike VLMs that bolt vision onto an existing LM, Gemini is described in its technical report as natively multimodal: pretrained from the start on interleaved text, image, audio, and video tokens.
Native multimodal architecture. All four modalities are tokenised into a shared vocabulary:
- Text via a SentencePiece BPE tokeniser.
- Images via a ViT-style patch encoder producing dense visual tokens.
- Audio via a learned audio codec at $\sim$25 frames per second.
- Video as a sequence of image frames plus audio tokens.
These tokens are fed into a single decoder-only transformer (a mixture-of-experts (MoE) variant from Gemini 1.5 onward). Cross-modal attention requires no special machinery: because all modalities live in the same token stream, every token can attend to every earlier token regardless of modality.
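To make the single-stream idea concrete, here is a minimal sketch in PyTorch. Every module name, dimension, and rate below is an illustrative assumption, not a detail from the technical report; the point is only that each modality is mapped into one shared embedding space and concatenated into a single causally masked sequence.

```python
# Minimal sketch of the single-token-stream idea (PyTorch). All names,
# sizes, and rates are illustrative assumptions, not Gemini's actual design.
import torch
import torch.nn as nn

DIM = 512  # shared model width (assumption)

class PatchEncoder(nn.Module):
    """ViT-style patchifier: (B, 3, H, W) -> (B, n_patches, DIM)."""
    def __init__(self, patch=16, dim=DIM):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):
        x = self.proj(images)                # (B, DIM, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, n_patches, DIM)

class ToyNativeMultimodal(nn.Module):
    """One decoder-only transformer over an interleaved multimodal stream."""
    def __init__(self, vocab=32000, dim=DIM):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)  # BPE ids -> embeddings
        self.image_enc = PatchEncoder(dim=dim)    # patches -> embeddings
        self.audio_proj = nn.Linear(80, dim)      # 80-dim frames at ~25 fps
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, text_ids, images, audio_frames):
        # Every modality lands in the same embedding space, then the
        # sequences are simply concatenated: cross-modal attention falls
        # out of ordinary self-attention over the joint stream.
        seq = torch.cat([
            self.image_enc(images),
            self.audio_proj(audio_frames),
            self.text_emb(text_ids),
        ], dim=1)
        n = seq.size(1)
        # A causal mask makes this "decoder-only" over the joint stream.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.lm_head(self.blocks(seq, mask=causal))

model = ToyNativeMultimodal()
logits = model(torch.randint(0, 32000, (1, 16)),  # 16 text tokens
               torch.randn(1, 3, 64, 64),         # one 64x64 image -> 16 patches
               torch.randn(1, 25, 80))            # 1 s of audio at 25 fps
print(logits.shape)  # torch.Size([1, 57, 32000])
```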
Long context. Gemini 1.5 Pro shipped with a 1M-token context window, later extended to 2M. The technical report demonstrates near-perfect "needle-in-a-haystack" recall over a 10-hour video or a complete 700,000-line codebase. The mechanism is believed to combine ring-attention-style distributed attention, MoE sparsity, and aggressive KV-cache compression, but full architectural details are not public.
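The sheer token budget explains why long context is the hard part. A back-of-envelope calculation shows why: the tokens-per-frame and frame-sampling figures below are assumptions for illustration only (the audio rate matches the $\sim$25 fps codec mentioned above), but at these rates one hour of video roughly fills a 1M window and ten hours lands near the 10M-token scale of the report's recall tests.

```python
# Back-of-envelope token budget for long video. FRAME_TOKENS and
# FRAMES_PER_SECOND are assumptions for illustration, not confirmed figures.
FRAME_TOKENS = 258           # assumed visual tokens per sampled frame
FRAMES_PER_SECOND = 1        # assumed frame sampling rate
AUDIO_TOKENS_PER_SECOND = 25 # the ~25 fps audio rate from above

def video_tokens(hours: float) -> int:
    seconds = hours * 3600
    per_second = FRAME_TOKENS * FRAMES_PER_SECOND + AUDIO_TOKENS_PER_SECOND
    return int(seconds * per_second)

print(f"{video_tokens(1):,}")   # ~1,018,800 -> one hour roughly fills 1M tokens
print(f"{video_tokens(10):,}")  # ~10,188,000 -> ten hours needs ~10M tokens
```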
Capabilities enabled by native multimodality.
- Process a 1-hour video and answer questions about a single 1-second event.
- Read a 200-page grammar of a low-resource language (Kalamang) in context and use it to translate, demonstrating in-context language acquisition.
- Answer questions posed in audio directly, preserving prosodic information (tone, emphasis) that a transcription cascade would discard.
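As a usage illustration, here is a hedged sketch of the long-video capability through the public Gemini API with the google-generativeai Python SDK. The file name is a placeholder and model identifiers change over time, so treat this as the shape of the call rather than a definitive recipe.

```python
# Hedged sketch of long-video QA via the Gemini API (google-generativeai SDK).
# "lecture_recording.mp4" is a placeholder; check current SDK docs for model ids.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video via the File API, then poll until ingestion finishes.
video = genai.upload_file(path="lecture_recording.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [video, "At what timestamp does the speaker first mention MoE routing?"]
)
print(response.text)
```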
Variants.
- Ultra / Pro / Flash / Nano. Four tiers in decreasing order of capability and cost; Nano runs on-device on Pixel phones.
- Gemini Robotics (2025). Specialised VLA variant for robot control; see separate entry.
- Gemini 2.5 Deep Think. Reasoning mode using extended chain-of-thought, comparable to OpenAI's o1/o3 line.
Significance. Gemini's main contribution to the field is empirical: it showed that a single transformer pretrained on all four modalities from scratch can match or exceed cascaded architectures, and that 1M+ context windows are practically deployable. Together with GPT-4o, it has shifted the open-source target from "VLM" to "natively multimodal model".
Related terms: Gemini 2.x, Vision-Language Model, GPT-4V and GPT-4o Vision, Transformer, Mixture of Experts
Discussed in:
- Chapter 11: CNNs, Vision-Language Models