RT-1 ("Robotic Transformer 1") and RT-2 are the Google DeepMind robot policies that established the vision-language-action (VLA) paradigm. Together they showed that the same transformer recipe used for language and vision can be retargeted at robot control.
RT-1 (Brohan et al. 2022). A 35M-parameter transformer trained on $\sim$130,000 demonstrations collected over 17 months across 13 robots in 3 office kitchens. Inputs: a short image history plus a language instruction. Outputs: discretised end-effector pose deltas (256 bins per dimension) plus a base velocity command. RT-1's headline result was generalisation to new objects, instructions, and rooms, exceeding prior task-specific baselines. RT-1 was open-sourced and seeded the Open X-Embodiment dataset.
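The per-dimension discretisation is plain uniform binning. The sketch below illustrates it; only the 256-bin count comes from the text above, while the ±10 cm range and the function names are illustrative assumptions rather than the released RT-1 configuration.

```python
import numpy as np

N_BINS = 256  # RT-1 discretises each action dimension into 256 bins

def discretise(value: float, low: float, high: float, n_bins: int = N_BINS) -> int:
    """Map a continuous action value onto an integer bin index."""
    value = float(np.clip(value, low, high))
    return round((value - low) / (high - low) * (n_bins - 1))

def undiscretise(bin_idx: int, low: float, high: float, n_bins: int = N_BINS) -> float:
    """Recover the continuous value at the centre of a bin."""
    return low + bin_idx / (n_bins - 1) * (high - low)

# Illustrative example: a +5 cm end-effector delta with an assumed +/-10 cm range.
b = discretise(0.05, low=-0.10, high=0.10)   # -> 191
x = undiscretise(b, low=-0.10, high=0.10)    # -> ~0.0498 m (error under half a bin)
```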
RT-2 (Brohan et al. 2023). Replaces RT-1's purpose-built transformer with a fine-tuned PaLI-X (55B) or PaLM-E (12B) vision-language model. The crucial trick is to represent each action dimension as a text token in the vocabulary:
$$a_t = \text{detokenise}(\text{LM}(I_t, \ell_t)) \in \mathbb{R}^8$$
where $I_t$ is the current image and $\ell_t$ the instruction. The model emits, e.g., "1 128 91 241 5 101 127 3" as a string of eight integer bins (an episode-terminate flag, 6-DoF end-effector deltas, and a gripper command), and a small parser maps the tokens back to floating-point actions.
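A minimal sketch of the decoding side of this trick, assuming uniform bins mapped to a symmetric $[-1, 1]$ range; the real per-dimension calibration and vocabulary belong to the released models and are not reproduced here.

```python
def parse_action(token_string: str, n_bins: int = 256):
    """Turn the emitted integer-token string into (terminate_flag, continuous_action).

    Assumed layout: [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper],
    with each of the last seven bins mapped to a value in [-1, 1].
    """
    bins = [int(tok) for tok in token_string.split()]
    terminate = bins[0]
    continuous = [2.0 * b / (n_bins - 1) - 1.0 for b in bins[1:]]
    return terminate, continuous

terminate, action = parse_action("1 128 91 241 5 101 127 3")
# terminate == 1; action is a 7-dim vector of normalised deltas plus a gripper command
```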
Crucially, RT-2 is co-fine-tuned on robot trajectories and the original VLM's web-scale image-text data. This preserves the VLM's semantic knowledge: RT-2 succeeds on instructions like "put the strawberry into the correct bowl" (where "correct" means matching the strawberry's colour) or "pick up the extinct animal" (selecting a plastic dinosaur), tasks that contain no analogue in the robot training set.
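Co-fine-tuning is, at heart, a data-mixture choice: robot episodes rendered as (image, instruction, action-string) examples are interleaved with the VLM's original image-text examples, so the same next-token loss is never trained on robot data alone. A hypothetical mixing loop, with the 50/50 ratio being an assumption rather than the paper's schedule:

```python
import random

def cofinetune_stream(web_vqa_examples, robot_examples, robot_fraction=0.5):
    """Yield a mixture of web image-text and robot action examples.

    Both kinds of example share the same (image, text prompt, text target)
    interface, so a single language-modelling objective covers both.
    """
    while True:
        pool = robot_examples if random.random() < robot_fraction else web_vqa_examples
        yield random.choice(pool)
```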
Quantitative results. On held-out evaluations, RT-2 roughly doubled RT-1's success rate on unseen conditions (from $\sim$32% to $\sim$62%), spanning three classes of generalisation:
- Unseen objects. "Pick up the rabbit" with a stuffed rabbit never seen in robot data.
- Unseen backgrounds. Same task in a new room.
- Unseen instructions. Multi-hop reasoning, e.g. "move the apple to the soccer team" (selecting Real Madrid memorabilia).
Chain-of-thought robotics. A variant of RT-2 fine-tuned for chain-of-thought reasoning first emits a natural-language plan ("first I will grasp the bottle, then move it to the trash") before emitting action tokens. This improved success on long-horizon, multi-step instructions.
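The chain-of-thought variant only changes the output format: a plan string precedes the action tokens, and the controller strips the plan before decoding. A hypothetical parser, assuming "Plan:"/"Action:" markers (the actual delimiters are whatever the fine-tuning data used):

```python
def split_plan_and_action(output: str):
    """Separate the natural-language plan from the trailing action-token string."""
    plan_part, _, action_part = output.partition("Action:")
    return plan_part.replace("Plan:", "").strip(), action_part.strip()

plan, action_str = split_plan_and_action(
    "Plan: first I will grasp the bottle, then move it to the trash. "
    "Action: 1 128 91 241 5 101 127 3"
)
# plan       -> "first I will grasp the bottle, then move it to the trash."
# action_str -> "1 128 91 241 5 101 127 3" (decoded with the same parser as above)
```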
Legacy. RT-2's "actions as text tokens" insight is the conceptual foundation of OpenVLA, Gemini Robotics, $\pi_0$ (which prefers continuous flow-matching heads but inherits the VLM backbone), and Helix. RT-1's data was folded into Open X-Embodiment, the largest cross-embodiment robot dataset to date.
Related terms: Embodied AI, OpenVLA, Gemini Robotics, PaLM-E, PaLI and PaLI-3, Vision-Language Model, Pi-Zero
Discussed in:
- Chapter 16: Ethics & Safety, Embodied AI