Embodied AI is the study of intelligent agents that have a body, real or simulated, through which they perceive and act on an environment. The term re-entered mainstream AI around 2017 with simulated household environments (AI2-THOR, Habitat) and gained new momentum in the foundation-model era (2022 onwards), when researchers showed that the same transformer architectures that power vision-language models could be conditioned to emit robot actions.
Action as a modality. The unifying idea of modern embodied AI is to treat low-level robot commands (joint angles, end-effector poses, gripper open/close commands, base velocities) as additional tokens in a transformer's vocabulary. A model trained on (image, text instruction, action) triples then generates actions autoregressively, just as a language model generates text:
$$p(a_{1:H} \mid I, \ell) = \prod_{t=1}^{H} p(a_t \mid a_{<t}, I, \ell)$$
where $I$ is the current visual observation, $\ell$ is the natural-language instruction, and $a_{1:H}$ is the action chunk over the next $H$ steps. This formulation, popularised by RT-1 and RT-2, is now called a vision-language-action (VLA) model.
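To make the token-based formulation concrete, here is a minimal sketch assuming the RT-style scheme of discretising each continuous action dimension into 256 bins; `policy_logits` is a hypothetical stand-in for the VLA transformer, and the action range and dimensionality are illustrative.

```python
import numpy as np

NUM_BINS = 256            # bins per action dimension (RT-1/RT-2 use 256)
ACTION_DIM = 7            # e.g. 6-DoF end-effector delta + gripper
LOW, HIGH = -1.0, 1.0     # assumed normalised action range

def tokenize(action):
    """Map a continuous action vector to one discrete token per dimension (training-data side)."""
    bins = np.clip((action - LOW) / (HIGH - LOW), 0.0, 1.0) * (NUM_BINS - 1)
    return bins.round().astype(int)

def detokenize(tokens):
    """Invert tokenize(): recover bin centres as continuous actions."""
    return LOW + (tokens / (NUM_BINS - 1)) * (HIGH - LOW)

def policy_logits(image, instruction, prefix_tokens):
    """Hypothetical placeholder for the VLA forward pass; returns logits over action bins."""
    rng = np.random.default_rng(len(prefix_tokens))
    return rng.normal(size=NUM_BINS)

def generate_action(image, instruction):
    """Autoregressively decode one action, one dimension (token) at a time."""
    tokens = []
    for _ in range(ACTION_DIM):
        logits = policy_logits(image, instruction, tokens)
        tokens.append(int(np.argmax(logits)))      # greedy decoding
    return detokenize(np.array(tokens))

print(generate_action(image=None, instruction="pick up the red block"))
```

Decoding a chunk of $H$ actions simply repeats this inner loop $H$ times, conditioning on all previously emitted action tokens.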
Robot foundation models. A robot foundation model is a single neural network deployed across many robot embodiments (arms, mobile bases, humanoids), tasks (pick-and-place, folding laundry, opening doors), and environments (kitchens, factories, homes). The 2024-2025 wave of such models includes:
- Physical Intelligence's $\pi_0$ and $\pi_{0.5}$. Flow-matching action heads on a VLM backbone (a sampling sketch follows this list).
- Google DeepMind's RT-2, RT-X, and Gemini Robotics. VLAs built on the company's frontier VLMs, most recently Gemini.
- Stanford/Berkeley's OpenVLA. Open-source 7B-parameter VLA built on a Llama-2 backbone with a fused DINOv2 + SigLIP vision encoder (the Prismatic VLM recipe).
- Figure AI's Helix. Two-system architecture (slow planner, fast policy) for humanoids.
- Tesla's FSD-style end-to-end driving. Embodied AI for autonomous vehicles.
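The flow-matching action heads used by $\pi_0$ replace token-by-token decoding: a whole action chunk is generated by integrating a learned velocity field from Gaussian noise towards the action distribution. The sketch below assumes plain Euler integration; `velocity_field` is a toy stand-in for the learned network, and the horizon and step counts are illustrative rather than $\pi_0$'s actual settings.

```python
import numpy as np

H, ACTION_DIM = 50, 7          # action-chunk horizon and per-step dimension
NUM_STEPS = 10                 # Euler integration steps from noise (t=0) to actions (t=1)

def velocity_field(x, t, obs_embedding):
    """Toy stand-in for the learned flow-matching network v_theta(x, t | obs)."""
    return -x * (1.0 - t)      # a real model conditions on the VLM's observation embedding

def sample_action_chunk(obs_embedding):
    """Generate an H-step action chunk by integrating the flow from noise."""
    x = np.random.default_rng(0).normal(size=(H, ACTION_DIM))   # x ~ N(0, I)
    dt = 1.0 / NUM_STEPS
    for step in range(NUM_STEPS):
        t = step * dt
        x = x + dt * velocity_field(x, t, obs_embedding)         # Euler step
    return x                    # denoised actions, shape (H, ACTION_DIM)

print(sample_action_chunk(obs_embedding=None).shape)             # (50, 7)
```

The appeal is that continuous actions come out in one pass per chunk, avoiding both discretisation error and long autoregressive decoding.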
Co-training with web data. A central insight from RT-2 is that co-training on robot trajectories and internet vision-language data produces emergent generalisation: the robot can follow novel instructions involving objects it never manipulated in robot data ("pick up the extinct animal"), because the language model component bridges the conceptual gap.
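In practice, co-training amounts to interleaving robot-trajectory batches with web vision-language batches during training, so the action head is learned without overwriting the backbone's semantic knowledge. A minimal sketch of the interleaving follows; the `robot_fraction` ratio and the toy examples are illustrative, not RT-2's actual data recipe.

```python
import random

def cotraining_batches(robot_data, web_vl_data, robot_fraction=0.5, num_batches=1000):
    """Yield a mixed stream of robot-action examples and web vision-language examples."""
    for _ in range(num_batches):
        if random.random() < robot_fraction:
            yield ("robot", random.choice(robot_data))   # (image, instruction, action tokens)
        else:
            yield ("web", random.choice(web_vl_data))    # (image, text) captioning/VQA pair

robot = [("img_r", "pick up the apple", [12, 200, 31])]
web = [("img_w", "Q: which of these animals is extinct? A: the dodo")]
for source, example in cotraining_batches(robot, web, num_batches=5):
    print(source, example)
```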
Open challenges.
- Data. Robot data is expensive and hard to scale; the Open X-Embodiment dataset (2024) collects $\sim$1M trajectories across 22 robot embodiments to mitigate this.
- Reliability. VLAs achieve $\sim$80% success on familiar tasks but degrade sharply with distribution shift.
- Latency. Generating dense action sequences at 30+ Hz from a 7B-parameter VLM requires action chunking (see the sketch after this list), distillation, or specialised inference hardware.
- Simulation-to-real gap. Photorealistic simulators (Isaac Sim, Genesis) help but do not close the gap fully.
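Action chunking is the simplest of the latency mitigations above: the VLA is queried at a low rate, each call emits a short sequence of actions, and a fast loop replays them at the control rate. A minimal sketch, with illustrative rates and a placeholder `vla_inference`:

```python
import time

CONTROL_HZ = 30                # low-level control rate
CHUNK = 15                     # actions per inference call -> policy queried at ~2 Hz

def vla_inference(observation):
    """Placeholder for the slow VLA forward pass; returns CHUNK actions."""
    return [[0.0] * 7 for _ in range(CHUNK)]

def control_loop(get_observation, send_command, duration_s=1.0):
    """Run the robot at CONTROL_HZ while calling the VLA only once per CHUNK steps."""
    deadline = time.monotonic() + duration_s
    queue = []
    while time.monotonic() < deadline:
        if not queue:                              # refill the action queue
            queue = vla_inference(get_observation())
        send_command(queue.pop(0))                 # execute one action per control tick
        time.sleep(1.0 / CONTROL_HZ)

# Usage with stub callbacks (a real deployment would read cameras and drive motors):
control_loop(get_observation=lambda: None, send_command=lambda a: None, duration_s=0.1)
```

The trade-off is open-loop execution within each chunk: the robot cannot react to new observations until the queue is refilled.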
Embodied AI is widely seen as the next domain after language and vision where foundation models will dominate, with active investment from Google, Tesla, Figure, Physical Intelligence, NVIDIA, and Boston Dynamics.
Related terms: RT-1 and RT-2, OpenVLA, Pi-Zero, Gemini Robotics, Helix, Vision-Language Model, Reinforcement Learning
Discussed in:
- Chapter 16: Ethics & Safety, Embodied AI