PaLM-E ("Pathways Language Model, Embodied") is the embodied multimodal model from Google Research introduced by Driess, Xia, Sajjadi et al. in March 2023. At 562B parameters (combining PaLM-540B with a 22B ViT vision encoder), PaLM-E was the largest VLM at the time of release and the first explicit fusion of a frontier LM with robot state inputs.
Architecture. PaLM-E embeds all modalities into the PaLM language model's token space:
- Image tokens. A ViT-22B encoder produces patch features; a learned projection maps each patch to a token embedding, which is inserted in the prompt at the position where the image appeared.
- State tokens. Continuous robot state vectors (joint angles, end-effector pose) are likewise projected into the token-embedding space via a learned MLP.
- Text tokens. Standard SentencePiece tokenisation.
The full input is a single sequence of mixed tokens that PaLM processes with its standard decoder-only attention. Outputs are generated as text: either natural language or textual sub-goals and plans that low-level policies execute as robot actions.
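To make the interleaving concrete, here is a minimal sketch in plain NumPy. All dimensions, token ids, and weights are hypothetical stand-ins (the real projections are learned end-to-end and PaLM's embedding width is larger); only the mechanism, projecting every modality into one token-embedding space and concatenating in prompt order, reflects the architecture described above.

```python
import numpy as np

D_MODEL = 4096       # hypothetical LM embedding width
N_PATCHES = 256      # hypothetical number of ViT patch features per image
D_PATCH = 1024       # hypothetical ViT feature width
D_STATE = 14         # hypothetical robot state dim (joint angles + EE pose)

rng = np.random.default_rng(0)

# Learned projections (random stand-ins here): each modality is mapped into
# the same space as the LM's token embeddings.
W_img = rng.normal(0, 0.02, (D_PATCH, D_MODEL))
W1 = rng.normal(0, 0.02, (D_STATE, 256))
W2 = rng.normal(0, 0.02, (256, D_MODEL))
embed_table = rng.normal(0, 0.02, (32000, D_MODEL))   # stand-in text embeddings

def embed_text(token_ids):
    return embed_table[token_ids]                      # (T, D_MODEL)

def embed_image(patch_feats):
    return patch_feats @ W_img                         # (N_PATCHES, D_MODEL)

def embed_state(state_vec):
    h = np.maximum(state_vec @ W1, 0.0)                # small MLP, per the text
    return (h @ W2)[None, :]                           # (1, D_MODEL)

# Assemble the mixed sequence in prompt order, e.g.
# "Given <img> and robot state <state>, what should the robot do next?"
prefix = embed_text(np.array([5, 17, 42]))             # stand-in token ids
img = embed_image(rng.normal(size=(N_PATCHES, D_PATCH)))
state = embed_state(rng.normal(size=(D_STATE,)))
suffix = embed_text(np.array([99, 7]))

sequence = np.concatenate([prefix, img, state, suffix], axis=0)
print(sequence.shape)   # (3 + 256 + 1 + 2, 4096): fed to the decoder as-is
```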
Training. PaLM-E is trained on a mixture of:
- PaLM's web-scale text corpus (preserved via mixed-batch training).
- WebLI multilingual image-text data.
- Robot manipulation trajectories (from RT-1 and TAMP-style mobile manipulation).
- Visual question answering and captioning datasets.
Training uses standard next-token cross-entropy on mixed batches, with no special embodied loss; the model learns embodiment simply by predicting the next token in trajectory data.
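A minimal sketch of what such mixed-batch sampling could look like. The mixture weights below are invented for illustration (the paper tunes the real proportions), and the data loaders are dummy iterators; the point is that every example, embodied or not, flows through the same next-token objective.

```python
import itertools
import random

# Hypothetical mixture weights over the co-training sources; the real
# proportions are tuned in the paper and are not reproduced here.
MIXTURE = {
    "web_text": 0.5,      # PaLM's original text corpus (guards language ability)
    "webli": 0.3,         # WebLI image-text pairs
    "robot_traj": 0.1,    # robot manipulation trajectories
    "vqa_caption": 0.1,   # VQA and captioning data
}

def sample_batch(datasets, batch_size):
    """Fill one batch by drawing from the sources in proportion to MIXTURE."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    # Each drawn item is a mixed token sequence; the loss is the same
    # next-token cross-entropy regardless of which source it came from.
    return [next(datasets[random.choices(names, weights=weights)[0]])
            for _ in range(batch_size)]

# Dummy infinite iterators standing in for the real data loaders.
datasets = {name: itertools.count() for name in MIXTURE}
print(sample_batch(datasets, batch_size=8))
```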
Capabilities. PaLM-E demonstrated three notable behaviours:
- Long-horizon mobile manipulation. Given an instruction like "bring me the rice chips from the drawer", PaLM-E plans a sequence of skills (navigate, open drawer, pick, navigate, place) and, using vision to detect when a step has failed, re-plans and retries (a toy sketch of this loop follows this list).
- Positive transfer. Co-training on web vision-language data, robot data, and pure language data produced a better robot policy than robot data alone, evidence of positive transfer across modalities that foreshadowed RT-2's findings.
- Retained language. Unlike fine-tuned VLAs that often forget language ability, PaLM-E retained PaLM's chain-of-thought reasoning and could discuss physics, write poetry, and answer trivia while embodied.
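The plan/re-plan loop mentioned above, as a fully toy sketch: the planner, skill list, and environment here are invented stand-ins. In the real system the planning step is a PaLM-E forward pass over text + image (+ state) tokens that decodes the next sub-goal as text, and failure detection falls out of the fact that the model sees a fresh image each step.

```python
SKILLS = ["navigate to drawer", "open drawer", "pick rice chips",
          "navigate to user", "place rice chips", "done"]

class ToyEnv:
    """Toy environment where the first grasp attempt fails."""
    def __init__(self):
        self.completed = []
        self.grasp_attempts = 0

    def observe(self):
        # Stand-in for a camera image: report what has visibly succeeded.
        return list(self.completed)

    def execute(self, skill):
        if skill == "pick rice chips":
            self.grasp_attempts += 1
            if self.grasp_attempts == 1:
                return  # dropped the object: nothing changes in the scene
        self.completed.append(skill)

def plan_next(instruction, observation):
    # Toy planner: emit the first skill the observation shows as not yet done.
    # Because the failed grasp is visible in the new observation, the same
    # skill is simply re-issued, which is the re-planning behaviour.
    for skill in SKILLS:
        if skill != "done" and skill not in observation:
            return skill
    return "done"

env = ToyEnv()
instruction = "bring me the rice chips from the drawer"
for step in range(10):
    skill = plan_next(instruction, env.observe())
    print(f"step {step}: {skill}")
    if skill == "done":
        break
    env.execute(skill)
```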
Significance. PaLM-E was an existence proof that frontier LMs can be embodied without catastrophic forgetting, and that scale matters: the 562B variant outperformed the 12B variant by wide margins on robot tasks, suggesting that embodied capability inherits language-model scaling behaviour. Its conceptual lineage runs directly into RT-2, which made the action-as-token idea more efficient, and into Gemini Robotics, which applies essentially the same recipe on top of Gemini.
Related terms: RT-1 and RT-2, Vision-Language Model, Embodied AI, Vision Transformer, Gemini Robotics
Discussed in:
- Chapter 16: Ethics & Safety, Embodied AI