Glossary

PaLM-E

PaLM-E ("Pathways Language Model, Embodied") is the embodied multimodal model from Google Research introduced by Driess, Xia, Sajjadi et al. in March 2023. At 562B parameters (combining PaLM-540B with a 22B ViT vision encoder), PaLM-E was the largest VLM at the time of release and the first explicit fusion of a frontier LM with robot state inputs.

Architecture. PaLM-E embeds all modalities into the PaLM language model's token space:

  1. Image tokens. A ViT-22B encoder produces patch features; a learned projection maps each patch to a token embedding, which is inserted in the prompt at the position where the image appeared.
  2. State tokens. Continuous robot state vectors (joint angles, end-effector pose) are likewise projected into the token-embedding space via a learned MLP.
  3. Text tokens. Standard SentencePiece tokenisation.

The full input is a single sequence of mixed tokens that PaLM processes with its standard decoder-only attention. Outputs can be either natural language or token sequences that decode into robot actions or sub-goals.
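The three embedding routes above can be sketched in a few lines. This is a toy NumPy illustration of the idea only: the embedding width, patch-feature size, projection matrices, and token IDs are all hypothetical stand-ins for PaLM-E's learned components, not the real model.

```python
import numpy as np

D = 8  # hypothetical token-embedding width (PaLM-E's real width is far larger)
rng = np.random.default_rng(0)

# Hypothetical learned projections: patch features and state vectors are
# mapped into the same embedding space as ordinary text tokens.
W_img = rng.normal(size=(16, D))    # ViT patch feature (16-d here) -> embedding
W_state = rng.normal(size=(4, D))   # robot state (4 joint angles here) -> embedding
embed_table = rng.normal(size=(100, D))  # toy text-token embedding table

def embed_text(token_ids):
    return embed_table[token_ids]            # (n_tokens, D)

def embed_image(patch_feats):
    return patch_feats @ W_img               # one embedding per patch

def embed_state(state_vec):
    return (state_vec @ W_state)[None, :]    # a single state "token"

# Interleave modalities in prompt order, exactly where they appeared in the text.
seq = np.concatenate([
    embed_text(np.array([5, 17])),           # "Given <img> ..."
    embed_image(rng.normal(size=(3, 16))),   # 3 image patches
    embed_text(np.array([42])),
    embed_state(np.array([0.1, -0.2, 0.3, 0.0])),
])
print(seq.shape)  # (7, 8): one mixed sequence for the decoder-only LM
```

The decoder then attends over `seq` exactly as it would over plain text, which is why no architectural change to PaLM itself is needed.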

Training. PaLM-E is trained on a mixture of:

  • PaLM's web-scale text corpus (preserved via mixed-batch training).
  • WebLI multilingual image-text data.
  • Robot manipulation trajectories (from RT-1 and TAMP-style mobile manipulation).
  • Visual question answering and captioning datasets.

Training uses standard cross-entropy on mixed batches with no special embodied loss; the model learns embodiment simply by predicting the next token in trajectory data.
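Concretely, the same next-token cross-entropy is applied whether a batch row came from web text or from a robot trajectory. A minimal NumPy sketch, assuming a toy vocabulary that contains both text tokens and discretised action tokens (all sizes and data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 50  # toy vocabulary: text tokens plus discretised action/sub-goal tokens

def next_token_xent(logits, targets):
    # standard softmax cross-entropy over next-token predictions
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Mixed batch: some positions come from web text, some from robot trajectories.
# The loss treats both identically -- there is no separate "embodied" objective.
web_logits = rng.normal(size=(4, V))
web_targets = rng.integers(0, V, size=4)
robot_logits = rng.normal(size=(4, V))
robot_targets = rng.integers(0, V, size=4)

loss = next_token_xent(
    np.concatenate([web_logits, robot_logits]),
    np.concatenate([web_targets, robot_targets]),
)
print(float(loss))  # a single scalar loss over the mixed batch
```

Mixing text-only batches in alongside embodied data is what preserves PaLM's original language ability during fine-tuning.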

Capabilities. PaLM-E demonstrated three notable behaviours:

  1. Long-horizon mobile manipulation. Given an instruction like "bring me the rice chips from the drawer", PaLM-E plans the sequence of skills (navigate, open drawer, pick, navigate, place) and re-plans when vision indicates a skill has failed.
  2. Positive transfer. Co-training on web vision-language data, robot data, and pure language data improved the robot policy beyond what robot data alone provided, foreshadowing RT-2's findings on cross-modal transfer.
  3. Retained language. Unlike fine-tuned VLAs that often forget language ability, PaLM-E retained PaLM's chain-of-thought reasoning and could discuss physics, write poetry, and answer trivia while embodied.
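The plan-execute-replan loop described in behaviour 1 can be sketched as a small control loop. This is a hypothetical illustration, not PaLM-E's actual interface: the skill names, the `plan` stub, and the vision-based `check_success` callback are all invented for the example.

```python
# Hypothetical closed-loop planner: decompose an instruction into skills,
# execute each, and retry/re-plan when a (vision-based) success check fails.
def plan(instruction):
    # Stand-in for the LM's skill decomposition of the instruction.
    return ["navigate_to_drawer", "open_drawer", "pick_rice_chips",
            "navigate_to_user", "place"]

def execute_with_replanning(instruction, execute, check_success, max_retries=3):
    steps = plan(instruction)
    for skill in steps:
        execute(skill)
        retries = 0
        while not check_success(skill) and retries < max_retries:
            execute(skill)  # re-attempt the failed skill
            retries += 1
    return steps

# Toy environment: "open_drawer" fails once, then succeeds.
log = []
fail_once = {"open_drawer"}

def execute(skill):
    log.append(skill)

def check_success(skill):
    if skill in fail_once:
        fail_once.discard(skill)
        return False
    return True

execute_with_replanning("bring me the rice chips", execute, check_success)
print(log.count("open_drawer"))  # 2: first attempt fails, the retry succeeds
```

In PaLM-E itself, both `plan` and `check_success` are served by the same model, conditioning on current camera images rather than a scripted failure set.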

Significance. PaLM-E was an existence proof that frontier LMs can be embodied without catastrophic forgetting, and that scale matters: its 562B variant outperformed the 12B variant by wide margins on robot tasks, suggesting embodied capability inherits language scaling laws. Its conceptual lineage runs directly into RT-2, which made the action-as-token idea more efficient, and into Gemini Robotics, which is essentially PaLM-E re-implemented on Gemini.

Related terms: RT-1 and RT-2, Vision-Language Model, Embodied AI, Vision Transformer, Gemini Robotics

This site is currently in Beta. Contact: Chris Paton
