Glossary

OpenVLA

OpenVLA is an open-source vision-language-action (VLA) model released in June 2024 by Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, and colleagues at Stanford, UC Berkeley, Google DeepMind, MIT, and the Toyota Research Institute. It was the first frontier-quality VLA released with weights, code, and a fine-tuning pipeline for community use.

Architecture. OpenVLA inherits the Prismatic VLM recipe:

  1. Vision encoder. Patch features from DINOv2 and SigLIP vision transformers, concatenated channel-wise. DINOv2 contributes spatial localisation; SigLIP contributes semantic grounding.
  2. Projector. A two-layer MLP maps fused vision features to Llama-2-7B's token-embedding dimension.
  3. Language model. A Llama-2-7B backbone whose autoregressive next-token prediction serves as the action policy.
  4. Action head. Following RT-2, each continuous action dimension is quantised into 256 bins, and the bins are represented by overwriting the 256 least-used tokens in Llama's vocabulary. The model emits actions as text tokens that a small parser converts back into a 7-DoF continuous action (see the tokenisation sketch after this list).
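
A minimal sketch of this discretisation scheme, assuming fixed action limits and a Llama-2-sized vocabulary purely for illustration; the real bin ranges come from per-dataset action statistics, and the token mapping here merely stands in for "the 256 least-used tokens".

```python
import numpy as np

N_BINS = 256          # bins per action dimension
VOCAB_SIZE = 32000    # Llama-2 vocabulary size
ACTION_DIM = 7        # e.g. 6-DoF end-effector delta + gripper

def actions_to_tokens(action, low, high):
    """Quantise a continuous 7-D action into 7 token ids in the reserved range."""
    action = np.clip(action, low, high)
    bins = np.floor((action - low) / (high - low) * (N_BINS - 1)).astype(int)
    # Map bin indices onto the last 256 vocabulary slots (stand-in for "least used").
    return VOCAB_SIZE - N_BINS + bins

def tokens_to_actions(tokens, low, high):
    """Invert the mapping: token ids back to bin centres in action space."""
    bins = np.asarray(tokens) - (VOCAB_SIZE - N_BINS)
    return low + (bins + 0.5) / N_BINS * (high - low)

# Round-trip example: small end-effector deltas plus an open-gripper command.
low, high = -np.ones(ACTION_DIM), np.ones(ACTION_DIM)
a = np.array([0.02, -0.01, 0.05, 0.0, 0.0, 0.1, 1.0])
tokens = actions_to_tokens(a, low, high)
print(tokens, tokens_to_actions(tokens, low, high))
```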

Training. OpenVLA is trained on $\sim$970,000 robot trajectories from the Open X-Embodiment dataset (the 2023 community pooling of robot data across 22 embodiments, 21 institutions, and 311 scenes). Training runs for $\sim$27 epochs over $\sim$14 days on 64 A100s.

Mathematical objective. Cross-entropy on action-token prediction:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{H} \log p_\theta(a_t \mid I_t, \ell)$$

where $a_t$ is the sequence of 7 quantised bin-tokens at timestep $t$ (one per action dimension), $I_t$ is the image observation, $\ell$ is the language instruction, and $H$ is the trajectory length.
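
In code, this is an ordinary token-level cross-entropy restricted to the positions that hold action tokens. The PyTorch sketch below assumes the backbone has already produced logits for the 7 action slots; the shapes and placeholder tensors are illustrative, not OpenVLA's actual training code.

```python
import torch
import torch.nn.functional as F

vocab_size, action_dim = 32000, 7

# Placeholder model output for the 7 action-token positions of one timestep.
logits = torch.randn(1, action_dim, vocab_size)
# Ground-truth quantised bins, living in the reserved 256-token range.
target_tokens = torch.randint(vocab_size - 256, vocab_size, (1, action_dim))

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),   # (7, vocab)
    target_tokens.reshape(-1),        # (7,)
)
print(loss)
```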

Performance. Out of the box, OpenVLA matches or exceeds RT-2-X (55B parameters, closed-source) on the WidowX and Google Robot evaluation suites, despite being 7$\times$ smaller. With LoRA fine-tuning on $\sim$10–50 demonstrations, it adapts to new tasks within hours on a single GPU.
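
A hedged sketch of setting up such a LoRA fine-tune with Hugging Face Transformers and `peft`. The `openvla/openvla-7b` model id matches the public release; the rank, alpha, and `target_modules` values are illustrative defaults rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the released 7B policy (remote code supplies the OpenVLA wrapper).
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Wrap every linear layer with a low-rank adapter; only these weights train.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()   # a small fraction of the 7B base parameters
```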

Practical impact. OpenVLA quickly became the default open-source baseline for embodied AI research. It enabled:

  • Rapid replication and ablation of RT-2-style results outside Google (see the inference sketch after this list).
  • LoRA fine-tuning on consumer hardware (24GB VRAM).
  • Direct comparison with closed-source VLAs like Gemini Robotics and $\pi_0$.
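
As a concrete example of that replication workflow, here is a minimal inference sketch following the public OpenVLA model card on Hugging Face. The `predict_action` helper and `unnorm_key` argument come from the model's remote code as documented there; treat this as a sketch against that interface, not a guaranteed-stable API, and note the image path is hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("frame.png")   # current camera observation (hypothetical file)
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-DoF continuous action after de-tokenisation and un-normalisation.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```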

Successors. OpenVLA-OFT (2025) replaces the discrete-token head with a continuous action head and improves throughput; CogACT introduces diffusion-based action heads; $\pi_0$ uses flow-matching. The trend is clear: discrete tokenisation was a useful bootstrap, but continuous action representations win for high-frequency control.

Related terms: RT-1 and RT-2, Pi-Zero, Embodied AI, Vision-Language Model, CLIP, Gemini Robotics
