Pi-Zero, Glossary, Textbook of AI

$\pi_0$ ("pi-zero") is a general-purpose robot foundation model, a vision-language-action (VLA) model, released by Physical Intelligence in October 2024 (Black, Brown, Driess et al.); its successor, $\pi_{0.5}$, followed in April 2025. $\pi_0$ was the first publicly demonstrated VLA to fold laundry, bus restaurant tables, and assemble cardboard boxes end-to-end, on multiple robot embodiments, at control rates of up to 50 Hz.

Architecture. $\pi_0$ has two coupled components:

VLM backbone. A pretrained PaliGemma (3B-parameter Google open-weights VLM, descended from PaLI) encodes the current images and language instruction into a sequence of token embeddings.
Action expert. A separate transformer (300M parameters) processes the same token stream plus a noised action chunk and the current robot state. It is trained with flow matching rather than diffusion or autoregression: given a target action chunk $a^* \in \mathbb{R}^{H \times d}$ ($H$ time steps, $d$ action dimensions) and noise $\epsilon \sim \mathcal{N}(0, I)$, define $$a_t = (1 - t) \epsilon + t a^*$$ for $t \in [0, 1]$, so that $a_0$ is pure noise and $a_1$ is the target chunk. The action expert predicts the velocity field $$v_\theta(a_t, t, c) \approx a^* - \epsilon,$$ where $c$ denotes the context (image and language embeddings plus robot state), by minimising $$\mathcal{L}_{\text{FM}} = \mathbb{E}_{a^*, \epsilon, t} \left\| v_\theta(a_t, t, c) - (a^* - \epsilon) \right\|_2^2.$$
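The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released training code: `velocity_fn` is a hypothetical stand-in for the action expert, and sampling $t$ uniformly is an assumption, since the text above does not say how $t$ is drawn.

```python
import numpy as np

def flow_matching_loss(velocity_fn, a_star, context, rng):
    """Monte Carlo estimate of L_FM for a batch of action chunks.

    a_star:      array of shape (B, H, d), target action chunks.
    velocity_fn: hypothetical callable (a_t, t, context) -> array (B, H, d).
    """
    batch = a_star.shape[0]
    eps = rng.standard_normal(a_star.shape)        # eps ~ N(0, I)
    # One time value per example; uniform sampling is an assumption.
    t = rng.uniform(0.0, 1.0, size=(batch, 1, 1))
    a_t = (1.0 - t) * eps + t * a_star             # point on the noise-to-data path
    target = a_star - eps                          # velocity of that straight path
    pred = velocity_fn(a_t, t, context)
    # Squared L2 norm per chunk, averaged over the batch.
    return np.mean(np.sum((pred - target) ** 2, axis=(1, 2)))
```

As a sanity check, for a single known target the velocity $(a^* - a_t)/(1 - t)$ equals $a^* - \epsilon$ exactly, so plugging it in as `velocity_fn` should drive the loss to numerical zero.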

At inference, the velocity field is integrated from $t = 0$ (pure noise) to $t = 1$ in 10 steps, producing one chunk of $H = 50$ actions, which is $\sim$1 second of motion at 50 Hz.
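The integration can be sketched with a forward Euler loop. The integrator choice, the function name `sample_action_chunk`, the callable `velocity_fn`, and the default `action_dim=7` are illustrative assumptions; only the 10 steps and $H = 50$ come from the text.

```python
import numpy as np

def sample_action_chunk(velocity_fn, context, horizon=50, action_dim=7,
                        num_steps=10, rng=None):
    """Integrate a velocity field from t=0 (noise) to t=1 (actions).

    velocity_fn: hypothetical callable (a, t, context) -> array (horizon, action_dim).
    action_dim:  illustrative value; the real dimension depends on the robot.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((horizon, action_dim))   # a_0 = eps ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        a = a + dt * velocity_fn(a, t, context)      # forward Euler step
    return a                                         # one chunk of `horizon` actions
```

Each call costs `num_steps` forward passes of the action expert and yields the whole chunk at once; executing 50 actions at 50 Hz covers one second of motion.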

Why flow matching for actions? Actions are continuous, multimodal (many valid trajectories exist for the same task), and must be produced at a high control frequency. Tokenised autoregressive heads (RT-2 style) emit one discretised bin per action dimension per time step, which is slow to decode and limits precision to the bin width. Diffusion policies handle multimodality but typically need many denoising steps. Flow matching is fast (10 steps) and naturally multimodal.
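A rough count of sequential network passes makes the speed argument concrete. This is back-of-the-envelope arithmetic under stated assumptions: `action_dim=14` is a hypothetical value (for example, two 7-DoF arms), and the count ignores differences in cost per pass.

```python
def sequential_passes_per_chunk(horizon, action_dim, flow_steps):
    """Count sequential forward passes needed to produce one action chunk."""
    # Autoregressive head: one token per action dimension per time step.
    autoregressive = horizon * action_dim
    # Flow matching: each integration step updates the whole chunk at once.
    flow = flow_steps
    return autoregressive, flow

# Hypothetical 14-dimensional action space, 50-step chunk, 10 flow steps.
ar, fm = sequential_passes_per_chunk(horizon=50, action_dim=14, flow_steps=10)
print(ar, fm)  # 700 10
```

Under these assumptions the autoregressive head needs 700 sequential decoding steps for one second of motion, against 10 for flow matching.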

Training data. $\pi_0$ is pretrained on $\sim$10,000 hours of robot demonstrations across 7 robot embodiments (UR5e, Franka, bimanual ARX, Trossen, mobile manipulators), then post-trained on task-specific datasets. Co-training preserves the VLM's web knowledge.

$\pi_{0.5}$ contribution. Extends $\pi_0$ with open-world generalisation: deployed in homes the robot has never entered, completing tasks like cleaning a kitchen with previously unseen objects and layouts. The improvement comes from larger and more diverse training data, plus a hierarchical inference scheme where the VLM emits a high-level plan first.

Significance. $\pi_0$ established flow matching as a competitive third option (alongside autoregression and diffusion) for action generation, and is widely cited as the strongest general-purpose VLA at the time of its release. Other groups have pursued similar generalist-policy recipes (HPT, RDT, GR00T).

Discussed in:

Chapter 16: Ethics & Safety, Embodied AI

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).