Diffusion policy, introduced by Chi et al. in 2023, applies the denoising-diffusion generative modelling framework to robot action generation. Rather than predicting a single action $a$ given an observation $o$ (as standard behaviour cloning does), the policy generates samples from the conditional action distribution $\pi(a \mid o)$ by iteratively denoising Gaussian noise.
The setup: an expert demonstrates trajectories, producing a dataset of (observation, action-sequence) pairs. The action is typically a short horizon (8--16 timesteps) of low-dimensional motor commands or end-effector poses, denoted $a^0$. A forward noising process gradually corrupts $a^0$ over $K$ steps:
$$a^k = \sqrt{\bar\alpha_k}\, a^0 + \sqrt{1 - \bar\alpha_k}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
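A minimal sketch of this forward-noising step, assuming a standard linear $\beta$ noise schedule (the schedule itself is a design choice not fixed by the formula above) and PyTorch tensors of action sequences:

```python
import torch

K = 100                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, K)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_k = prod_{i<=k} (1 - beta_i)

def noise_actions(a0, k):
    """Corrupt clean action sequences a0 to diffusion step k.

    a0: (batch, horizon, action_dim) expert action sequences
    k:  (batch,) integer diffusion steps in [0, K)
    Returns the noised actions a^k and the noise epsilon that was added.
    """
    eps = torch.randn_like(a0)
    ab = alpha_bar[k].view(-1, 1, 1)           # broadcast over horizon and action dims
    return ab.sqrt() * a0 + (1.0 - ab).sqrt() * eps, eps
```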
A neural network $\epsilon_\theta(a^k, k, o)$, typically a U-Net or transformer conditioned on the observation $o$, is trained to predict the noise added at step $k$:
$$\mathcal{L}(\theta) = \mathbb{E}_{a^0, \epsilon, k, o}\!\left[\|\epsilon - \epsilon_\theta(a^k, k, o)\|^2\right]$$
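One training step of this objective might look as follows, reusing `noise_actions` and the schedule from the sketch above, and assuming a hypothetical noise-prediction network `eps_model(a_k, k, obs)`:

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(eps_model, a0, obs):
    """Noise-prediction loss for one batch of (observation, action-sequence) pairs."""
    k = torch.randint(0, K, (a0.shape[0],), device=a0.device)  # random diffusion step per sample
    a_k, eps = noise_actions(a0, k)                            # forward-noise the expert actions
    eps_pred = eps_model(a_k, k, obs)                          # epsilon_theta(a^k, k, o)
    return F.mse_loss(eps_pred, eps)                           # ||eps - eps_theta||^2
```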
At inference, a pure-noise action sequence $a^K \sim \mathcal{N}(0, I)$ is iteratively denoised with the learned reverse process to produce a clean action sequence $a^0$ conditioned on the current observation. DDIM-style samplers reduce the number of denoising steps from hundreds to roughly ten, making real-time control feasible.
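A sketch of a deterministic DDIM-style sampler ($\eta = 0$) with a small number of steps, under the same assumed schedule and network interface as above:

```python
@torch.no_grad()
def sample_actions(eps_model, obs, shape, n_steps=10):
    """Denoise a^K ~ N(0, I) into a clean action sequence a^0, conditioned on obs.

    shape: (batch, horizon, action_dim)
    """
    steps = torch.linspace(K - 1, 0, n_steps).long()  # evenly spaced subset of training steps
    a = torch.randn(shape)                            # start from pure Gaussian noise
    for i, k in enumerate(steps):
        kk = torch.full((shape[0],), int(k), dtype=torch.long)
        eps = eps_model(a, kk, obs)
        ab_k = alpha_bar[k]
        a0_pred = (a - (1.0 - ab_k).sqrt() * eps) / ab_k.sqrt()  # estimate of the clean actions
        if i + 1 < len(steps):
            ab_prev = alpha_bar[steps[i + 1]]
            a = ab_prev.sqrt() * a0_pred + (1.0 - ab_prev).sqrt() * eps  # jump to the next step
        else:
            a = a0_pred                               # final clean action sequence a^0
    return a
```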
The key advantage over standard behaviour cloning is robustness to multimodal action distributions. If the demonstrations show two valid ways to push a block (from the left side or the right side), an MSE-trained policy averages them into an action that pushes neither side, and a unimodal Gaussian policy collapses to a single mode. A diffusion policy represents both modes, sampling left or right depending on the initial noise, and each rollout commits to one coherent mode because the policy has captured the full distribution rather than a single mean. A second advantage is action-sequence generation: predicting a horizon of actions at once yields smoother, more coherent trajectories than per-step prediction, akin to the receding-horizon planning of MPC.
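A toy one-dimensional illustration of the averaging failure mode (a made-up example, not from the original paper):

```python
import numpy as np

# Half the demonstrations push the block from the left (-1), half from the right (+1),
# all from the same observation.
demo_actions = np.array([-1.0] * 50 + [+1.0] * 50)

# The MSE-optimal single prediction is the mean: ~0, i.e. push neither side.
print(demo_actions.mean())               # 0.0

# A sampler over the demonstrated distribution (what a diffusion policy approximates)
# returns one of the two valid modes on each draw.
rng = np.random.default_rng(0)
print(rng.choice(demo_actions, size=5))  # e.g. [ 1. -1.  1.  1. -1.]
```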
Diffusion policy outperformed prior behaviour-cloning methods (LSTM-GMM, IBC, BET) by an average of roughly 47% across the manipulation tasks evaluated in the original paper, which range from simulated block pushing to real-robot sauce pouring and spreading. Subsequent work extended the idea: 3D Diffusion Policy uses point-cloud observations; RDT-1B and $\pi_0$ (the latter using the closely related flow-matching objective) scale this recipe to billion-parameter foundation models trained on large cross-embodiment robot datasets; Diffusion Forcing unifies sequence-level and step-level diffusion. Across this literature, the diffusion-policy formulation has become a dominant action representation in modern imitation learning for robots.
The conceptual connection to image-generation diffusion models is direct: the same training and sampling machinery, applied to action sequences instead of pixels. The connection to model predictive control is also close: both produce short-horizon action plans that are executed only partially before replanning from the latest observation. The combination of diffusion-based action generation, transformer encoders, and large-scale teleoperation datasets is the architectural backbone of the current wave of generalist robot policies.
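A sketch of that receding-horizon execution loop, assuming a hypothetical `env` interface and the `sample_actions` function from the sampler sketch above:

```python
def control_loop(env, eps_model, horizon=16, execute_steps=8):
    """Predict a full action horizon, execute a prefix, then replan."""
    obs = env.reset()
    while not env.done():
        # Sample a full action sequence conditioned on the current observation.
        plan = sample_actions(eps_model, obs, shape=(1, horizon, env.action_dim))
        for a in plan[0, :execute_steps]:  # execute only the first few actions
            obs = env.step(a)              # then replan from the fresh observation
```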
Related terms: Behaviour Cloning, Model Predictive Control, World Model, Reinforcement Learning
Discussed in:
- Chapter 12: Sequence Models, Robotics and Control