Glossary

Behaviour Cloning

Behaviour cloning (BC) is the simplest imitation-learning method: collect a dataset of (state, action) pairs $\mathcal{D} = \{(s_i, a_i)\}$ from an expert demonstrator, and fit a parameterised policy $\pi_\theta(a \mid s)$ by maximum likelihood:

$$\theta^* = \arg\max_\theta \sum_{(s, a) \in \mathcal{D}} \log \pi_\theta(a \mid s)$$

For continuous actions with a Gaussian policy this reduces to mean-squared error between predicted and expert actions; for discrete actions it is cross-entropy. The reduction to standard supervised learning is what makes BC attractive: any classification or regression model (MLPs, CNNs, transformers, diffusion models) can be trained as a policy, with off-the-shelf optimisers and well-understood scaling laws.
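
To make the supervised-learning reduction concrete, here is a minimal behaviour-cloning sketch in PyTorch. It assumes continuous actions and a unit-variance Gaussian policy (so maximum likelihood reduces to MSE); the dimensions and the random "expert" data are placeholders standing in for real teleoperation logs.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 10, 4        # hypothetical dimensions

# pi_theta(a | s): the network predicts the mean action of a fixed-variance Gaussian.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder expert dataset; in practice these are logged (state, action) pairs.
states = torch.randn(1024, STATE_DIM)
actions = torch.randn(1024, ACTION_DIM)

for epoch in range(100):
    pred = policy(states)
    # MSE equals the Gaussian negative log-likelihood up to a constant,
    # so minimising it is the maximum-likelihood objective above.
    loss = ((pred - actions) ** 2).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```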

BC is widely used because it works surprisingly well when the data is sufficient and the task is not too sensitive to errors. The current generation of generalist robot policies (RT-1, RT-2, OpenVLA, Octo, $\pi_0$) is fundamentally behaviour-cloned from large datasets of human teleoperation, with diffusion-policy or autoregressive token-prediction action heads. In autonomous driving, ALVINN (1989) was a behaviour-cloned neural network that could steer a van; modern Tesla and Waymo perception-to-action stacks descend from this lineage. In games, AlphaStar's initial policy was behaviour-cloned from human StarCraft replays before being fine-tuned with self-play.

The central failure mode is covariate shift, also called compounding error or distribution shift. The policy is trained on states drawn from the expert's distribution, but at test time it visits states drawn from its own distribution. Small action errors push the agent into states that were rare or absent in the training data, where the policy is now extrapolating. Errors compound quadratically with horizon: Ross and Bagnell (2010) showed that a BC policy with one-step error $\epsilon$ accrues total cost $\mathcal{O}(\epsilon T^2)$ over $T$ steps, compared to $\mathcal{O}(\epsilon T)$ for an interactive expert. A car that drifts an inch off-centre per second is fine for ten seconds and on the verge in a minute.
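
A rough intuition for the quadratic bound (a heuristic sketch, not Ross and Bagnell's formal argument): if the policy errs with probability at most $\epsilon$ at each step, and an error at step $t$ can leave the agent off-distribution and incurring unit cost at every remaining step, the expected total cost is bounded by roughly

$$\sum_{t=1}^{T} \epsilon \,(T - t + 1) \;=\; \epsilon \,\frac{T(T+1)}{2} \;=\; \mathcal{O}(\epsilon T^2).$$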

The standard fix is DAgger (Dataset Aggregation; Ross, Gordon, and Bagnell, 2011). DAgger iterates: train a policy on the current dataset; roll out the policy in the environment; for each visited state, query the expert for the correct action; aggregate the new (state, expert-action) pairs into the dataset; retrain. Because the dataset eventually covers the states the policy actually visits, the covariate shift vanishes. DAgger gives no-regret guarantees and reduces compounding error to $\mathcal{O}(\epsilon T)$, but it requires an interactive expert, which is often impractical (a human teleoperator cannot be available for every state a policy visits).
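
The loop is short enough to sketch directly. The helper names (`expert_action`, `train_policy`) and the gym-style environment interface below are hypothetical placeholders, not any particular library's API:

```python
def dagger(env, expert_action, train_policy, n_iters=10, episodes_per_iter=20):
    dataset = []                       # aggregated (state, expert action) pairs
    policy = expert_action             # iteration 1 rolls out the expert itself (beta_1 = 1)
    for _ in range(n_iters):
        # Roll out the *current* policy so the dataset covers states it actually visits.
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                action = policy(state)
                # Label every visited state with the expert's action, not the policy's.
                dataset.append((state, expert_action(state)))
                state, reward, done = env.step(action)
        # Retrain on the aggregated dataset before the next round of rollouts.
        policy = train_policy(dataset)
    return policy
```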

Other mitigations include action chunking (predict a short horizon of actions to reduce per-step error frequency, used in ACT and diffusion policy); explicit policy regularisation (penalise deviations from training-state distribution); inverse reinforcement learning (recover a reward function from demonstrations and then optimise it, more robust to compounding error); and offline reinforcement learning (use the demonstrations as a value-learning dataset rather than direct supervision). Behaviour cloning remains the dominant entry point because it is simple, scalable, and excellent for warm-starting more sophisticated methods.
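
As an illustration of the first of these, a minimal action-chunking sketch (hypothetical dimensions and a gym-like environment; not the ACT or Diffusion Policy implementation): one forward pass predicts a chunk of $H$ actions that are executed open-loop, so the policy is queried (and can inject a fresh error) roughly $T/H$ times per episode instead of $T$.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, H = 10, 4, 8      # hypothetical sizes; H is the chunk length

chunk_policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, H * ACTION_DIM),      # one forward pass predicts H consecutive actions
)

def run_episode(env, max_steps=400):
    state = env.reset()                  # assumes a gym-like interface
    for _ in range(0, max_steps, H):
        with torch.no_grad():
            chunk = chunk_policy(torch.as_tensor(state, dtype=torch.float32))
        for action in chunk.view(H, ACTION_DIM):   # execute the chunk open-loop
            state, reward, done = env.step(action.numpy())
            if done:
                return
```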

Related terms: Diffusion Policy, Reinforcement Learning, Model Predictive Control, World Model
