Glossary

World Model

A world model is a learned generative model of an environment's dynamics that an agent uses to plan or train policies in imagination rather than only by interacting with the real environment. The intuition is that humans plan future actions by mentally simulating outcomes, and an artificial agent can do the same if it has a sufficiently good predictive model.

The Ha and Schmidhuber 2018 paper "World Models" introduced the modern framing. A vision module (variational autoencoder) compresses each observation $o_t$ into a low-dimensional latent $z_t$. A memory module (mixture-density RNN) predicts the next latent given the current latent and action: $p(z_{t+1} \mid z_t, a_t, h_t)$, where $h_t$ is the RNN state. A small linear controller, trained with an evolutionary strategy (CMA-ES), selects actions from $z_t$ and $h_t$. Critically, the controller can be trained inside the learned "dream" rather than in the real environment, and the resulting policy transfers back. Ha and Schmidhuber demonstrated the approach on car racing and the VizDoom take-cover task; in the latter, the controller was trained entirely inside the dream with no further interaction with the real environment.
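
A minimal PyTorch sketch of the three components helps fix the shapes and data flow. Everything here is illustrative: the module names, layer sizes, and the single Gaussian head in place of the paper's mixture-density output are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class VisionVAE(nn.Module):
    """V: compresses an observation o_t into a low-dimensional latent z_t."""
    def __init__(self, obs_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def encode(self, obs):
        h = self.encoder(obs)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        return z, mu, logvar

class MemoryRNN(nn.Module):
    """M: predicts z_{t+1} from (z_t, a_t) through a recurrent state h_t.
    Simplified to a single Gaussian; the paper uses a mixture-density head."""
    def __init__(self, z_dim=32, a_dim=3, h_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim + a_dim, h_dim)
        self.head = nn.Linear(h_dim, 2 * z_dim)  # mean and log-variance of z_{t+1}

    def forward(self, z, a, state):
        h, c = self.rnn(torch.cat([z, a], dim=-1), state)
        mu, logvar = self.head(h).chunk(2, dim=-1)
        return mu, logvar, (h, c)

class Controller(nn.Module):
    """C: a small linear policy over [z_t, h_t], compact enough for CMA-ES."""
    def __init__(self, z_dim=32, h_dim=256, a_dim=3):
        super().__init__()
        self.linear = nn.Linear(z_dim + h_dim, a_dim)

    def forward(self, z, h):
        return torch.tanh(self.linear(torch.cat([z, h], dim=-1)))
```

Dreaming then amounts to feeding the memory module its own sampled predictions: sample $z_{t+1}$ from the Gaussian head, pick an action with the controller, and repeat, never touching the real environment.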

Hafner's Dreamer series scaled the idea. Dreamer V1 (2020) built on the recurrent state-space model (RSSM) introduced in PlaNet, with both deterministic and stochastic latent components, and trained the policy with analytic gradients backpropagated through the imagined rollouts (the differentiable world model lets gradients flow through the dynamics). Dreamer V2 (2021) switched to categorical latents and reached human-level Atari performance. Dreamer V3 (2023) added normalisation tricks (symlog prediction of rewards and values, percentile-based return normalisation) so that a single set of hyperparameters works across Atari, ProcGen, DMLab, Minecraft, and Crafter; in Minecraft, Dreamer V3 was the first algorithm to collect diamonds without human demonstrations.
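
A single RSSM step can be sketched as follows; the names and sizes are illustrative, and a Gaussian latent is used for brevity where Dreamer V2 and V3 use categorical latents. The key structure is a deterministic GRU path plus a stochastic latent with two heads: a prior $p(z_t \mid h_t)$ used when imagining, and a posterior $q(z_t \mid h_t, o_t)$ used when a real observation is available.

```python
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    """One simplified recurrent state-space model step: deterministic h_t plus
    stochastic z_t, with a prior for imagination and a posterior for real data."""
    def __init__(self, z_dim=32, a_dim=6, h_dim=200, embed_dim=128):
        super().__init__()
        self.gru = nn.GRUCell(z_dim + a_dim, h_dim)
        self.prior_head = nn.Linear(h_dim, 2 * z_dim)             # p(z_t | h_t)
        self.post_head = nn.Linear(h_dim + embed_dim, 2 * z_dim)  # q(z_t | h_t, o_t)

    def forward(self, z_prev, a_prev, h_prev, obs_embed=None):
        h = self.gru(torch.cat([z_prev, a_prev], dim=-1), h_prev)  # deterministic path
        prior_mu, prior_logvar = self.prior_head(h).chunk(2, dim=-1)
        if obs_embed is None:   # imagination: no observation, sample from the prior
            mu, logvar = prior_mu, prior_logvar
        else:                   # real data: condition on the encoded observation
            mu, logvar = self.post_head(torch.cat([h, obs_embed], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterised sample
        return h, z, (prior_mu, prior_logvar), (mu, logvar)
```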

The training objective combines a representation-learning loss (reconstruct observations from latents), a reward-prediction loss, a continuation (episode-not-done) prediction loss, and a dynamics-prediction loss, all summed over real trajectories drawn from a replay buffer. In the Dreamer family the dynamics term is a KL divergence between the posterior $q(z_t \mid h_t, o_t)$, inferred from the observation, and the prior $p(z_t \mid h_t)$, predicted from the recurrent state alone:

$$\mathcal{L} = \mathcal{L}_\text{recon} + \mathcal{L}_\text{reward} + \mathcal{L}_\text{cont} + \beta\, \mathrm{KL}\big(q(z_t \mid h_t, o_t) \,\|\, p(z_t \mid h_t)\big)$$
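
A hedged sketch of how those terms might be combined on a batch of real trajectory steps, given the predictions of the model above; the free-bits clipping, KL balancing, and symlog transforms of the actual Dreamer implementations are omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def world_model_loss(obs, obs_recon, reward, reward_pred, cont, cont_logit,
                     post_mu, post_logvar, prior_mu, prior_logvar, beta=1.0):
    recon = F.mse_loss(obs_recon, obs)                  # reconstruct observations from latents
    rew = F.mse_loss(reward_pred, reward)               # predict the reward
    cont_loss = F.binary_cross_entropy_with_logits(cont_logit, cont)  # continuation flag
    posterior = Normal(post_mu, (0.5 * post_logvar).exp())
    prior = Normal(prior_mu, (0.5 * prior_logvar).exp())
    dyn = kl_divergence(posterior, prior).sum(-1).mean()  # dynamics term: KL(q || p)
    return recon + rew + cont_loss + beta * dyn
```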

An actor-critic is then trained on rollouts imagined inside the world model, dramatically improving sample efficiency.
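
The imagination phase can be sketched as below, reusing the RSSMStep interface from above. This is a toy version under stated assumptions: the actor is assumed to return a torch distribution, reward_head and critic are assumed learned heads over the latent features, and the lambda-returns, entropy bonus, and return normalisation of the real Dreamer agents are left out.

```python
import torch

def imagine_and_update(rssm, actor, critic, reward_head, z, h, horizon=15, gamma=0.99):
    """Roll the policy forward inside the learned model (no environment steps)
    and form simple actor and critic losses on the imagined trajectory."""
    feats, rewards = [], []
    for _ in range(horizon):
        action = actor(torch.cat([h, z], dim=-1)).rsample()  # reparameterised sample
        h, z, _, _ = rssm(z, action, h)                       # dream one step from the prior
        feat = torch.cat([h, z], dim=-1)
        feats.append(feat)
        rewards.append(reward_head(feat))
    # discounted imagined returns, bootstrapped from the critic at the last state
    returns, ret = [], critic(feats[-1]).detach()
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    returns = torch.stack(list(reversed(returns)))
    values = torch.stack([critic(f.detach()) for f in feats])
    actor_loss = -returns.mean()                              # maximise imagined return
    critic_loss = ((values - returns.detach()) ** 2).mean()   # regress values onto returns
    return actor_loss, critic_loss
```

Because every step of the rollout is differentiable, minimising `actor_loss` pushes gradients through the dynamics into the actor, which is the Dreamer V1 style of policy learning; Dreamer V2 and V3 rely more on REINFORCE-style gradients.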

World models bridge model-free RL and model-based control. Compared to model-free RL, they are far more sample-efficient (humans learn a new game within a few hundred episodes; Dreamer reaches strong Atari scores with a few million frames where DQN needed hundreds of millions). Compared to MPC, they integrate seamlessly with neural perception and learned policies: the model is differentiable, so policy gradients flow through it, and the policy is amortised rather than re-optimised at every step.
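
The contrast with MPC can be made concrete with a toy example: both approaches use the same learned, differentiable dynamics, but MPC re-optimises an action sequence at every state, while an amortised policy answers with a single forward pass. Everything below (module names, sizes, the gradient-ascent planner) is a hypothetical sketch, not any particular system's implementation.

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 8, 2, 10
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_head = nn.Linear(state_dim, 1)        # stand-in for a learned reward model
policy = nn.Linear(state_dim, action_dim)    # amortised policy: one forward pass per decision

def imagined_return(state, actions):
    """Return of an action sequence rolled out under the learned dynamics."""
    total = 0.0
    for a in actions:
        state = dynamics(torch.cat([state, torch.tanh(a)], dim=-1))
        total = total + reward_head(state)
    return total

state = torch.zeros(1, state_dim)

# MPC style: re-optimise a fresh plan at this state by gradient ascent on the
# imagined return, then execute only the first action and repeat next step.
plan = torch.zeros(horizon, 1, action_dim, requires_grad=True)
planner = torch.optim.Adam([plan], lr=0.1)
for _ in range(50):
    planner.zero_grad()
    (-imagined_return(state, plan)).mean().backward()
    planner.step()
mpc_action = torch.tanh(plan[0]).detach()

# Amortised policy: the optimisation cost was paid once, at training time,
# inside the world model; acting is now a single forward pass.
policy_action = torch.tanh(policy(state))
```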

World models are a current frontier of AI research. MuZero (Schrittwieser et al. 2020) combines a learned world model with Monte Carlo tree search to master Go, chess, shogi, and Atari with a single algorithm. Genie (Bruce et al. 2024) learns, from unlabelled video alone, a world model that responds to action-like latent inputs. The conceptual line to LLMs is clean: a world model is to an embodied agent what a language model is to a text agent, a generative simulator of the future used for planning and decision-making.

Related terms: Model Predictive Control, Reinforcement Learning, Markov Decision Process, Policy Gradient Theorem, Diffusion Policy
