MuZero (Schrittwieser, Antonoglou, Hubert, Simonyan, Sifre, Schmitt, Guez, Lockhart, Hassabis, Graepel, Lillicrap and Silver; DeepMind, Nature 2020) is a model-based reinforcement-learning algorithm that learns an internal model of environment dynamics directly from interaction, generalising AlphaZero to settings where the rules of the environment are unknown. It mastered chess, shogi, Go and 57 Atari games using the same architecture and the same hyperparameters, achieving superhuman performance in the board games and state-of-the-art results on Atari. It is the conceptual ancestor of AlphaProof's search-plus-learning architecture and of much of the planning-based reasoning work in modern LLMs.
The technical innovation is that MuZero does not learn the true environment dynamics; it learns a latent dynamics model that is sufficient for planning in latent space. Three networks are jointly trained:
- Representation function $h_\theta$: maps an observation history $o_{1:t}$ to a latent state $s_0 = h_\theta(o_{1:t})$.
- Dynamics function $g_\theta$: takes a latent state and an action and returns the next latent state plus a predicted reward, $(s_{k+1}, r_{k+1}) = g_\theta(s_k, a_{k+1})$.
- Prediction function $f_\theta$: from a latent state outputs a policy $p_k$ (action distribution) and a value $v_k$ (expected return), $(p_k, v_k) = f_\theta(s_k)$.
These three networks together form an abstract MDP in latent space whose semantics are learned from data. MuZero never tries to predict actual observations or reconstruct the environment; the latent state only needs to support accurate predictions of reward, value and policy, which is a much weaker constraint than predicting pixels or full game states.
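To make the interface concrete, here is a minimal sketch of the three networks as a single PyTorch module. The class and method names (`MuZeroNetworks`, `initial_inference`, `recurrent_inference`), the MLP architecture and the dimensions are illustrative assumptions, not DeepMind's implementation (the real networks are deep residual stacks).

```python
# Minimal sketch of MuZero's three networks (illustrative, not DeepMind's code).
# Assumes flat observation vectors and a discrete action space.
import torch
import torch.nn as nn


class MuZeroNetworks(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, latent_dim: int = 64):
        super().__init__()
        # h_theta: observation history -> initial latent state s_0
        self.representation = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # g_theta: (latent state, action) -> (next latent state, predicted reward)
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + num_actions, 128), nn.ReLU(),
            nn.Linear(128, latent_dim + 1))
        # f_theta: latent state -> (policy logits, value)
        self.policy_head = nn.Linear(latent_dim, num_actions)
        self.value_head = nn.Linear(latent_dim, 1)
        self.num_actions = num_actions

    def initial_inference(self, obs: torch.Tensor):
        """s_0 = h(o_{1:t}), then (p_0, v_0) = f(s_0)."""
        s = self.representation(obs)
        return s, self.policy_head(s), self.value_head(s)

    def recurrent_inference(self, s: torch.Tensor, action: torch.Tensor):
        """(s_{k+1}, r_{k+1}) = g(s_k, a_{k+1}), then (p, v) = f(s_{k+1})."""
        a_onehot = nn.functional.one_hot(action, self.num_actions).float()
        out = self.dynamics(torch.cat([s, a_onehot], dim=-1))
        s_next, reward = out[..., :-1], out[..., -1:]
        return s_next, reward, self.policy_head(s_next), self.value_head(s_next)
```

The point visible in the code is the one made above: `recurrent_inference` never touches an observation; after the first encoding, everything operates on latent vectors.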
Planning happens in latent space using Monte Carlo Tree Search (MCTS) with the learned dynamics. Starting from $s_0$, MCTS unrolls $g_\theta$ for several steps, scoring nodes via $f_\theta$ and selecting actions with a pUCT rule (a prior-weighted variant of UCB). The improved policy from MCTS, given by the visit-count distribution at the root, becomes the supervision target for the prediction network, exactly as in AlphaZero.
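A simplified version of this latent-space search might look as follows. It reuses the `MuZeroNetworks` sketch above and omits several details of the full algorithm (Dirichlet exploration noise at the root, min-max value normalisation, game-specific discounting), so treat it as a structural sketch rather than a faithful reimplementation.

```python
# Simplified latent-space MCTS over the learned dynamics (illustrative sketch).
import math
import torch


class Node:
    def __init__(self, prior: float):
        self.prior = prior        # P(s, a) from the prediction network
        self.visit_count = 0
        self.value_sum = 0.0
        self.reward = 0.0         # predicted reward on the edge into this node
        self.latent = None        # latent state s_k once the node is expanded
        self.children = {}        # action -> Node

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct(parent: Node, child: Node, c_puct: float = 1.25) -> float:
    # pUCT: exploit the child's averaged value, explore in proportion to its prior.
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + u


def expand(node: Node, policy_logits: torch.Tensor) -> None:
    probs = torch.softmax(policy_logits, dim=-1).flatten()
    node.children = {a: Node(prior=p) for a, p in enumerate(probs.tolist())}


def run_mcts(nets, root_obs, num_simulations: int = 50, gamma: float = 0.997) -> Node:
    # Encode the observation once with h_theta; everything after is latent.
    s0, policy_logits, _ = nets.initial_inference(root_obs)
    root = Node(prior=1.0)
    root.latent = s0
    expand(root, policy_logits)

    for _ in range(num_simulations):
        node, path = root, [root]
        # Selection: descend with pUCT until reaching an unexpanded node.
        while node.children:
            action, node = max(node.children.items(), key=lambda kv: puct(path[-1], kv[1]))
            path.append(node)
        # Expansion: apply the learned dynamics g_theta from the parent's latent state.
        parent = path[-2]
        s_next, reward, policy_logits, value = nets.recurrent_inference(
            parent.latent, torch.tensor([action]))
        node.latent, node.reward = s_next, reward.item()
        expand(node, policy_logits)
        # Backup: propagate the bootstrapped value estimate towards the root.
        g = value.item()
        for n in reversed(path):
            n.visit_count += 1
            n.value_sum += g
            g = n.reward + gamma * g
    return root  # root.children[a].visit_count gives the improved policy target
```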
The training loss combines three terms across $K$ unrolled steps:
$$\mathcal{L}_\theta = \sum_{k=0}^K \Big[ \ell^p(\pi_{t+k}, p_k) + \ell^v(z_{t+k}, v_k) + \ell^r(u_{t+k}, r_k) \Big] + c \|\theta\|^2,$$
where $\pi$ is the MCTS policy target, $z$ is the bootstrapped value target, $u$ is the observed reward, and the losses are cross-entropy or squared error.
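A direct translation of this loss into training code, again reusing the sketched `MuZeroNetworks` interface, might look like the following. The batch layout and function name are assumptions; the published version additionally represents values and rewards as categorical distributions and rescales gradients at each unroll step, which is omitted here.

```python
# Sketch of the K-step unrolled MuZero loss (illustrative).
import torch
import torch.nn.functional as F


def muzero_loss(nets, batch, weight_decay: float = 1e-4) -> torch.Tensor:
    """batch: obs (B, obs_dim), actions (K, B), target_pi (K+1, B, A),
    target_z (K+1, B), target_u (K, B)."""
    obs, actions, target_pi, target_z, target_u = batch

    # k = 0: encode the observation and score it with the prediction network.
    s, policy_logits, value = nets.initial_inference(obs)
    loss = (F.cross_entropy(policy_logits, target_pi[0])        # l^p against MCTS policy
            + F.mse_loss(value.squeeze(-1), target_z[0]))       # l^v against value target

    # k = 1..K: unroll the learned dynamics along the actions actually taken.
    for k in range(actions.shape[0]):
        s, reward, policy_logits, value = nets.recurrent_inference(s, actions[k])
        loss = loss + (F.cross_entropy(policy_logits, target_pi[k + 1])   # l^p
                       + F.mse_loss(value.squeeze(-1), target_z[k + 1])   # l^v
                       + F.mse_loss(reward.squeeze(-1), target_u[k]))     # l^r

    # c * ||theta||^2 regularisation term.
    loss = loss + weight_decay * sum((p ** 2).sum() for p in nets.parameters())
    return loss
```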
The practical impact of MuZero on modern AI is broad. In games, it generalises AlphaZero's tabula-rasa learning to environments without explicit rules. In robotics, it enables planning without writing down a physics model. In LLM reasoning, the architecture inspires systems that learn a value function over reasoning states (where states are partial chain-of-thought trajectories) and use search to explore them; this is the lineage that connects MuZero to AlphaProof's proof-tree search and to o1-style inference-time tree search. MuZero's principle of "learn what matters for planning, ignore what doesn't" is the conceptual key.
Two limits are worth noting. MuZero needs a clear reward signal; in environments with sparse or hard-to-define rewards, the value-function bootstrapping becomes unstable. And the latent dynamics can drift from the true dynamics over long unrolls; MuZero typically plans only 5–10 steps ahead, beyond which the model becomes unreliable. These are the same limits that any model-based RL system faces, but MuZero's empirical robustness across diverse environments was a strong signal that learned latent models could be the foundation of general planning. That signal carries directly into post-2024 reasoning models: learn the world (or the proof system, or the codebase) implicitly, then search in the learned latent space.
Related terms: AlphaProof Internals, AlphaGeometry 2, o1 / Reasoning Models, Self-Play on Verifiable Rewards, Policy Gradient Theorem, Inference-Time Scaling
Discussed in:
- Chapter 16: Ethics & Safety, Model-Based RL and Planning