The Bellman equation is the foundational equation of dynamic programming and reinforcement learning. For a Markov decision process with states $s$, actions $a$, transition probabilities $P(s' | s, a)$, reward function $R(s, a)$, and discount factor $\gamma \in [0, 1)$, the value function $V^\pi$ of policy $\pi$ satisfies
$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^\pi(s')\right]$$
The action-value function $Q^\pi$ satisfies
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \mathbb{E}_{a' \sim \pi(\cdot|s')}[Q^\pi(s', a')]$$
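For a fixed policy these equations are linear in the values, so $V^\pi$ can be computed exactly on a small MDP by solving $V^\pi = R_\pi + \gamma P_\pi V^\pi$. A minimal sketch, assuming NumPy and a hypothetical 2-state, 2-action MDP with made-up numbers:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = P(s' | s, a); R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# A fixed stochastic policy: pi[s, a] = pi(a | s).
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Under pi the Bellman equation is linear, V = R_pi + gamma * P_pi @ V,
# so V^pi = (I - gamma * P_pi)^{-1} R_pi.
R_pi = (pi * R).sum(axis=1)              # E_{a~pi}[R(s, a)]
P_pi = np.einsum("sa,sat->st", pi, P)    # sum_a pi(a|s) P(s'|s, a)
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# V_pi is an exact fixed point of the Bellman expectation equation.
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```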
The Bellman optimality equations characterise the optimal value functions:
$$V^*(s) = \max_a \!\left[R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s')\right]$$
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q^*(s', a')$$
Once $V^*$ or $Q^*$ is known, an optimal policy is obtained greedily: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
Solution methods:
Value iteration iteratively applies the Bellman optimality operator:
$$V^{(k+1)}(s) = \max_a [R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^{(k)}(s')]$$
The Bellman operator is a contraction with modulus $\gamma$ in the supremum norm, so $V^{(k)} \to V^*$ geometrically.
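Both the iteration and the contraction property can be checked numerically on a toy MDP (hypothetical 2-state example with made-up numbers; NumPy assumed):

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman_optimality_operator(V):
    # (T V)(s) = max_a [R(s, a) + gamma * sum_s' P(s'|s, a) V(s')]
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return Q.max(axis=1)

# Value iteration: apply T until successive iterates stop changing.
V = np.zeros(2)
for _ in range(500):
    V_new = bellman_optimality_operator(V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# V is (numerically) the fixed point V*.
assert np.allclose(V, bellman_optimality_operator(V))

# Contraction check: ||T V1 - T V2|| <= gamma * ||V1 - V2|| in sup norm.
rng = np.random.default_rng(0)
V1, V2 = rng.normal(size=2), rng.normal(size=2)
T = bellman_optimality_operator
assert np.max(np.abs(T(V1) - T(V2))) <= gamma * np.max(np.abs(V1 - V2)) + 1e-12
```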
Policy iteration alternates policy evaluation (compute $V^\pi$ for the current $\pi$) and policy improvement ($\pi'(s) = \arg\max_a Q^\pi(s, a)$). It converges in finitely many steps for finite MDPs, since each improvement yields a strictly better policy until the greedy policy is stable.
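A sketch of policy iteration on a hypothetical 2-state, 2-action MDP (illustrative numbers; exact evaluation via a linear solve):

```python
import numpy as np

# Toy MDP (made-up numbers): P[s, a, s'] and R[s, a].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def evaluate(pi_det):
    # Exact evaluation of a deterministic policy (array of actions per state).
    P_pi = P[np.arange(2), pi_det]   # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = R[np.arange(2), pi_det]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

pi_det = np.zeros(2, dtype=int)      # start from an arbitrary policy
while True:
    V = evaluate(pi_det)                          # policy evaluation
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    pi_new = Q.argmax(axis=1)                     # greedy improvement
    if np.array_equal(pi_new, pi_det):
        break                                     # stable policy => optimal
    pi_det = pi_new
```

On exit, `pi_det` is an optimal policy and `V` equals $V^*$ up to floating point.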
Linear programming can also solve finite MDPs exactly: the primal LP minimises $\sum_s V(s)$ subject to $V(s) \ge R(s, a) + \gamma \sum_{s'} P(s' | s, a) V(s')$ for all $(s, a)$, while the dual LP optimises over state-action occupancy measures.
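The primal LP can be set up directly with an off-the-shelf solver. A sketch assuming SciPy's `linprog` and a hypothetical 2-state, 2-action MDP with made-up numbers:

```python
import numpy as np
from scipy.optimize import linprog  # SciPy assumed available

# Toy MDP (illustrative numbers): P[s, a, s'] and R[s, a].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
n_states, n_actions = R.shape

# Primal LP: minimise sum_s V(s) subject to
#   V(s) >= R(s, a) + gamma * sum_s' P(s'|s, a) V(s')   for all (s, a),
# rearranged into linprog's A_ub @ V <= b_ub form:
#   (gamma * P[s, a] - e_s) @ V <= -R(s, a)
A_ub = np.concatenate(
    [gamma * P[:, a, :] - np.eye(n_states) for a in range(n_actions)])
b_ub = -np.concatenate([R[:, a] for a in range(n_actions)])

res = linprog(c=np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_states)
V_star = res.x  # equals V* at the optimum
```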
In RL, where $P$ and $R$ are unknown, the Bellman equation is enforced approximately from sampled transitions $(s, a, r, s')$:
- TD learning: $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ (one-step TD).
- Q-learning: $Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$.
- Deep RL: minimise the squared TD error as a loss for a neural Q-network or value network.
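The Q-learning update above can be sketched in tabular form on a hypothetical 2-state, 2-action MDP (made-up numbers; the agent only ever sees sampled transitions, never $P$ or $R$ directly):

```python
import numpy as np

# Toy MDP used only to *simulate* the environment (illustrative numbers).
rng = np.random.default_rng(0)
gamma, alpha, eps = 0.9, 0.1, 0.1
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

Q = np.zeros((2, 2))
s = 0
for _ in range(50_000):
    # Epsilon-greedy behaviour policy for exploration.
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    r = R[s, a]                             # sampled reward
    s_next = int(rng.choice(2, p=P[s, a]))  # sampled next state
    # Off-policy TD update: bootstrap with the max over next actions.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

With enough exploration and a suitable step size, `Q` approaches $Q^*$; the greedy policy `Q.argmax(axis=1)` then approximates $\pi^*$.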
Every modern RL algorithm, from tabular Q-learning to AlphaZero to RLHF for language models, has the Bellman equation as its theoretical foundation.
Related terms: Q-Learning, Reinforcement Learning, DQN, Temporal-Difference Learning, Markov Decision Process
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI