Glossary

Bellman Equation

The Bellman equation is the foundational equation of dynamic programming and reinforcement learning. For a Markov decision process with states $s$, actions $a$, transition $P(s' | s, a)$, reward $R(s, a)$ and discount $\gamma \in [0, 1)$, the value function $V^\pi$ of policy $\pi$ satisfies

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^\pi(s')\right]$$

The action-value function $Q^\pi$ satisfies

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \mathbb{E}_{a' \sim \pi(\cdot|s')}[Q^\pi(s', a')]$$

The Bellman optimality equations characterise the optimal value functions:

$$V^*(s) = \max_a \!\left[R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s')\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q^*(s', a')$$

Once $V^*$ or $Q^*$ is known, the optimal policy is greedy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Solution methods:

Value iteration iteratively applies the Bellman optimality operator:

$$V^{(k+1)}(s) = \max_a [R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^{(k)}(s')]$$

The Bellman operator is a contraction with modulus $\gamma$ in the supremum norm, so $V^{(k)} \to V^*$ geometrically.

Policy iteration alternates policy evaluation (compute $V^\pi$ for current $\pi$) and policy improvement ($\pi'(s) = \arg\max_a Q^\pi(s, a)$). Converges in finitely many steps for finite MDPs.

Linear programming can also solve MDPs exactly: the dual LP is the standard formulation.

In RL, when $P$ and $R$ are unknown, the Bellman equation is enforced approximately:

  • TD learning: $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ (one-step TD).
  • Q-learning: $Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$.
  • Deep RL: minimise the squared TD error as a loss for a neural Q-network or value network.

Every modern RL algorithm, from tabular Q-learning to AlphaZero to RLHF for language models, has the Bellman equation as its theoretical foundation.

Video

Related terms: Q-Learning, Reinforcement Learning, DQN, Temporal-Difference Learning, Markov Decision Process

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).