Also known as: MDP
A Markov decision process (MDP) is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$:
- $\mathcal{S}$: state space.
- $\mathcal{A}$: action space (may depend on state).
- $P(s' | s, a)$: transition probability of moving to state $s'$ from state $s$ given action $a$.
- $R(s, a)$: expected reward function. Sometimes specified as $R(s, a, s')$, depending also on the next state.
- $\gamma \in [0, 1]$: discount factor.
The Markov property requires that the next state and reward depend only on the current state and action, not on the history: $P(s_{t+1}, r_t | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1}, r_t | s_t, a_t)$.
A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ maps states to action distributions. The objective is to find $\pi$ maximising the expected discounted return
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t)\right].$$
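As a concrete sketch, the discounted return $J(\pi)$ can be estimated by Monte Carlo rollouts on a toy tabular MDP. All states, transition probabilities, and rewards below are illustrative, not taken from the text:

```python
import random

# A hypothetical 2-state, 2-action MDP, purely for illustration.
# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 0.5), (0, 0.5)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9

def sample_next(s, a, rng):
    """Draw s' ~ P(. | s, a)."""
    states, probs = zip(*P[s][a])
    return rng.choices(states, weights=probs)[0]

def estimate_return(policy, s0=0, episodes=2000, horizon=100, seed=0):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t R(s_t, a_t)],
    truncated at a finite horizon (the gamma^t tail is negligible)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, g = s0, 0.0
        for t in range(horizon):
            a = policy(s)
            g += (GAMMA ** t) * R[s][a]
            s = sample_next(s, a, rng)
        total += g
    return total / episodes

# A deterministic policy: always take action 1.
J = estimate_return(lambda s: 1)
```

Under this policy every step earns a reward of 1 or 2, so the estimate should land between $\sum_t \gamma^t \cdot 1 \approx 10$ and $\sum_t \gamma^t \cdot 2 \approx 20$.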
Variants:
Partially observable MDP (POMDP): the agent does not directly observe the state, only an observation $o_t \sim O(\cdot | s_t)$. The agent must maintain a belief state (a distribution over the true state), which itself forms a Markov decision process over belief space. POMDPs are, in general, vastly harder to solve than MDPs.
Continuous MDPs: $\mathcal{S}, \mathcal{A} \subseteq \mathbb{R}^n$. Standard for robotics and continuous control. Function approximation (neural networks) is essential for tractability.
Multi-agent MDPs: multiple agents take actions simultaneously, each potentially with its own reward. Game-theoretic equilibria (Nash, correlated, etc.) replace single-agent optimality.
Bandit problems: degenerate MDPs with one state and immediate-only rewards. Pure exploration-exploitation problems.
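The POMDP belief state mentioned above is maintained by a Bayes update: predict the next state through the transition model, then reweight by the observation likelihood. A minimal sketch on a hypothetical 2-state POMDP fragment (all matrices are made-up illustrations):

```python
# Hypothetical 2-state POMDP fragment; names T and O_ are illustrative.
T = {0: [[0.9, 0.1], [0.2, 0.8]]}    # T[a][s][s']  = P(s' | s, a)
O_ = {0: [[0.7, 0.3], [0.4, 0.6]]}   # O_[a][s'][o] = P(o | s', a)

def belief_update(b, a, o):
    """b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    n = len(b)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(T[a][s][sp] * b[s] for s in range(n)) for sp in range(n)]
    # Correction step: reweight by the observation likelihood.
    unnorm = [O_[a][sp][o] * predicted[sp] for sp in range(n)]
    z = sum(unnorm)                  # P(o | b, a), the normaliser
    return [u / z for u in unnorm]

# Starting from a uniform belief, take action 0 and observe o = 0.
b1 = belief_update([0.5, 0.5], a=0, o=0)
```

The returned belief is again a probability distribution, which is exactly why the belief-space process is itself an MDP.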
Solution complexity: for a tabular MDP with $|\mathcal{S}| = n$ states and $|\mathcal{A}| = m$ actions, each sweep of value iteration costs $O(n^2 m)$, and on the order of $\log(1/\epsilon)/(1-\gamma)$ sweeps suffice for an $\epsilon$-optimal value function; linear programming solves the problem exactly in polynomial time. General continuous-state MDPs are intractable and are solved approximately. Solving finite-horizon POMDPs is PSPACE-hard even with finite state spaces, and the infinite-horizon case is undecidable in general.
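Value iteration itself is a short loop: repeatedly apply the Bellman optimality backup until the value function stops changing, then read off the greedy policy. A self-contained sketch on a made-up 2-state, 2-action MDP (each sweep is the $O(n^2 m)$ cost described above):

```python
# Illustrative 2-state, 2-action tabular MDP; all numbers are made up.
P = [  # P[s][a][s'] = transition probability
    [[0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [0.0, 2.0]]   # R[s][a] = expected reward
GAMMA, EPS = 0.9, 1e-8

def value_iteration(P, R, gamma, eps):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    while True:
        # One O(n^2 m) Bellman backup sweep.
        Q = [[R[s][a] + gamma * sum(P[s][a][sp] * V[sp] for sp in range(n))
              for a in range(m)] for s in range(n)]
        V_new = [max(Q[s]) for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < eps:
            # Converged: return values and the greedy policy.
            return V_new, [max(range(m), key=lambda a: Q[s][a]) for s in range(n)]
        V = V_new

V_star, pi_star = value_iteration(P, R, GAMMA, EPS)
```

For this toy MDP the greedy policy takes action 1 in both states, and the fixed point can be checked by hand against the Bellman equations.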
In modern AI, the MDP framework underlies:
- Game playing (chess, Go, Atari, StarCraft).
- Robotics (manipulation, locomotion, navigation).
- Self-driving (each driving step is an MDP transition).
- Recommender systems (sequential interactions with users).
- LLM post-training: each token in a response is a step in an MDP, with the prompt being the initial state and human/AI feedback providing the reward signal.
Related terms: Bellman Equation, Reinforcement Learning, Policy Gradient Theorem, Q-Learning
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI