Also known as: MDP
A Markov decision process (MDP) is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$:
- $\mathcal{S}$: state space.
- $\mathcal{A}$: action space (may depend on state).
- $P(s' | s, a)$: transition probability of moving to state $s'$ from state $s$ given action $a$.
- $R(s, a)$: expected reward function. Sometimes specified as $R(s, a, s')$, depending also on the next state.
- $\gamma \in [0, 1]$: discount factor.
The Markov property requires that the next state and reward depend only on the current state and action, not on the history: $P(s_{t+1}, r_t | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1}, r_t | s_t, a_t)$.
A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ maps states to action distributions. The objective is to find $\pi$ maximising the expected discounted return
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t)\right].$$
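As a concrete sketch, the discounted return $J(\pi)$ can be estimated by Monte Carlo rollouts on a toy tabular MDP. All states, transition probabilities, and rewards below are illustrative, not taken from the text:

```python
import random

# A hypothetical 2-state, 2-action MDP, purely for illustration.
# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 0.5), (0, 0.5)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9

def sample_next(s, a, rng):
    """Draw s' ~ P(. | s, a)."""
    states, probs = zip(*P[s][a])
    return rng.choices(states, weights=probs)[0]

def estimate_return(policy, s0=0, episodes=2000, horizon=100, seed=0):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t R(s_t, a_t)],
    truncated at a finite horizon (the gamma^t tail is negligible)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, g = s0, 0.0
        for t in range(horizon):
            a = policy(s)
            g += (GAMMA ** t) * R[s][a]
            s = sample_next(s, a, rng)
        total += g
    return total / episodes

# A deterministic policy: always take action 1.
J = estimate_return(lambda s: 1)
```

Under this policy every step earns a reward of 1 or 2, so the estimate should land between $\sum_t \gamma^t \cdot 1 \approx 10$ and $\sum_t \gamma^t \cdot 2 \approx 20$.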
Variants:
Partially observable MDP (POMDP): the agent does not directly observe the state, only an observation $o_t \sim O(\cdot | s_t)$. The agent must maintain a belief state (a distribution over the true state), which itself forms a Markov decision process over belief space. POMDPs are, in general, vastly harder to solve than MDPs.
Continuous MDPs: $\mathcal{S}, \mathcal{A} \subseteq \mathbb{R}^n$. Standard for robotics and continuous control. Function approximation (neural networks) is essential for tractability.
Multi-agent MDPs: multiple agents take actions simultaneously, each potentially with its own reward. Game-theoretic equilibria (Nash, correlated, etc.) replace single-agent optimality.
Bandit problems: degenerate MDPs with one state and immediate-only rewards. Pure exploration-exploitation problems.
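The POMDP belief state mentioned above is maintained by a Bayes update: predict the next state through the transition model, then reweight by the observation likelihood. A minimal sketch on a hypothetical 2-state POMDP fragment (all matrices are made-up illustrations):

```python
# Hypothetical 2-state POMDP fragment; names T and O_ are illustrative.
T = {0: [[0.9, 0.1], [0.2, 0.8]]}    # T[a][s][s']  = P(s' | s, a)
O_ = {0: [[0.7, 0.3], [0.4, 0.6]]}   # O_[a][s'][o] = P(o | s', a)

def belief_update(b, a, o):
    """b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    n = len(b)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(T[a][s][sp] * b[s] for s in range(n)) for sp in range(n)]
    # Correction step: reweight by the observation likelihood.
    unnorm = [O_[a][sp][o] * predicted[sp] for sp in range(n)]
    z = sum(unnorm)                  # P(o | b, a), the normaliser
    return [u / z for u in unnorm]

# Starting from a uniform belief, take action 0 and observe o = 0.
b1 = belief_update([0.5, 0.5], a=0, o=0)
```

The returned belief is again a probability distribution, which is exactly why the belief-space process is itself an MDP.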
Solution complexity: for a tabular MDP with $|\mathcal{S}| = n$ states and $|\mathcal{A}| = m$ actions, each sweep of value iteration costs $O(n^2 m)$, and on the order of $\log(1/\epsilon)/(1-\gamma)$ sweeps suffice for an $\epsilon$-optimal value function; linear programming solves the problem exactly in polynomial time. General continuous-state MDPs are intractable and are solved approximately. Solving finite-horizon POMDPs is PSPACE-hard even with finite state spaces, and the infinite-horizon case is undecidable in general.
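Value iteration itself is a short loop: repeatedly apply the Bellman optimality backup until the value function stops changing, then read off the greedy policy. A self-contained sketch on a made-up 2-state, 2-action MDP (each sweep is the $O(n^2 m)$ cost described above):

```python
# Illustrative 2-state, 2-action tabular MDP; all numbers are made up.
P = [  # P[s][a][s'] = transition probability
    [[0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [0.0, 2.0]]   # R[s][a] = expected reward
GAMMA, EPS = 0.9, 1e-8

def value_iteration(P, R, gamma, eps):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    while True:
        # One O(n^2 m) Bellman backup sweep.
        Q = [[R[s][a] + gamma * sum(P[s][a][sp] * V[sp] for sp in range(n))
              for a in range(m)] for s in range(n)]
        V_new = [max(Q[s]) for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < eps:
            # Converged: return values and the greedy policy.
            return V_new, [max(range(m), key=lambda a: Q[s][a]) for s in range(n)]
        V = V_new

V_star, pi_star = value_iteration(P, R, GAMMA, EPS)
```

For this toy MDP the greedy policy takes action 1 in both states, and the fixed point can be checked by hand against the Bellman equations.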
In modern AI, the MDP framework underlies:
- Game playing (chess, Go, Atari, StarCraft).
- Robotics (manipulation, locomotion, navigation).
- Self-driving (each driving step is an MDP transition).
- Recommender systems (sequential interactions with users).
- LLM post-training: each token in a response is a step in an MDP, with the prompt being the initial state and human/AI feedback providing the reward signal.
Related terms: Bellman Equation, Reinforcement Learning, Policy Gradient Theorem, Q-Learning
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI