Glossary

Markov Chain

A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ satisfying the Markov property:

$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$

The future depends only on the present, not on the past: the process is "memoryless".

For a chain with finite state space $\mathcal{S} = \{1, \ldots, K\}$, the dynamics are specified by:

  • An initial distribution $\pi_0 \in \mathbb{R}^K$, $\pi_{0,i} = P(X_0 = i)$.
  • A transition matrix $P \in \mathbb{R}^{K \times K}$ with $P_{ij} = P(X_{t+1} = j | X_t = i)$.

Each row of $P$ is a probability distribution: $\sum_j P_{ij} = 1$.

State distribution after $t$ steps: $\pi_t = \pi_0 P^t$.
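The matrix-power formula can be checked directly. A minimal sketch with a hypothetical two-state chain (the numbers are illustrative only):

```python
import numpy as np

# Toy 2-state chain with made-up transition probabilities.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi0 = np.array([1.0, 0.0])  # start in state 0 with certainty

# Distribution after t steps: pi_t = pi_0 P^t
t = 10
pi_t = pi0 @ np.linalg.matrix_power(P, t)
print(pi_t)  # still a probability vector: entries sum to 1
```

Note the row-vector convention: because each row of $P$ sums to 1, right-multiplying a distribution by $P$ yields another distribution.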

A stationary distribution $\pi^*$ satisfies $\pi^* = \pi^* P$; it is a left eigenvector of $P$ with eigenvalue 1. For an ergodic chain (irreducible and aperiodic), $\pi_t \to \pi^*$ as $t \to \infty$ regardless of $\pi_0$.
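The left-eigenvector characterisation suggests one way to compute $\pi^*$: take the eigenvector of $P^\top$ for eigenvalue 1 and normalise it. A sketch, reusing the same hypothetical toy matrix:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# pi* = pi* P means pi* is an eigenvector of P.T with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))       # pick the eigenvalue closest to 1
pi_star = np.real(vecs[:, k])
pi_star /= pi_star.sum()                # normalise to a distribution

print(pi_star)            # for this chain, (5/6, 1/6)
print(pi_star @ P)        # stationarity: equals pi_star
```

For large sparse chains, power iteration ($\pi \leftarrow \pi P$ repeated until convergence) is the more practical route; this is essentially how PageRank is computed.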

Irreducibility: every state can be reached from every other (with positive probability in finite time).

Aperiodicity: the gcd of the possible return times to any state is 1. (A periodic chain cycles between sets of states deterministically, so $\pi_t$ oscillates and convergence to $\pi^*$ fails.)

Detailed balance (sufficient for stationarity): $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$. A chain satisfying detailed balance is reversible.
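Detailed balance is easy to verify numerically: the matrix of probability flows $\pi_i P_{ij}$ must be symmetric. A sketch using a hypothetical 3-state birth-death chain (birth-death chains are always reversible; the numbers here are made up):

```python
import numpy as np

# A 3-state birth-death chain (only moves to neighbouring states).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])  # its stationary distribution

# Detailed balance: pi_i * P_ij == pi_j * P_ji for all i, j,
# i.e. the flow matrix is symmetric.
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))  # reversibility holds
print(np.allclose(pi @ P, pi))      # and hence pi is stationary
```

Summing the detailed-balance identity over $i$ gives $\sum_i \pi_i P_{ij} = \pi_j$, which is exactly stationarity; this is why detailed balance is a sufficient (but not necessary) condition.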

MCMC algorithms construct chains with desired stationary distributions:

  • Metropolis-Hastings: detailed balance with respect to the target $\pi$ via an accept/reject step.
  • Gibbs sampling: detailed balance via coordinate-wise conditional updates.
  • Hamiltonian Monte Carlo: leverages auxiliary momentum variables and Hamiltonian dynamics.
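The first of these can be sketched in a few lines. A minimal random-walk Metropolis sampler targeting a standard normal (a symmetric proposal, so the Hastings correction cancels and the acceptance ratio is just $\pi(x')/\pi(x)$); the step size and run lengths are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalised log-density of N(0, 1); MCMC only needs pi up to a constant.
    return -0.5 * x**2

x, samples = 0.0, []
for _ in range(20000):
    x_prop = x + rng.normal(scale=1.0)   # symmetric random-walk proposal
    # Accept with probability min(1, pi(x')/pi(x)), done in log space.
    if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
        x = x_prop
    samples.append(x)                    # on rejection, repeat the current x

samples = np.array(samples[5000:])       # discard burn-in
print(samples.mean(), samples.std())     # close to 0 and 1
```

Each iteration is one step of a Markov chain whose stationary distribution is the target, by construction via detailed balance.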

In AI / ML:

  • Hidden Markov models: latent state is a Markov chain.
  • Markov decision processes: the state-action transitions are Markov.
  • Diffusion models: forward and reverse processes are Markov chains in continuous state.
  • Language modelling at character/token level: an autoregressive model $P(x_t \mid x_{<t})$ that conditions on only the last $k$ tokens $x_{t-k:t-1}$ (limited context) is an order-$k$ Markov chain, equivalently a first-order chain over $k$-token states.
  • PageRank: stationary distribution of the random-walk transition matrix on a directed graph.
  • Markov random fields: undirected graphical models with Markov properties on the graph.

Mixing time $\tau_\mathrm{mix}$ is the time for the chain to get within a fixed distance of $\pi^*$ from any starting distribution. It is controlled by the spectral gap: $\tau_\mathrm{mix} = O(1/(1 - \lambda_2))$ up to logarithmic factors, where $\lambda_2$ is the second-largest eigenvalue modulus of $P$ (the largest is always 1). Slow mixing limits MCMC's practical applicability.
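The spectral quantities are cheap to compute for small chains. A sketch on the same hypothetical two-state matrix used above:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Eigenvalue moduli, sorted descending; the top one is always 1 for a
# stochastic matrix, and lambda_2 governs the geometric convergence rate.
moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lam2 = moduli[1]
gap = 1.0 - lam2
print(lam2, gap)  # here lambda_2 = 0.4, spectral gap = 0.6
```

The distance to stationarity shrinks like $\lambda_2^t$, so a gap near 1 means fast mixing and a gap near 0 (e.g. nearly disconnected state spaces) means the chain mixes slowly.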

Related terms: Markov Decision Process, Hidden Markov Model, MCMC
