Q-learning (Watkins 1989) is a model-free, off-policy reinforcement-learning algorithm. It estimates the optimal action-value function
$$Q^*(s, a) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a, \pi^*\right],$$
the expected discounted return from taking action $a$ in state $s$ and thereafter following the optimal policy. The optimal policy is then recovered greedily: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
The Bellman optimality equation characterises $Q^*$:
$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$
On observing a transition $(s, a, r, s')$, Q-learning takes a stochastic-approximation step toward the Bellman target:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \!\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
The bracketed term is the temporal-difference (TD) error: the difference between the one-step bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate $Q(s, a)$. In tabular settings (finite states and actions), Q-learning provably converges to $Q^*$ provided every state-action pair is visited infinitely often and the learning rate $\alpha_t \in (0, 1)$ decays so that $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
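A minimal sketch of this update on a tabular $Q$ (the array shapes and hyperparameter values are illustrative assumptions, not prescribed by the original papers):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step for the transition (s, a, r, s_next).

    Q is a (num_states, num_actions) array; alpha and gamma are
    illustrative choices.
    """
    target = r + gamma * np.max(Q[s_next])  # one-step bootstrapped target
    td_error = target - Q[s, a]             # the bracketed TD error
    Q[s, a] += alpha * td_error
    return td_error
```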
Off-policy: the update uses $\max_{a'} Q(s', a')$ regardless of which action the agent actually takes next, so Q-learning can learn from arbitrary exploration policies (including random, $\varepsilon$-greedy, or replayed past experience).
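To make the off-policy property concrete, here is a sketch of a training loop with an $\varepsilon$-greedy behaviour policy; the simplified `env` interface (`reset()` returning a state, `step()` returning `(s_next, r, done)`) is an assumption for illustration:

```python
import numpy as np

def train(env, num_states, num_actions, episodes=500, epsilon=0.1):
    """Learns greedy-policy values while *behaving* epsilon-greedily."""
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: explore with probability epsilon.
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy: the update bootstraps from max_a' Q(s', a'),
            # not from the action the behaviour policy takes next.
            q_update(Q, s, a, r, s_next)
            s = s_next
    return Q
```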
Deep Q-Network (DQN) (Mnih et al. 2013, 2015) extends Q-learning to high-dimensional state spaces using a deep network $Q_\theta(s, a)$ and adds two stabilising tricks:
Experience replay: store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$ and sample mini-batches from it for training. This breaks the temporal correlations between consecutive samples and lets each transition be reused.
Target network: maintain a slowly updated copy $Q_{\theta^-}$ of the online network and use it for the bootstrap target:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\bigl(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\bigr)^2\right]$$
The target network is synced with the online network periodically (a hard copy every fixed number of steps) or continuously via Polyak averaging. Without it, the bootstrap target shifts with every gradient step, which can destabilise training or cause divergence.
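A sketch of this loss with uniform replay sampling and a hard target-network sync, in PyTorch; the toy network sizes, buffer capacity, and optimiser settings are assumptions for illustration:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Replay buffer D; assumes transitions are stored as tensor tuples
# (s: float, a: long, r: float, s_next: float, done: float).
buffer = deque(maxlen=100_000)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q_theta
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q_theta^-
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_step(batch_size=32):
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():  # the bootstrap target comes from the frozen copy
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C steps, hard-sync (or Polyak-average) the target network:
# target_net.load_state_dict(q_net.state_dict())
```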
DQN variants:
- Double DQN (van Hasselt et al. 2016): use the online network to select the greedy action and the target network to evaluate it, reducing the overestimation (maximisation) bias; see the sketch after this list.
- Dueling DQN (Wang et al. 2016): factorise $Q(s, a) = V(s) + A(s, a)$ into separate value and advantage streams (with the mean advantage subtracted for identifiability).
- Prioritised experience replay (Schaul et al. 2016): sample transitions with probability proportional to the magnitude of their TD error.
- Rainbow (Hessel et al. 2018): combines six DQN improvements (double Q-learning, prioritised replay, dueling networks, multi-step targets, distributional RL, and noisy nets).
- R2D2 (Kapturowski et al. 2019): adds recurrence (an LSTM) and large-scale distributed training.
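As a concrete example of the Double DQN change, a sketch of its target computation (tensor names follow the DQN sketch above; the function signature is hypothetical):

```python
import torch

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    """Vanilla DQN selects and evaluates a' with the target network;
    Double DQN selects with the online net, evaluates with the target one."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: online net
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: target net
        return r + gamma * (1 - done) * q_eval
```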
Q-learning is the foundation of value-based deep RL and remains one of the most widely used algorithms in domains with discrete action spaces.
Related terms: Reinforcement Learning, DQN, Temporal-Difference Learning, Bellman Equation, Volodymyr Mnih
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI