Q-learning (Watkins 1989) is a model-free, off-policy reinforcement-learning algorithm. It estimates the optimal action-value function
$$Q^*(s, a) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a, \pi^*\right],$$
the expected discounted return from taking action $a$ in state $s$ and thereafter following the optimal policy. The optimal policy is then recovered greedily: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
The Bellman optimality equation characterises $Q^*$:
$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$
On observing a transition $(s, a, r, s')$, Q-learning takes a stochastic-approximation step toward the Bellman target:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \!\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
The bracketed term is the temporal-difference (TD) error: the difference between the one-step bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate $Q(s, a)$. In tabular settings (finite states and actions), Q-learning provably converges to $Q^*$ provided every state-action pair is visited infinitely often and the learning rate $\alpha_t \in (0, 1)$ decays so that $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
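A minimal sketch of this update on a tabular $Q$ (the array shapes and hyperparameter values are illustrative assumptions, not prescribed by the original papers):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step for the transition (s, a, r, s_next).

    Q is a (num_states, num_actions) array; alpha and gamma are
    illustrative choices.
    """
    target = r + gamma * np.max(Q[s_next])  # one-step bootstrapped target
    td_error = target - Q[s, a]             # the bracketed TD error
    Q[s, a] += alpha * td_error
    return td_error
```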
Off-policy: the update uses $\max_{a'} Q(s', a')$ regardless of which action the agent actually takes next, so Q-learning can learn from arbitrary exploration policies (including random, $\varepsilon$-greedy, or replayed past experience).
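To make the off-policy property concrete, here is a sketch of a training loop with an $\varepsilon$-greedy behaviour policy; the simplified `env` interface (`reset()` returning a state, `step()` returning `(s_next, r, done)`) is an assumption for illustration:

```python
import numpy as np

def train(env, num_states, num_actions, episodes=500, epsilon=0.1):
    """Learns greedy-policy values while *behaving* epsilon-greedily."""
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: explore with probability epsilon.
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy: the update bootstraps from max_a' Q(s', a'),
            # not from the action the behaviour policy takes next.
            q_update(Q, s, a, r, s_next)
            s = s_next
    return Q
```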
Deep Q-Network (DQN) (Mnih et al. 2013, 2015) extends Q-learning to high-dimensional state spaces using a deep network $Q_\theta(s, a)$ and adds two stabilising tricks:
Experience replay: store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$ and sample mini-batches from it for training. This breaks the temporal correlations between consecutive samples and lets each transition be reused.
Target network: maintain a slowly updated copy $Q_{\theta^-}$ of the online network and use it for the bootstrap target:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\bigl(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\bigr)^2\right]$$
The target network is synced with the online network periodically (a hard copy every fixed number of steps) or continuously via Polyak averaging. Without it, the bootstrap target shifts with every gradient step, which can destabilise training or cause divergence.
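A sketch of this loss with uniform replay sampling and a hard target-network sync, in PyTorch; the toy network sizes, buffer capacity, and optimiser settings are assumptions for illustration:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Replay buffer D; assumes transitions are stored as tensor tuples
# (s: float, a: long, r: float, s_next: float, done: float).
buffer = deque(maxlen=100_000)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q_theta
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q_theta^-
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_step(batch_size=32):
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():  # the bootstrap target comes from the frozen copy
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C steps, hard-sync (or Polyak-average) the target network:
# target_net.load_state_dict(q_net.state_dict())
```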
DQN variants:
- Double DQN (van Hasselt et al. 2016): use the online network to select the greedy action and the target network to evaluate it, reducing the overestimation (maximisation) bias; see the sketch after this list.
- Dueling DQN (Wang et al. 2016): factorise $Q(s, a) = V(s) + A(s, a)$ into separate value and advantage streams (with the mean advantage subtracted for identifiability).
- Prioritised experience replay (Schaul et al. 2016): sample transitions with probability proportional to the magnitude of their TD error.
- Rainbow (Hessel et al. 2018): combines six DQN improvements (double Q-learning, prioritised replay, dueling networks, multi-step targets, distributional RL, and noisy nets).
- R2D2 (Kapturowski et al. 2019): adds recurrence (an LSTM) and large-scale distributed training.
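As a concrete example of the Double DQN change, a sketch of its target computation (tensor names follow the DQN sketch above; the function signature is hypothetical):

```python
import torch

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    """Vanilla DQN selects and evaluates a' with the target network;
    Double DQN selects with the online net, evaluates with the target one."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: online net
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: target net
        return r + gamma * (1 - done) * q_eval
```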
Q-learning is the foundation of value-based deep RL and remains one of the most widely used algorithms in domains with discrete action spaces.
Related terms: Reinforcement Learning, DQN, Temporal-Difference Learning, Bellman Equation, Volodymyr Mnih
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI