Deep Q-Network (DQN) combines Q-learning with deep neural networks. Mnih et al. (2013, 2015 Nature) demonstrated human-level Atari play from raw pixels. The two key engineering tricks that stabilised training:
Experience replay: store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. Sample mini-batches uniformly for training. Breaks temporal correlation between consecutive samples and reuses each transition many times.
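A minimal replay-buffer sketch (plain Python/NumPy; the capacity and batch size are illustrative choices, not values from the papers):

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive steps
        # and lets each stored transition be reused in many updates.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```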
Target network: maintain a slowly-updated copy $Q_{\theta^-}$ of the Q-network, used for the bootstrap target:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\bigl(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\bigr)^2\right]$$
The target network is synced to the online network periodically (every $C$ steps) or via Polyak averaging $\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$. Without it, the bootstrap target moves with every gradient step and training can diverge.
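A sketch of the loss above and of both synchronisation schemes, assuming PyTorch networks `q_net` (online) and `target_net` with identical parameter shapes and a batch of float/long tensors; the names and hyperparameter values are illustrative:

```python
import torch
import torch.nn.functional as F


def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: r + gamma * max_a' Q_target(s', a') vs Q_online(s, a)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradient flows through the bootstrap target
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return F.mse_loss(q_sa, target)


def hard_update(target_net, q_net):
    """Periodic sync: copy the online weights into the target every C steps."""
    target_net.load_state_dict(q_net.state_dict())


def polyak_update(target_net, q_net, tau=0.005):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    for tp, p in zip(target_net.parameters(), q_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```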
Architecture (Nature 2015 Atari DQN): 3 conv layers + 2 fully-connected layers, $\sim 1.7$M parameters. Input: 4 stacked greyscale frames at $84 \times 84$. Output: $|\mathcal{A}|$ Q-values, one per action.
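A PyTorch sketch of that architecture (layer sizes follow the Nature 2015 description; treat it as illustrative rather than a faithful reimplementation):

```python
import torch
import torch.nn as nn


class AtariDQN(nn.Module):
    """Input: 4 stacked 84x84 greyscale frames. Output: one Q-value per action."""

    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))  # scale pixels to [0, 1]


q = AtariDQN(num_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```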
Variants:
- Double DQN (van Hasselt 2016): use the online network for action selection and the target network for value estimation (see the sketch after this list).
- Dueling DQN (Wang 2016): split the network into value and advantage streams, $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$ (the mean advantage is subtracted for identifiability).
- Prioritised replay (Schaul 2016): sample transitions with probability proportional to their absolute TD error, using importance-sampling weights to correct the resulting bias.
- Rainbow (Hessel 2018): combines six DQN improvements.
- R2D2 (Kapturowski 2019): adds recurrence (an LSTM core) and large-scale distributed training.
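As a concrete example of the Double DQN change listed above, a hedged sketch of how the bootstrap target swaps action selection and evaluation between the two networks (function and variable names are illustrative):

```python
import torch


def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # The online network chooses the greedy next action ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, reducing max-operator overestimation.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```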
DQN is the foundation of value-based deep RL and remains widely used in domains with discrete action spaces.
Related terms: Q-Learning, Reinforcement Learning, Temporal-Difference Learning, Volodymyr Mnih
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI