Deep Q-Network (DQN) combines Q-learning with deep neural networks. Mnih et al. (2013, 2015 Nature) demonstrated human-level Atari play from raw pixels. The two key engineering tricks that stabilised training:
Experience replay: store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. Sample mini-batches uniformly for training. Breaks temporal correlation between consecutive samples and reuses each transition many times.
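A minimal replay-buffer sketch (plain Python/NumPy; the capacity and batch size are illustrative choices, not values from the papers):

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive steps
        # and lets each stored transition be reused in many updates.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```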
Target network: maintain a slowly-updated copy $Q_{\theta^-}$ of the Q-network, used for the bootstrap target:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\bigl(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\bigr)^2\right]$$
The target network is synced to the online network periodically (every $C$ steps) or via Polyak averaging $\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$. Without it, the bootstrap target moves with every gradient step and training can diverge.
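A sketch of the loss above and of both synchronisation schemes, assuming PyTorch networks `q_net` (online) and `target_net` with identical parameter shapes and a batch of float/long tensors; the names and hyperparameter values are illustrative:

```python
import torch
import torch.nn.functional as F


def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: r + gamma * max_a' Q_target(s', a') vs Q_online(s, a)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradient flows through the bootstrap target
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return F.mse_loss(q_sa, target)


def hard_update(target_net, q_net):
    """Periodic sync: copy the online weights into the target every C steps."""
    target_net.load_state_dict(q_net.state_dict())


def polyak_update(target_net, q_net, tau=0.005):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    for tp, p in zip(target_net.parameters(), q_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```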
Architecture (Nature 2015 Atari DQN): 3 conv layers + 2 fully-connected layers, $\sim 1.7$M parameters. Input: 4 stacked greyscale frames at $84 \times 84$. Output: $|\mathcal{A}|$ Q-values, one per action.
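A PyTorch sketch of that architecture (layer sizes follow the Nature 2015 description; treat it as illustrative rather than a faithful reimplementation):

```python
import torch
import torch.nn as nn


class AtariDQN(nn.Module):
    """Input: 4 stacked 84x84 greyscale frames. Output: one Q-value per action."""

    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))  # scale pixels to [0, 1]


q = AtariDQN(num_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```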
Variants:
- Double DQN (van Hasselt 2016): use the online network for action selection and the target network for value estimation (see the sketch after this list).
- Dueling DQN (Wang 2016): split the network into value and advantage streams, $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$ (the mean advantage is subtracted for identifiability).
- Prioritised replay (Schaul 2016): sample transitions with probability proportional to their absolute TD error, using importance-sampling weights to correct the resulting bias.
- Rainbow (Hessel 2018): combines six DQN improvements.
- R2D2 (Kapturowski 2019): adds recurrence (an LSTM core) and large-scale distributed training.
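As a concrete example of the Double DQN change listed above, a hedged sketch of how the bootstrap target swaps action selection and evaluation between the two networks (function and variable names are illustrative):

```python
import torch


def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # The online network chooses the greedy next action ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, reducing max-operator overestimation.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```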
DQN is the foundation of value-based deep RL and remains widely used in domains with discrete action spaces.
Related terms: Q-Learning, Reinforcement Learning, Temporal-Difference Learning, Volodymyr Mnih
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI