RL · DeepMind × UCL · 2021
Reinforcement Learning Lecture Series
with Hado van Hasselt, Diana Borsa, Matteo Hessel
Your progress in this browser
Lectures · 0 / 12 watched
Quiz · 0 / 6 correct
Progress is stored in this browser only — there is no account, no login, and no database. Clearing your browser data will reset it.
About the course
The DeepMind × UCL series, taught in 2021 by Hado van Hasselt and colleagues, is the successor to David Silver's classic 2015 UCL course. It is the cleanest currently available presentation of reinforcement learning, going from the mathematical foundations (MDPs, dynamic programming, the Bellman equations) through model-free control (Q-learning, SARSA, policy gradients) to deep RL methods (DQN, A3C, AlphaGo-style search). The lecturers were all part of the research group that produced those results, so the historical commentary is first-hand.
The series assumes a working knowledge of probability and the basics of deep learning — read our probability and neural-networks chapters first. It pairs naturally with our modern-AI chapter, where we discuss RLHF, which is the most visible application of these ideas to large language models in 2024–2025.
Watch the lectures
Syllabus
Tick lectures as you finish them. Your ticks live in this browser only.
Lecture 1 · Hado van Hasselt
What RL is — agents, environments, rewards, policies. The interaction-loop view. How RL differs from supervised learning.
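To make the interaction loop concrete, here is a minimal sketch of the observe, act, receive-reward cycle. The `ToyEnv` corridor and `RandomAgent` below are purely illustrative, not code from the lecture.

```python
import random

class ToyEnv:
    """A five-cell corridor: the agent starts at cell 0 and is rewarded for reaching cell 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                # action 0 = move left, 1 = move right
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

class RandomAgent:
    def act(self, observation):
        return random.choice([0, 1])       # ignores the observation and acts uniformly

# The interaction loop: the agent observes, acts, and receives a reward, repeatedly.
env, agent = ToyEnv(), RandomAgent()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(obs)
    obs, reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```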
Lecture 2 · Hado van Hasselt
Bandits, $\varepsilon$-greedy, UCB, Thompson sampling. The regret bound view.
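As a taste of the simplest of these algorithms, here is a minimal $\varepsilon$-greedy sketch on a hypothetical 10-armed Gaussian bandit (the setup and constants are illustrative, not from the lecture):

```python
import numpy as np

# A hypothetical 10-armed Gaussian bandit with epsilon-greedy action selection.
rng = np.random.default_rng(0)
true_means = rng.normal(size=10)           # unknown mean reward of each arm
q = np.zeros(10)                           # incremental sample-average value estimates
n = np.zeros(10)                           # number of times each arm has been pulled
epsilon = 0.1

for t in range(10_000):
    if rng.random() < epsilon:
        a = int(rng.integers(10))          # explore: a uniformly random arm
    else:
        a = int(np.argmax(q))              # exploit: the arm with the highest current estimate
    r = rng.normal(true_means[a])          # noisy reward from the chosen arm
    n[a] += 1
    q[a] += (r - q[a]) / n[a]              # incremental update of the sample average

print(int(np.argmax(true_means)), int(np.argmax(n)))   # best arm vs most-pulled arm
```

UCB and Thompson sampling replace the random exploration branch with exploration driven by uncertainty in the value estimates.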
Lecture 3 · Diana Borsa
MDPs, the Bellman equations for the value and action-value functions. Stationary policies.
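For reference, the two Bellman expectation equations the lecture derives, in standard (Sutton & Barto) notation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr]$$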
Lecture 4 · Diana Borsa
Policy evaluation, policy improvement, policy iteration, value iteration. Convergence guarantees.
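A minimal value-iteration sketch on a small synthetic MDP; the arrays `P` and `R` and the stopping threshold are illustrative assumptions, not the lecture's example:

```python
import numpy as np

# A small synthetic MDP: P[s, a, s'] is a transition probability, R[s, a] an expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalise each (s, a) row into a distribution
R = rng.random((n_states, n_actions))

v = np.zeros(n_states)
while True:
    # One sweep of the Bellman optimality backup:
    # v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) v(s') ]
    q = R + gamma * (P @ v)                # action values, shape (n_states, n_actions)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:   # stop when the backup barely changes anything
        break
    v = v_new

print(np.round(v, 3), q.argmax(axis=1))    # converged values and the greedy policy they induce
```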
Lecture 5 · Hado van Hasselt
Monte Carlo and TD($\lambda$) prediction. Why bootstrapping works.
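A tabular TD(0) prediction sketch on the classic five-state random walk (an illustrative setup, not the lecture's code), showing the bootstrapped update:

```python
import numpy as np

# TD(0) prediction for a fixed (uniformly random) policy on a five-state random walk:
# non-terminal states 1..5, terminal states 0 and 6, reward 1 only when exiting on the right.
rng = np.random.default_rng(0)
alpha, gamma = 0.1, 1.0
v = np.zeros(7)                            # value estimates, terminals included (and left at 0)

for episode in range(5_000):
    s = 3                                  # every episode starts in the middle state
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])   # the policy being evaluated: step left or right uniformly
        r = 1.0 if s_next == 6 else 0.0
        # TD(0): move v(s) toward the bootstrapped target r + gamma * v(s_next).
        v[s] += alpha * (r + gamma * v[s_next] - v[s])
        s = s_next

print(np.round(v[1:-1], 2))                # the true values are 1/6, 2/6, 3/6, 4/6, 5/6
```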
Lecture 6 · Hado van Hasselt
On-policy (SARSA) vs off-policy (Q-learning) control. The interplay with exploration.
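The contrast is easiest to see in the two update rules themselves. In this sketch, `Q` is any tabular action-value store (nested lists, a dict of lists, or a NumPy array); the exploration policy that picks actions is assumed to live elsewhere:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap on the best action in s_next, regardless of what the behaviour does."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap on a_next, the action the behaviour policy actually takes in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```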
Lecture 7 · Hado van Hasselt
Linear function approximation, the deadly triad (bootstrapping + off-policy + approximation), what goes wrong.
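For reference, the semi-gradient TD(0) update with a parametric value function $\hat v(s, \mathbf{w})$, in standard notation; it is a "semi"-gradient method because the bootstrapped target is not differentiated through, which is one reason the three ingredients can interact badly:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\bigl[R_{t+1} + \gamma\,\hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})\bigr]\,\nabla_{\mathbf{w}}\hat v(S_t, \mathbf{w})$$

In the linear case $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, so the gradient is simply the feature vector $\mathbf{x}(S_t)$.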
Lecture 8 · Hado van Hasselt
The policy-gradient theorem, REINFORCE, baselines, advantage estimation.
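A minimal REINFORCE-with-baseline sketch on a two-armed bandit with a softmax policy, a deliberately tiny setting chosen only to make the $\nabla_{\boldsymbol{\theta}} \log \pi$ update visible (not the lecture's example):

```python
import numpy as np

# REINFORCE with a baseline on a two-armed bandit (arm 1 pays more on average).
rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])
theta = np.zeros(2)                        # policy parameters: softmax action preferences
baseline, alpha, beta = 0.0, 0.1, 0.01

for t in range(5_000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                         # softmax policy pi(a) from preferences theta
    a = int(rng.choice(2, p=pi))
    r = rng.normal(true_means[a])          # sample a reward from the chosen arm

    grad_log_pi = -pi                      # for a softmax, grad log pi(a) = one_hot(a) - pi
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi   # score-function (REINFORCE) update
    baseline += beta * (r - baseline)                # running-average baseline

print(np.round(pi, 3))                     # the probability mass should concentrate on arm 1
```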
Lecture 9 · Matteo Hessel
DQN, experience replay, target networks. Stability tricks that made deep value-based RL work.
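A structural sketch of those two tricks, with a plain NumPy table standing in for the deep network: transitions go into a replay buffer and are sampled uniformly, and bootstrapping uses a periodically synced target copy. Everything here (the stand-in environment, the constants) is illustrative:

```python
import random
import numpy as np

n_states, n_actions = 10, 2
gamma, alpha, epsilon = 0.99, 0.1, 0.1
Q = np.zeros((n_states, n_actions))        # the "online network" (a table stands in for it)
Q_target = Q.copy()                        # the "target network": a frozen copy used in the bootstrap
replay = []                                # experience replay buffer
rng = np.random.default_rng(0)

def fake_step(state, action):
    """Stand-in environment: a random next state, reward 1 for landing in state 0."""
    next_state = int(rng.integers(n_states))
    return next_state, float(next_state == 0)

s = 0
for t in range(5_000):
    # epsilon-greedy behaviour with respect to the online estimates
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = fake_step(s, a)
    replay.append((s, a, r, s_next))       # store the transition; sampling later breaks correlations
    s = s_next

    if len(replay) >= 32:
        for bs, ba, br, bs_next in random.sample(replay, 32):   # uniform minibatch
            target = br + gamma * Q_target[bs_next].max()       # bootstrap on the *target* copy
            Q[bs, ba] += alpha * (target - Q[bs, ba])

    if t % 500 == 0:
        Q_target = Q.copy()                # periodic target-network sync
```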
Lecture 10 · Matteo Hessel
Tree search, Monte Carlo tree search. The architecture used in AlphaGo and AlphaZero.
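For reference, the UCB-style selection rule (UCT) used while descending the tree in MCTS; AlphaGo and AlphaZero use a PUCT variant that additionally weights a learned prior $P(s, a)$ from the policy network:

$$a^{*} = \arg\max_a \left[\, Q(s, a) + c\,\sqrt{\frac{\ln N(s)}{N(s, a)}}\,\right]$$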
Lecture 11 · Hado van Hasselt
Per-decision importance sampling. The high-variance problem and weighted importance sampling.
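For reference, in standard notation: ordinary importance sampling corrects the whole return $G_t$ with the full ratio $\rho_{t:T-1} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) / b(A_k \mid S_k)$, whereas the per-decision form corrects each reward only by the ratios up to that reward:

$$\tilde G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, \rho_{t:k}\, R_{k+1}, \qquad \rho_{t:k} = \prod_{j=t}^{k} \frac{\pi(A_j \mid S_j)}{b(A_j \mid S_j)}$$

Weighted importance sampling instead normalises by the sum of the ratios, estimating $V(s)$ as $\sum_i \rho^{(i)} G^{(i)} \,/\, \sum_i \rho^{(i)}$ over episodes $i$, accepting a small bias in exchange for much lower variance.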
Lecture 12 · Hado van Hasselt
Learning a model of the dynamics. Dyna, value-gradient methods, MuZero.
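A compact Dyna-Q sketch showing the three interleaved steps (direct RL, model learning, planning from the learned model); the toy cyclic environment and the constants are illustrative:

```python
import random
import numpy as np

n_states, n_actions = 10, 2
gamma, alpha, epsilon, n_planning = 0.95, 0.1, 0.1, 10
Q = np.zeros((n_states, n_actions))
model = {}                                  # (s, a) -> (r, s_next): a memorised deterministic model
rng = np.random.default_rng(0)

def env_step(state, action):
    """Stand-in deterministic environment: a cycle of states, reward for reaching state 0."""
    next_state = (state + (1 if action == 1 else -1)) % n_states
    return float(next_state == 0), next_state

s = 0
for t in range(2_000):
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
    r, s_next = env_step(s, a)

    # (1) Direct RL: one real Q-learning update from the real transition.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # (2) Model learning: remember what this state-action pair did.
    model[(s, a)] = (r, s_next)
    # (3) Planning: replay simulated transitions drawn from the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])

    s = s_next
```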
Self-assessment
A short multiple-choice quiz. Click an option to commit; the correct answer and an explanation appear. Your answers are remembered in this browser.
Question 1. An agent's discounted return from time $t$ is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. The discount factor $\gamma$ being $< 1$ ensures:
Question 2. The Bellman optimality equation for the action-value function $Q^*$ is:
Question 3. Q-learning is off-policy because:
Question 4. The deadly triad in RL is the combination of:
Question 5. DQN combines Q-learning with a deep network. Two stability tricks that make this work are:
Question 6. The policy-gradient theorem says that the gradient of the expected return with respect to the policy parameters $\boldsymbol{\theta}$ is:
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).