RL · DeepMind × UCL · 2021

Reinforcement Learning Lecture Series

with Hado van Hasselt, Diana Borsa, Matteo Hessel

Official course page →

Your progress in this browser


Progress is stored in this browser only — there is no account, no login, and no database. Clearing your browser data will reset it.

About the course

The DeepMind × UCL series, taught in 2021 by Hado van Hasselt and colleagues, is the successor to David Silver's classic 2015 UCL course. It is the cleanest currently available presentation of reinforcement learning, going from the mathematical foundations (MDPs, dynamic programming, the Bellman equations) through model-free control (Q-learning, SARSA, policy gradients) to deep RL methods (DQN, A3C, AlphaGo-style search). The lecturers were all part of the research group that produced those results, so the historical commentary is first-hand.

The series assumes a working knowledge of probability and the basics of deep learning; read our probability and neural-networks chapters first if you need them. It pairs naturally with our modern-AI chapter, where we discuss RLHF (reinforcement learning from human feedback), the most visible application of these ideas to large language models in 2024–2025.

Watch the lectures

Open the full playlist on YouTube →

Syllabus

Tick lectures as you finish them. Your ticks live in this browser only.

  1. Hado van Hasselt

    What RL is — agents, environments, rewards, policies. The interaction-loop view. How RL differs from supervised learning.

  2. Hado van Hasselt

    Bandits, $\varepsilon$-greedy, UCB, Thompson sampling. The regret-bound view. (An $\varepsilon$-greedy sketch appears after the syllabus.)

  3. Diana Borsa

    MDPs, the Bellman equations for the value and action-value functions. Stationary policies.

  4. Diana Borsa

    Policy evaluation, policy improvement, policy iteration, value iteration. Convergence guarantees. (A value-iteration sketch appears after the syllabus.)

  5. Hado van Hasselt

    Monte Carlo and TD($\lambda$) prediction. Why bootstrapping works.

  6. Hado van Hasselt

    On-policy (SARSA) vs off-policy (Q-learning) control. The interplay with exploration. (Both update rules are sketched after the syllabus.)

  7. Hado van Hasselt

    Linear function approximation, the deadly triad (bootstrapping + off-policy + approximation), what goes wrong.

  8. Hado van Hasselt

    The policy-gradient theorem, REINFORCE, baselines, advantage estimation. (A REINFORCE sketch appears after the syllabus.)

  9. Matteo Hessel

    DQN, experience replay, target networks. Stability tricks that made deep value-based RL work.

  10. Matteo Hessel

    Tree search, Monte Carlo tree search. The architecture used in AlphaGo and AlphaZero.

  11. Hado van Hasselt

    Per-decision importance sampling. The high-variance problem and weighted importance sampling.

  12. Hado van Hasselt

    Learning a model of the dynamics. Dyna, value-gradient methods, MuZero.
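
A few minimal Python sketches of syllabus topics follow. They are not from the course materials, and every environment, reward value, and hyperparameter in them is invented for illustration.

First, the exploration trade-off from lecture 2: an $\varepsilon$-greedy agent keeps a sample-average value estimate per arm and picks a random arm with probability $\varepsilon$. The three arm means and the Gaussian rewards are hypothetical.

```python
# epsilon-greedy on a 3-armed Gaussian bandit (illustrative; arm means are made up)
import numpy as np

def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    q = np.zeros(n_arms)       # sample-average estimate of each arm's value
    counts = np.zeros(n_arms)  # pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))   # explore: uniformly random arm
        else:
            a = int(np.argmax(q))           # exploit: greedy arm
        r = rng.normal(true_means[a], 1.0)  # reward ~ N(mean_a, 1)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]      # incremental sample-average update
    return q

print(run_epsilon_greedy([0.1, 0.5, 0.9]))  # estimates approach the true means
```

Swapping the greedy/random choice for an upper-confidence-bound score is a small change and connects to the regret bounds discussed in the lecture.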
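
Lecture 4's dynamic-programming algorithms are easiest to see on a tiny tabular MDP. The two-state example below is made up; value iteration simply applies the Bellman optimality backup until the values stop changing.

```python
# value iteration on a made-up two-state, two-action MDP
import numpy as np

# P[s][a] = list of (probability, next_state, reward) transitions
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

V = np.zeros(len(P))
while True:
    # Bellman optimality backup: V(s) <- max_a E[r + gamma * V(s')]
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in sorted(P)
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # the fixed point of the Bellman optimality operator
```

Because the backup is a $\gamma$-contraction, the loop is guaranteed to converge to the optimal values, which is the convergence result the lecture works through.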
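
Lecture 6's on-policy vs off-policy distinction comes down to which action value the one-step target bootstraps from. The two update functions below make that difference explicit; the table size and the state and action indices are arbitrary.

```python
# one-step SARSA (on-policy) vs Q-learning (off-policy) updates on a tabular Q
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap from the action the behaviour policy actually takes next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap from the greedy action, whatever the behaviour policy does
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

Q = np.zeros((5, 2))  # toy table: 5 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0])
```

The `np.max` in the Q-learning target is exactly why it is off-policy: it evaluates the greedy policy even while the agent behaves, say, $\varepsilon$-greedily.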
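
For lecture 8, the policy-gradient idea fits in a few lines when the policy is a softmax over a handful of actions and each episode is a single step. The arm means, learning rates, and running-average baseline below are all invented for the example.

```python
# REINFORCE with a running-average baseline on a softmax policy (one-step episodes)
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical expected rewards per action
theta = np.zeros(3)                      # one policy parameter per action
baseline, alpha = 0.0, 0.1

for _ in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()                 # softmax policy: pi(a) = exp(theta_a) / sum
    a = rng.choice(3, p=probs)
    r = rng.normal(true_means[a], 1.0)   # one-step episode, so the return is just r
    grad_log_pi = -probs                 # grad of log pi(a): one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi  # REINFORCE ascent step
    baseline += 0.05 * (r - baseline)    # baseline reduces variance, not bias

print(np.round(probs, 3))  # probability mass should shift toward the best action
```

Subtracting the baseline does not change the expected gradient, because it multiplies $\nabla_{\boldsymbol{\theta}} \log \pi$, which has zero mean under the policy; that is why baselines can be used freely to cut variance.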

Self-assessment

A short multiple-choice quiz. Click an option to commit your answer; the correct answer and an explanation then appear. Your answers are remembered in this browser.

  1. An agent's discounted return from time $t$ is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. The discount factor $\gamma$ being $< 1$ ensures:

  2. The Bellman optimality equation for the action-value function $Q^*$ is:

  3. Q-learning is off-policy because:

  4. The deadly triad in RL is the combination of:

  5. DQN combines Q-learning with a deep network. Two stability tricks that make this work are:

  6. The policy-gradient theorem says that the gradient of the expected return with respect to the policy parameters $\boldsymbol{\theta}$ is:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).