Glossary

Temporal-Difference Learning

Temporal-difference (TD) learning, formalised by Richard Sutton in his 1984 PhD thesis and 1988 paper, is a family of reinforcement-learning methods that update a value estimate by the difference between successive predictions. The simplest version, TD(0), updates V(s) ← V(s) + α [r + γ V(s′) − V(s)] after observing a transition s → s′ with reward r and discount factor γ. The bracketed quantity is the TD error, the discrepancy between the new prediction r + γ V(s′) and the old prediction V(s).
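
As a concrete illustration, here is a minimal sketch of tabular TD(0) prediction in Python. The env and policy interface (reset() and step(action) returning the next state, reward and a done flag) is assumed purely for illustration and is not taken from any of the sources cited here.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
        """Tabular TD(0): estimate V under a fixed policy from sampled transitions."""
        V = defaultdict(float)                    # value estimates, initialised to 0
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)     # hypothetical one-step interface
                target = r + (0.0 if done else gamma * V[s_next])
                td_error = target - V[s]          # new prediction minus old prediction
                V[s] += alpha * td_error          # V(s) <- V(s) + alpha * TD error
                s = s_next
        return V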

TD learning combines ideas from Monte Carlo methods (which wait until the end of an episode and update towards the actual return) and dynamic programming (which bootstraps from current estimates but requires a known model of the environment). TD(λ) interpolates between the one-step and Monte Carlo extremes via eligibility traces, with λ controlling how far back in time credit is assigned. The algorithm has solid convergence guarantees in tabular settings, and a much harder analysis under the "deadly triad" of function approximation, off-policy learning and bootstrapping in the function-approximation regime that modern deep RL inhabits.
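
To make the role of λ concrete, the sketch below shows tabular TD(λ) with accumulating eligibility traces, reusing the same hypothetical env and policy interface as the TD(0) sketch above; it is an illustrative sketch, not the definitive algorithm from the works cited here.

    from collections import defaultdict

    def td_lambda_prediction(env, policy, num_episodes=1000,
                             alpha=0.1, gamma=0.99, lam=0.9):
        """Tabular TD(lambda) with accumulating eligibility traces."""
        V = defaultdict(float)
        for _ in range(num_episodes):
            e = defaultdict(float)                # eligibility trace e(s) per state
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                delta = (r + (0.0 if done else gamma * V[s_next])) - V[s]  # TD error
                e[s] += 1.0                       # mark the current state as eligible
                for state in list(e):             # credit delta to recently visited states
                    V[state] += alpha * delta * e[state]
                    e[state] *= gamma * lam       # decay each trace by gamma * lambda
                s = s_next
        return V

Setting lam = 0 recovers the one-step TD(0) update, while lam = 1 with accumulating traces approaches the Monte Carlo update.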

TD learning underlies Q-learning (Watkins, 1989), SARSA, TD-Gammon (Tesauro, 1992) and the DQN of Mnih et al. (2013), and through them virtually every modern deep-RL algorithm. Sutton's 2024 Turing Award, shared with Andrew Barto, recognised this foundational work on TD learning and reinforcement learning more broadly. Arthur Samuel's 1959 checkers program already contained recognisably TD-style updates; Sutton's contribution was to extract, formalise and analyse the underlying algorithmic core.


Related terms: Reinforcement Learning, Q-Learning, Richard Sutton, Andrew Barto, Gerald Tesauro

