Glossary

TD-Gammon

TD-Gammon, developed by Gerald Tesauro at IBM Research between 1989 and 1995, was a backgammon-playing program that learned by playing approximately 1.5 million games against itself, using temporal-difference learning to train a feed-forward neural-network value function. It was the first system to demonstrate that self-play reinforcement learning could reach world-class performance in a complex domain, and stands as the direct intellectual ancestor of AlphaGo, AlphaZero, and the modern era of deep reinforcement learning.

Architecture and learning rule

TD-Gammon used a multi-layer perceptron with a single hidden layer of 80 units (later versions had up to 160). The input was a hand-engineered representation of the board: 198 features encoding the number of checkers on each of the 24 points for each player, plus encodings of the bar, the bear-off tray, and whose turn it was. The output was a four-dimensional vector estimating the probability of each possible game outcome (a regular win and a gammon win for each side), from which an expected value for the position could be computed.
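
The point-by-point part of this encoding is usually described as a truncated-unary scheme: for each point and each player, three binary units fire for one, two, and three checkers, and a fourth unit grows linearly with any excess. Below is a minimal sketch of an encoder along those lines; the function names and exact normalisation constants are illustrative reconstructions, not Tesauro's original code.

```python
import numpy as np

def point_units(n):
    """Truncated-unary encoding of n checkers on one point: three binary
    units fire for >= 1, >= 2, >= 3 checkers, and a fourth unit grows
    linearly with any excess beyond three."""
    return [float(n >= 1), float(n >= 2), float(n >= 3),
            (n - 3) / 2.0 if n > 3 else 0.0]

def encode_position(points_w, points_b, bar_w, bar_b, off_w, off_b,
                    white_to_move):
    """Build a 198-feature TD-Gammon-style input vector.

    points_w / points_b: 24 ints giving each side's checkers per point.
    """
    feats = []
    for n in points_w:
        feats.extend(point_units(n))      # 24 points x 4 units = 96
    for n in points_b:
        feats.extend(point_units(n))      # another 96 for the opponent
    feats.append(bar_w / 2.0)             # checkers on the bar
    feats.append(bar_b / 2.0)
    feats.append(off_w / 15.0)            # checkers borne off
    feats.append(off_b / 15.0)
    feats.append(float(white_to_move))    # whose turn it is (two units)
    feats.append(1.0 - float(white_to_move))
    return np.array(feats)                # shape (198,)
```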

Learning followed the TD($\lambda$) update rule of Sutton (1988):

$$\Delta w_t = \alpha (V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w V_k.$$

At each move, the network's value estimate $V_t$ for the current position was nudged toward $V_{t+1}$, the value estimate after the move actually taken. The trace-decay parameter $\lambda$ controlled how far back along the trajectory credit was assigned. There was no explicit reward except at the end of the game, when the network received the actual outcome as its final target.
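
In practice the sum over past gradients is maintained incrementally as a per-weight eligibility trace, $e_t = \lambda e_{t-1} + \nabla_w V_t$, so each move costs no more than a single backward pass. The sketch below implements this incremental form for a single-hidden-layer network; it assumes a scalar sigmoid output for brevity (TD-Gammon's network had four), and the class and method names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDNet:
    """Single-hidden-layer value network trained with TD(lambda).
    A scalar output is used for brevity; TD-Gammon's had four."""

    def __init__(self, n_in=198, n_hidden=80, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight, cleared at the start of a game.
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        # Forward pass; activations are cached for the gradient below.
        self.x = np.asarray(x, dtype=float)
        self.h = sigmoid(self.W1 @ self.x)
        self.v = float(sigmoid(self.W2 @ self.h)[0])
        return self.v

    def td_step(self, v_next):
        # Gradient of the scalar output with respect to each weight matrix.
        dv = self.v * (1.0 - self.v)
        g2 = dv * self.h[None, :]
        g1 = (dv * self.W2[0] * self.h * (1.0 - self.h))[:, None] * self.x[None, :]
        # Decay the traces, fold in the current gradient, then move every
        # weight along its trace in proportion to the TD error.
        self.e1 = self.lam * self.e1 + g1
        self.e2 = self.lam * self.e2 + g2
        delta = v_next - self.v
        self.W1 += self.alpha * delta * self.e1
        self.W2 += self.alpha * delta * self.e2
```

At the end of a game, `td_step` is called with the actual outcome in place of $V_{t+1}$, which is the only point at which a real reward enters the update.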

Performance and historical influence

TD-Gammon 1.0 played at an intermediate level. TD-Gammon 2.1, with selective two-ply lookahead, reached the level of the strongest human players. TD-Gammon 3.0, with three-ply lookahead, was widely judged to be slightly superior to the world's top professionals. Crucially, it also changed backgammon theory: several of its preferred opening moves were unconventional but, after analysis, were recognised by experts as correct, leading to a re-evaluation of human opening play. Modern backgammon bots descended from TD-Gammon are now the standard reference for theoretical analysis of the game.

The system's success was, for many years, attributed in part to backgammon's stochasticity: the dice rolls injected exploration for free by mixing up positions enough that the network saw a wide variety of states. This led to scepticism that the same approach would work in deterministic games. The 2016 AlphaGo victory over Lee Sedol, and AlphaZero's subsequent mastery of chess and shogi, definitively settled the question. The original AlphaGo paper explicitly traces its lineage to TD-Gammon, and Tesauro's work is canonically cited as the first demonstration of the methodology that now drives much of game-playing AI.

Why it mattered

Beyond backgammon, TD-Gammon established three principles that have shaped deep RL:

  1. Self-play generates a curriculum: as the network improves, its opponent (itself) also improves, providing a steadily harder training signal (see the sketch after this list).
  2. Bootstrapping works at scale: the TD update bootstraps from the network's own future estimates, despite the absence of theoretical convergence guarantees for non-linear function approximators.
  3. Hand-crafted features can be augmented rather than replaced: TD-Gammon's input features were domain-engineered, but the value function itself was learned end-to-end.
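
As a hedged illustration of the first principle, the sketch below strings the pieces above into one self-play training game. The game environment (`initial_position`, `roll_dice`, `afterstates`, `is_over`, `outcome`) is hypothetical scaffolding invented for this example, and move selection is simply greedy over afterstate values, relying on the dice for exploration.

```python
import numpy as np

def evaluate(net, pos):
    # Stateless forward pass: scores a candidate position without
    # disturbing the gradient cache that net.value() keeps for TD updates.
    h = 1.0 / (1.0 + np.exp(-(net.W1 @ encode_position(*pos))))
    return float(1.0 / (1.0 + np.exp(-(net.W2 @ h))))

def self_play_game(net, game):
    """Play one training game against itself.

    `game` is a hypothetical environment; it is assumed to present
    afterstates from the side to move, so that a higher value is always
    better for the player choosing the move, and outcome() returns the
    final result on the same 0-to-1 scale as the network output.
    """
    net.reset_traces()
    pos = game.initial_position()
    net.value(encode_position(*pos))          # cache V_0 and its gradient
    while not game.is_over(pos):
        dice = game.roll_dice()
        # Greedy afterstate selection: the dice already inject
        # exploration, so no epsilon-greedy randomisation is needed.
        pos = max(game.afterstates(pos, dice),
                  key=lambda p: evaluate(net, p))
        if game.is_over(pos):
            net.td_step(game.outcome(pos))    # terminal target: true result
        else:
            net.td_step(evaluate(net, pos))   # nudge V_t toward V_{t+1}
            net.value(encode_position(*pos))  # re-cache at the new state
```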

Related terms: AlphaGo, Reinforcement Learning, Temporal-Difference Learning, Q-Learning
