1.5 The history of AI in seven waves

It is tempting, when first encountering a field, to want a tidy founding myth: a date, a person, a single thread of progress that runs from then to now. The history of artificial intelligence resists that wish. It is not a single arc but a sequence of distinguishable phases, each with its own intellectual centre of gravity, characteristic problems, favoured techniques and showcase systems, and its own reasons for ending. The phases overlap. People migrate between them, sometimes carrying their old assumptions with them, sometimes shedding them in mid-career. Whole research programmes are declared dead and then quietly resurrected ten years later by students who have either forgotten or chosen to ignore the previous generation's reasons for abandoning them.

Fundamental ideas decay slowly. Lisp, invented by John McCarthy in 1958, is still in active use. The A* search algorithm, formulated by Hart, Nilsson and Raphael in 1968, is still the workhorse behind everything from satellite-navigation route planning to videogame pathfinding. The backpropagation algorithm, popularised by Rumelhart, Hinton and Williams in 1986, is the basis of every deep neural network you will encounter in this book. Good ideas accumulate; the dates below mark when each idea entered the toolkit, not when it left.

The field has seven phases proper, preceded by a long pre-history.

Phase 0: pre-history (until 1943)

The intellectual ingredients of artificial intelligence assembled themselves over the long nineteenth and early twentieth centuries from three quite different lineages. None of them was, at the time, thought of as a precursor to AI. They happen to converge in 1943.

The first lineage is mathematical. It begins with George Boole's Laws of Thought (1854), which proposed that the operations of human reasoning could be captured in an algebra of true and false propositions. Twenty-five years later Gottlob Frege's Begriffsschrift (1879) extended this programme to predicate logic with quantifiers. Georg Cantor's set theory, in the same period, gave mathematics a foundational language for talking about collections, functions and infinities. David Hilbert's formalist programme, launched at the turn of the twentieth century, asked whether the whole of mathematics could be reduced to the manipulation of meaningless symbols according to fixed rules. Kurt Gödel's incompleteness theorems of 1931 demonstrated that no such reduction could be complete. And in 1936 Alonzo Church and Alan Turing, working independently, gave precise mathematical characterisations of what it means for a function to be effectively computable. Turing's paper, "On Computable Numbers, with an Application to the Entscheidungsproblem", introduced what we now call the Turing machine, an idealised device with an infinite tape, a finite state set and a small set of rules, and showed that there are well-defined functions which no Turing machine can compute. The Church–Turing thesis, that effective computability and Turing-machine computability coincide, became the ground on which both computer science and AI later built. Whether intelligence itself falls within the reach of Turing-machine computation is, in many ways, the question of the field.

The second lineage is mechanical. Charles Babbage designed the Difference Engine in 1822 as a special-purpose calculator for polynomial functions, and in 1837 the more ambitious Analytical Engine, a programmable mechanical computer with separate memory and arithmetic units, controlled by punched cards adapted from the Jacquard loom. Babbage never completed either machine in his lifetime. Ada Lovelace, in her 1843 notes on Luigi Menabrea's account of the Analytical Engine, did three remarkable things. She wrote what is generally regarded as the first algorithm intended for execution by a machine, a procedure for computing Bernoulli numbers. She speculated that the Engine could in principle manipulate symbols other than numbers, musical notation, for instance, and so foresaw the idea of a general-purpose symbol-manipulating computer. And she registered the cautionary point, later named for her, that the Engine could only do what we knew how to order it to do; it would not originate anything.

The third lineage is biological. Across the nineteenth century, neuroanatomists and physiologists pieced together the basic story of how nervous systems compute. The action potential was identified as the fundamental signalling event of the neuron. Santiago Ramón y Cajal's painstaking microscopical drawings, made in the 1880s and 1890s, established the neuron doctrine: that the nervous system is composed of discrete cells with discrete connections. Charles Sherrington's Integrative Action of the Nervous System (1906) showed how the spinal cord and brain integrate graded inputs from many sources and produce coordinated outputs. By the late 1930s the rough outline of how a brain might compute, through individual cells summing weighted inputs and propagating thresholded outputs along axons, was a matter of scientific consensus.

The three strands meet in 1943 with the publication of "A Logical Calculus of the Ideas Immanent in Nervous Activity" by Warren McCulloch, a neurophysiologist, and Walter Pitts, a self-taught logician then in his late teens. Their paper does something remarkable. It takes the biological neuron, abstracts away its temporal dynamics, chemistry and modulatory effects, and reduces it to a mathematical idealisation: a unit which fires if and only if the weighted sum of its inputs exceeds a threshold. They then prove that networks of such idealised neurons can compute any function expressible in propositional logic. In a single move, they identify neurons with logic gates and the brain, by extension, with a computer. The McCulloch–Pitts neuron is openly a fiction, but it is the right kind of fiction, the kind that opens up a research programme. The next seventy years of artificial intelligence and computational neuroscience can be read as an extended elaboration on, and partial revolt against, that single equation.
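
The abstraction is compact enough to state in a few lines of code. The sketch below, in modern Python, shows a McCulloch–Pitts unit realising the AND, OR and NOT gates of propositional logic; the weights and thresholds are illustrative choices, not values taken from the 1943 paper.

```python
# A McCulloch-Pitts unit: fires (outputs 1) if and only if the weighted sum of its
# binary inputs reaches a threshold. Weights and thresholds here are illustrative.

def mp_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs meets the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Basic logic gates realised by single units:
AND = lambda a, b: mp_neuron([a, b], weights=[1, 1], threshold=2)
OR  = lambda a, b: mp_neuron([a, b], weights=[1, 1], threshold=1)
NOT = lambda a:    mp_neuron([a],    weights=[-1],   threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```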

Phase 1: cybernetics and symbol manipulation (1943–1956)

The thirteen years between the McCulloch–Pitts paper and the Dartmouth Workshop belong mostly to cybernetics, with what we now call symbolic AI emerging out of it. The name cybernetics, from the Greek kubernētēs, "steersman", was coined by Norbert Wiener for the study of feedback, control and information in animals, machines and societies. Wiener's Cybernetics: Or Control and Communication in the Animal and the Machine (1948) is the manifesto of the period. It argues that the same mathematical apparatus, feedback loops, transfer functions, Fourier analysis of signals, statistical theories of noise, describes the autopilot of a guided missile, the homeostatic regulation of body temperature and the goal-directed behaviour of a hunting animal.

In the same year Claude Shannon published A Mathematical Theory of Communication, founding information theory and giving the world the bit, channel capacity and the entropy formula. In a 1950 paper Shannon also sketched how a chess-playing program might work: the game tree, the evaluation function, the limited-depth search. He could not run the program (there was no computer yet powerful enough), but the design was correct, and the descendants of Shannon's chess sketch are still recognisable in modern game-playing systems.

Donald Hebb's The Organization of Behavior (1949) introduced what is now called the Hebb rule: neurons that fire together strengthen the connection between them. Alan Turing's 1950 paper, "Computing Machinery and Intelligence", which we discussed in §1.2, introduced what would later be called the Turing test. Marvin Minsky and his classmate Dean Edmonds, while graduate students, built the SNARC in 1951, the Stochastic Neural Analog Reinforcement Calculator, a forty-neuron analogue machine of vacuum tubes and motorised potentiometers that could learn to navigate a maze. Arthur Samuel, at IBM, began a long programme of work on a self-improving checkers-playing program in 1952; in 1962 it defeated the self-described checkers master Robert Nealey, though its later record against championship-level players was much weaker. Samuel's program, with its evaluation-function tuning by self-play, is the direct ancestor of TD-Gammon, of Deep Blue and ultimately of AlphaGo.
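
In its simplest modern form, the Hebb rule mentioned above is often written as a weight update proportional to the product of the two neurons' activities (a gloss in later notation, not an equation that appears in the 1949 book):

$$\Delta w_{ij} = \eta \, a_i \, a_j,$$

where $w_{ij}$ is the strength of the connection from neuron $i$ to neuron $j$, $a_i$ and $a_j$ are their activities, and $\eta$ is a small learning rate.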

The Macy Conferences (1946–1953) were the institutional spine of the cybernetic period. They brought Wiener, McCulloch, Pitts, John von Neumann, the neurophysiologist Arturo Rosenblueth, the anthropologists Margaret Mead and Gregory Bateson, and many others into a sustained interdisciplinary conversation. By the mid-1950s the centre of gravity was moving away from continuous control toward discrete symbol manipulation, and from interdisciplinary conferences to dedicated AI laboratories.

The pivotal event is the 1956 Dartmouth Summer Research Project on Artificial Intelligence, organised by John McCarthy, Marvin Minsky, Nathaniel Rochester (then at IBM) and Claude Shannon. The proposal McCarthy and his colleagues circulated to obtain Rockefeller Foundation funding contains the first appearance in print of the term artificial intelligence, and frames the agenda, language, abstraction, learning, self-improvement, randomness and creativity, in a single sentence: "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." Phase 2 is what happens when serious people start trying to cash that conjecture.

Phase 2: the symbolic era (1956–1973)

Dartmouth opened a fifteen-year period of optimism and rapid progress within what we would now call the symbolic paradigm. Allen Newell, Cliff Shaw and Herbert Simon, working at the RAND Corporation and at Carnegie Tech (later Carnegie Mellon), presented the Logic Theorist in 1956, a program that proved theorems from the Principia Mathematica of Whitehead and Russell, and in some cases produced more elegant proofs than the originals. Their General Problem Solver of 1957 generalised the Logic Theorist's strategy: a means-ends analysis that chose actions reducing the difference between current state and goal. The means-ends recipe will remain the spine of classical AI planning into the 1990s.

In 1958 John McCarthy, by then at MIT, invented Lisp. The language was formally described in his 1960 paper "Recursive Functions of Symbolic Expressions and Their Computation by Machine", which sets out a compact eval/apply structure: a uniform syntax in which programs and data have the same form. Lisp's suitability for symbolic manipulation made it the dominant AI programming language for thirty years. Also in 1958, Frank Rosenblatt at Cornell introduced the perceptron, a simple trainable model of the McCulloch–Pitts neuron, and built the Mark I Perceptron, a custom-wired hardware machine that could learn to recognise letters and shapes. The perceptron brought connectionism into the public eye, often in extravagant terms; the New York Times in 1958 reported that the Navy expected the perceptron to be "the embryo of an electronic computer that... will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
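
What made the perceptron more than a McCulloch–Pitts unit was its learning rule: when the output is wrong, nudge each weight in the direction that would have made it right. The sketch below is the textbook abstraction of that rule in modern Python, with illustrative hyperparameters; Rosenblatt's Mark I was analogue hardware and differed in many details.

```python
# A minimal sketch of the perceptron learning rule. It learns any linearly separable
# function, such as OR, but cannot learn XOR, the limitation Minsky and Papert would
# make famous in 1969.

def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of (inputs, label) pairs with label in {0, 1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                        # 0 if correct, +1 or -1 if wrong
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(or_data)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x, _ in or_data])
```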

Institutionally, the period sees the formation of the great American AI laboratories. Minsky and Seymour Papert established the MIT AI Lab. McCarthy moved to Stanford in 1962 and founded SAIL. The Defense Advanced Research Projects Agency (then ARPA), under J. C. R. Licklider, began the funding programme that supported the field for several decades.

By the late 1960s the field had its showcase systems. DENDRAL, developed at Stanford by Edward Feigenbaum, Joshua Lederberg and Bruce Buchanan from 1965 onward, used heuristic search through the space of possible molecular structures to interpret mass-spectrometry data, and is generally regarded as the first successful expert system. STUDENT, written in Lisp by Daniel Bobrow at MIT in 1964, solved high-school algebra word problems by translating English into systems of equations. ELIZA, written by Joseph Weizenbaum at MIT in 1966, simulated a Rogerian psychotherapist by simple keyword pattern-matching and reflective rephrasing; users, including Weizenbaum's secretary, were notoriously willing to pour out their troubles to a few hundred lines of pattern-substitution code. SHRDLU, the doctoral work of Terry Winograd at MIT in 1972, held fluent typed dialogue about a virtual world of coloured blocks: pick up the red block, put it on top of the green pyramid, why did you do that?, because you asked me to. MACSYMA, also at MIT, performed symbolic mathematics, algebraic manipulation, integration, simplification, at a level competitive with the best human algebraists, and is the forerunner of modern computer-algebra systems such as Mathematica and Maple.

The period also consolidates a set of methods that will remain useful for decades. A* search, formulated by Peter Hart, Nils Nilsson and Bertram Raphael in 1968, is a best-first graph-search algorithm which, given an admissible heuristic, is guaranteed to find shortest paths; it is the basis of route-planning systems and pathfinding in games to this day. John Alan Robinson's resolution principle (1965) provides a single inference rule that is sound and refutation-complete for first-order predicate logic, and underwrites both automated theorem proving and the logic-programming language Prolog that Alain Colmerauer would develop in 1972. STRIPS, the Stanford Research Institute Problem Solver of Richard Fikes and Nils Nilsson (1971), introduces the action-and-precondition representation that all classical AI planners will use into the 1990s.
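
Because A* will reappear throughout the book, a minimal sketch is worth giving here. The frontier is ordered by $f(n) = g(n) + h(n)$, the cost already paid plus an admissible (never over-estimating) heuristic estimate of the cost still to come; the toy graph and heuristic values below are illustrative only.

```python
# A minimal A* sketch on an explicit graph, ordering the frontier by f = g + h.
import heapq

def a_star(graph, h, start, goal):
    """graph: {node: [(neighbour, edge_cost), ...]}; h: admissible heuristic values."""
    frontier = [(h[start], 0.0, start, [start])]        # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbour, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = g2
                heapq.heappush(frontier, (g2 + h[neighbour], g2, neighbour, path + [neighbour]))
    return None, float("inf")

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}                    # never overestimates the true cost
print(a_star(graph, h, "A", "D"))                       # (['A', 'B', 'C', 'D'], 4.0)
```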

The phase ends, more or less, between 1969 and 1973. Three blows fall in close succession. First, Marvin Minsky and Seymour Papert publish Perceptrons in 1969, a careful mathematical analysis showing that single-layer perceptrons cannot represent functions that are not linearly separable (the classic example is the XOR function), and arguing that multi-layer extensions, while in principle more powerful, lack effective training algorithms. The book's effect on the connectionist programme is severe and, as we now know, premature: backpropagation will solve the training problem for multi-layer networks within fifteen years. Second, the optimistic predictions of the 1956–1965 period, translation-quality machine translation within five years, general-purpose problem solvers, machines reasoning at the level of human professionals, have not materialised. Speech recognition has stalled. Machine translation has produced famous embarrassments. The ALPAC report of 1966 had already cut US machine-translation funding sharply. Third, in 1973 the Lighthill Report, commissioned by the British Science Research Council and authored by the applied mathematician Sir James Lighthill, concludes that general-purpose AI has failed to live up to its promises and recommends substantial cuts in UK research funding. The report singles out the "combinatorial explosion", the exponential blow-up of search trees, as a fundamental obstacle. ARPA, on the US side, narrows its AI funding sharply in the same period, increasingly insisting on mission-relevant deliverables. Funding dries up, doctoral programmes shrink, and several of the big labs lose key personnel. The first AI winter has begun.

Phase 3: knowledge-based systems (1973–1987)

The lesson the field took away from the failures of the late 1960s was that intelligent behaviour requires a great deal of domain-specific knowledge, and that a small amount of careful knowledge engineering, applied to a narrow domain, will outperform a large amount of general-purpose reasoning. Generality had been the enemy. The 1970s and early 1980s are, in consequence, the era of the expert system: a program that captures the rules a human expert uses to make decisions in a circumscribed area, and applies them by some form of guided search.

The defining showcase system is MYCIN, developed by Edward Shortliffe at Stanford as his doctoral thesis and published in 1976. MYCIN diagnosed bacterial infections of the bloodstream, recommended antibiotics and dosages, and explained its reasoning in English. It worked from a base of about 600 production rules of the form "if these conditions hold, then conclude with this confidence". On its evaluation set, MYCIN's recommendations were rated as comparable to those of expert infectious-disease physicians by an independent panel. MYCIN was never deployed clinically, partly for liability reasons, but its architecture, a clean separation between an inference engine, a knowledge base of rules and an explanation facility, became the template for the entire generation of expert systems that followed, and was later codified in development tools such as OPS5 and CLIPS.

Other systems followed. R1, later renamed XCON, was developed by John McDermott at Carnegie Mellon and deployed at Digital Equipment Corporation in 1980. Its job was to take customer orders for VAX minicomputers and configure them, choosing power supplies, cabinets, cables, boards, into a coherent shipped system. By 1986 XCON had over 6000 rules, was processing about 80 000 orders a year, and was estimated to be saving DEC of the order of $25 million annually. It was the existence proof that expert systems could produce real, measurable industrial value, not merely interesting research. Hundreds of expert-system projects followed: PROSPECTOR for mineral exploration, CADUCEUS for internal medicine, DENDRAL's descendants for chemistry. Knowledge engineering, the discipline of eliciting expert knowledge from human practitioners and encoding it as rules, became a recognised speciality with its own conferences and textbooks.

The hardware of the period reflected its software. Lisp machines, workstations whose hardware architecture was specialised for the tagged-pointer arithmetic and garbage collection of Lisp, were produced by Symbolics and Lisp Machines Incorporated and sold for tens of thousands of dollars to AI research labs and corporate development teams. The machines were beautiful pieces of engineering: high-resolution bitmap displays, integrated development environments, structure editors for code, the whole tower of an interactive Lisp environment realised in dedicated hardware. Japan, watching the American expert-systems boom with interest, launched its Fifth Generation Computer Systems Project in 1982, a $400 million ten-year programme, coordinated by the Ministry of International Trade and Industry, aimed at building parallel logic-programming machines that would run Prolog at hardware speed and, the project's planners hoped, leapfrog American computing entirely.

Two limits accumulated through the mid-1980s. The first was that expert systems do not learn. Each rule has to be elicited from a human expert, transcribed, debugged and maintained by hand. Rules contradict each other; the resolution of conflicts becomes its own engineering problem. The world the rules describe drifts, drug names change, regulations are updated, biology refines its categories, and the rule base becomes brittle in ways that are not always easy to detect. The second was that expert systems handle exceptions poorly. Medicine in particular is a long tail of unusual cases, and a rule base that performs well on the textbook eighty per cent of presentations falls over on the long tail of atypical ones, sometimes silently and dangerously. The deeper problem was philosophical as much as technical: hand-coded rules require somebody to know, in advance, what knowledge is relevant. Real expertise turns out not to take that form.

By the mid-1980s the limits were visible, and by 1987 the second AI winter had begun. The Lisp-machine market, which had been driven entirely by expert-system development, collapsed almost overnight as ordinary personal workstations, Sun Microsystems machines running Common Lisp, then increasingly C, became competitive at a fraction of the cost. Symbolics filed for Chapter 11 in 1993; LMI was gone by 1987. The Japanese Fifth Generation project quietly underperformed, producing impressive hardware that no one outside the project particularly wanted. AI funding contracted. Several of the second wave of AI laboratories closed or shrank. The field became, for the second time in twenty years, a quiet and somewhat embarrassed corner of computer science.

Phase 4: the connectionist revival and statistical turn (1987–2011)

While the symbolic side faded, two related strands rose in parallel through the late 1980s and into the 2000s. The first was the connectionist revival. The second was the statistical turn. Together they would, by 2011, have rebuilt AI on quite different foundations from the rule-based systems that preceded them.

The connectionist revival traces back to 1986, the year Parallel Distributed Processing was published by David Rumelhart, James McClelland and the PDP Research Group. The two-volume work brought the backpropagation algorithm, already independently derived by Seppo Linnainmaa in 1970, by Paul Werbos in 1974, and by David Parker in 1985, to a wide audience and showed that multi-layer neural networks could in principle be trained on arbitrary differentiable losses. The XOR problem that had appeared insurmountable in 1969 dissolved once a few hidden units could be trained. The convolutional neural network of Yann LeCun and his collaborators, LeNet in 1989, the more refined LeNet-5 in 1998, demonstrated that weight sharing and local receptive fields could produce a network architecture suited to image recognition; LeNet-5 read postal codes for the US Postal Service and amounts on bank cheques for many years. The recurrent networks of Sepp Hochreiter and Jürgen Schmidhuber, the long short-term memory or LSTM, introduced in 1997, solved the vanishing-gradient problem that had prevented vanilla recurrent networks from learning long-range dependencies, and would dominate sequence modelling for the next twenty years.
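
A minimal sketch makes the point concrete. The numpy network below, with four hidden units, is trained by backpropagation on the four XOR cases; the hyperparameters and random seed are illustrative, and convergence on XOR is typical with these settings but not guaranteed for every initialisation.

```python
# Backpropagation solving XOR with one small hidden layer (modern numpy notation).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))      # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))      # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(20000):
    h = sigmoid(X @ W1 + b1)                   # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)        # backward pass: output-layer error signal
    d_h = (d_out @ W2.T) * h * (1 - h)         # error signal propagated to the hidden layer
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0, keepdims=True)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2).ravel())                # close to [0, 1, 1, 0] once training succeeds
```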

The statistical turn ran in parallel. The general direction was a move from hand-coded rules to learned probabilistic models. Judea Pearl's Probabilistic Reasoning in Intelligent Systems (1988) systematised the use of Bayesian networks for uncertain reasoning, and provided the algorithmic machinery, variable elimination, junction trees, belief propagation, that made them tractable. Hidden Markov models, surveyed by Lawrence Rabiner in his influential 1989 tutorial, became standard for speech recognition through the 1990s and into the 2000s. Support vector machines, introduced by Corinna Cortes and Vladimir Vapnik in 1995, gave a clean theoretical framework for binary classification with margin guarantees; the kernel methods that grew out of SVM theory dominated small-data classification for fifteen years. Random forests, introduced by Leo Breiman in 2001, and gradient boosting, formalised by Jerome Friedman in the same year, provided ensemble methods that remain competitive on tabular data into the present day; they are the methods you should reach for first when you have a structured dataset and want a strong baseline. The reinforcement-learning textbook of Richard Sutton and Andrew Barto (1998) consolidated the temporal-difference and policy-gradient frameworks that AlphaGo and its descendants would later draw on.
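
The "strong baseline" advice is easy to act on. The sketch below uses scikit-learn (one library choice among several) on a synthetic structured dataset; in practice you would substitute your own feature table.

```python
# A quick tabular baseline: random forest and gradient boosting, cross-validated.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)

for model in (RandomForestClassifier(n_estimators=300, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)          # 5-fold cross-validated accuracy
    print(type(model).__name__, round(scores.mean(), 3))
```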

The phase produced several visible public milestones. Deep Blue, the chess machine built at IBM by Feng-hsiung Hsu, Murray Campbell and Joseph Hoane Jr, lost a six-game match against the reigning world champion Garry Kasparov in 1996 and won the rematch 3½–2½ in 1997. Deep Blue combined specialised hardware (evaluating about 200 million chess positions per second), a hand-tuned evaluation function and a search algorithm with quiescence and selective extensions. TD-Gammon, written by Gerald Tesauro at IBM between 1992 and 1995, achieved world-class backgammon play through pure self-play reinforcement learning. The DARPA Grand Challenge of 2005 saw Stanford's autonomous vehicle Stanley, built by a team led by Sebastian Thrun, complete a 132-mile desert course in under seven hours, the first autonomous vehicle to do so; the 2007 Urban Challenge extended this to mock urban driving. IBM Watson, in 2011, won the Jeopardy! championship against the human champions Ken Jennings and Brad Rutter, combining large-scale information retrieval, statistical natural-language processing and decision-theoretic answer scoring.

By 2011 AI had matured into a respectable, well-funded academic and industrial discipline. The expert-system winter was a fading memory. Statistical and probabilistic methods were standard. Neural networks had returned, in modest form, in research labs working on speech and handwriting. AI was, in retrospect, on the eve of the largest single methodological transition in its history; almost no one in 2011 saw it coming with any clarity.

Phase 5: the deep learning revolution (2012–2017)

The transition was precipitated by three developments that converged within roughly five years.

The first was data. ImageNet, assembled between 2007 and 2009 by Fei-Fei Li and her collaborators at Princeton (and later Stanford), is a dataset of over 14 million labelled images across roughly 22 000 categories, structured according to the WordNet noun hierarchy. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), launched in 2010, focused on a 1000-category subset. ImageNet provided, for the first time, a labelled image dataset large enough to train a very deep model without it immediately overfitting.

The second was compute. NVIDIA's introduction of the CUDA programming model in 2007 made graphics-processing units, originally built to render triangles for video games, programmable for general numerical computation. The dense matrix multiplications that dominate the inner loop of neural-network training turn out to be exactly what GPUs are good at: an order of magnitude faster than the same computation on a CPU.

The third was the result. The AlexNet team at the University of Toronto, Alex Krizhevsky, Ilya Sutskever and their advisor Geoffrey Hinton, trained an eight-layer convolutional network on two NVIDIA GTX 580 GPUs over six days, using a small set of training tricks that were not new individually but had not been combined at this scale: rectified linear unit (ReLU) activations, dropout, aggressive data augmentation, GPU parallelism. AlexNet won ILSVRC 2012 with a top-5 error of 15.3 per cent, ten percentage points below the second-place entry, which used hand-engineered features and a support vector machine. The paper, "ImageNet Classification with Deep Convolutional Neural Networks" (Krizhevsky, Sutskever and Hinton 2012), is by citation count one of the most influential computer-science papers of the twenty-first century. By the 2014 challenge, every serious entry was a deep network. The hand-engineered-features pipeline, which had been the state of the art in computer vision for fifteen years, was simply gone.
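
The individual ingredients are easy to show in modern code. The sketch below, in PyTorch (a later framework; the 2012 implementation was custom GPU code), illustrates the software tricks named above, ReLU activations, dropout and random-crop-and-flip data augmentation, on a deliberately tiny network rather than the original eight-layer architecture.

```python
# Illustrative AlexNet-era ingredients in modern PyTorch: augmentation, ReLU, dropout.
import torch.nn as nn
from torchvision import transforms

# Aggressive data augmentation: random crops and horizontal flips of each training image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# ReLU non-linearities after each convolution, dropout before the classifier head.
tiny_convnet = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1000),                       # 1000 ImageNet classes
)
```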

The years that followed saw deep learning sweep visual recognition, speech recognition, machine translation and game playing in turn. VGG (Karen Simonyan and Andrew Zisserman 2014) showed that stacking many small convolutional layers was sufficient to reach state-of-the-art accuracy. GoogLeNet (Christian Szegedy and colleagues 2014) introduced the Inception module, a way of letting the network choose its own filter sizes. ResNet (Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun 2015) introduced residual connections, skip-paths around layers, that made it practical to train networks hundreds of layers deep, and brought ImageNet top-5 error below the commonly quoted estimate of human error on the task.

In parallel, natural-language processing saw its own transformation. Word2vec (Tomáš Mikolov and colleagues at Google 2013) demonstrated that semantic structure could be captured in dense vector embeddings learnt from raw text, that the analogy "king is to man as queen is to woman" could be solved by simple vector arithmetic. GloVe (Jeffrey Pennington, Richard Socher and Christopher Manning 2014) gave a competitive alternative based on global co-occurrence statistics. Sequence-to-sequence learning with attention, Sutskever, Vinyals and Le 2014, then Bahdanau, Cho and Bengio 2015, made neural machine translation competitive with the previous statistical state of the art and rapidly surpassed it; by 2016 production translation systems at Google were neural. Generative adversarial networks (Ian Goodfellow and colleagues 2014) introduced the framework of pitting a generator against a discriminator and produced increasingly photorealistic synthetic images. AlphaGo (David Silver and the DeepMind team 2016), combining a Monte Carlo tree search with deep convolutional policy and value networks trained on human games and self-play, defeated Lee Sedol 4–1 in March 2016, a result the field had thought ten years away. AlphaGo Zero (DeepMind 2017), trained tabula rasa from self-play with only the rules of Go, surpassed the original within three days and went on to defeat the more advanced AlphaGo Master within forty days.
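
The arithmetic itself is nothing more than vector addition and a nearest-neighbour lookup. The sketch below uses a handful of tiny made-up vectors chosen so that the relationship holds; real word2vec embeddings have hundreds of dimensions learnt from billions of words of text.

```python
# "king - man + woman" lands nearest to "queen" (toy, hand-made vectors for illustration).
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.0, 0.2]),
}

query = emb["king"] - emb["man"] + emb["woman"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest neighbour of the query vector, excluding the words used to form it.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)                                    # queen
```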

In June 2017 Ashish Vaswani and seven colleagues, split between Google Brain, Google Research and the University of Toronto, published "Attention Is All You Need", introducing the Transformer architecture: a sequence model built entirely on multi-head self-attention, with no recurrence, no convolution, and a parallelism profile that made it possible to train at a scale that LSTMs simply could not match. The paper was published as a regular conference contribution; it would become the hinge on which the whole next phase turned.
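
The operation at the architecture's core is compact. In the sketch below, each position's output is a weighted average of all the value vectors, with weights given by a softmax over scaled dot-products of queries and keys, $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V$; multi-head attention runs several such maps in parallel over learned projections, which is omitted here.

```python
# A minimal numpy sketch of scaled dot-product attention (single head, no projections).
import numpy as np

def attention(Q, K, V):
    """Q, K: (sequence_length, d_k); V: (sequence_length, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted average of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))    # a toy sequence of five tokens
print(attention(Q, K, V).shape)                          # (5, 8)
```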

Phase 6: the era of foundation models (2018–2023)

The Transformer architecture turned out to scale, in a way that no one expected. BERT, Bidirectional Encoder Representations from Transformers, by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova at Google in 2018, pre-trained a Transformer encoder on masked language modelling over a large unlabelled text corpus, then fine-tuned it for a wide range of downstream NLP tasks. BERT set a new state of the art across a broad sweep of standard NLP benchmarks and made the pre-train-then-fine-tune recipe the new default of the field. GPT, Generative Pre-trained Transformer, by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever at OpenAI in 2018, and GPT-2 (Radford, Wu, Child, Luan, Amodei and Sutskever 2019) took the alternative path: a left-to-right Transformer decoder trained on next-token prediction, scaled aggressively. GPT-3 (Tom Brown and many colleagues at OpenAI 2020), at 175 billion parameters, generalised this further: prompted with a handful of examples, it performed translation, summarisation, question answering and arithmetic at levels competitive with task-specific fine-tuned models. The phenomenon that came to be called in-context learning, the model's ability to acquire transient task-specific behaviour from prompt examples without any update of its weights, was recognised as a qualitatively new capability and arguably the surprise of the decade.

The Stanford Center for Research on Foundation Models, in a 2021 report with Rishi Bommasani as its lead author, coined the term foundation model for pre-trained models that can be adapted to a wide variety of downstream tasks. The label stuck. The years that followed saw foundation models appear in every modality. CLIP (Alec Radford and colleagues at OpenAI 2021) trained a joint embedding of images and the text that describes them, giving a model that could classify images zero-shot from text labels alone. DALL·E (Aditya Ramesh and colleagues 2021) and Stable Diffusion (Robin Rombach and colleagues 2022) brought text-to-image generation to a level of quality that triggered enormous public interest and equally enormous controversy in the visual arts. Whisper (Alec Radford and colleagues 2022) achieved robust speech recognition across 99 languages from a single multilingual encoder-decoder. AlphaFold 2 (John Jumper and the DeepMind team 2021) solved a fifty-year-old problem in computational biology by predicting protein structures from amino-acid sequence with near-experimental accuracy. Codex and GitHub Copilot (OpenAI and GitHub 2021) brought neural code completion into the daily workflow of software engineers.

The qualitative public turn came on 30 November 2022, when OpenAI released ChatGPT as a free research preview. ChatGPT was, in technical terms, GPT-3.5 fine-tuned with reinforcement learning from human feedback (RLHF, formalised by Long Ouyang and colleagues 2022) and wrapped in a chat interface. None of the underlying machinery was new. But the interface, a chatbox in front of a very large language model fine-tuned to be helpful, honest and harmless, converted capabilities that had been latent in pre-trained models into a tool that ordinary users could wield directly. ChatGPT reached one million users in five days and an estimated one hundred million within two months, the fastest consumer-product adoption recorded up to that point (a record overtaken when Meta's Threads, launched in July 2023, reached 100 million sign-ups in five days). The political, economic, regulatory, educational and labour-market consequences of that change are still working themselves out.

The technical infrastructure that made this phase possible deserves explicit mention. The number of floating-point operations used to train a frontier model rose from approximately $10^{21}$ FLOPs (GPT-2, 2019) to approximately $10^{23}$ (GPT-3, 2020) to approximately $10^{25}$ (GPT-4, 2023) to over $10^{26}$ (Gemini Ultra and GPT-4 Turbo, late 2023), five orders of magnitude in five years. The scaling laws of Jared Kaplan and colleagues (2020) and Jordan Hoffmann and colleagues (2022, the "Chinchilla" paper) provided an empirical relationship between compute, data, parameters and final loss that the leading laboratories used to plan training runs at the scale of $10^{25}$ FLOPs. Specialised hardware, NVIDIA H100 and B200 GPUs, Google TPU v5 and v6, Amazon Web Services Trainium, and clusters of tens of thousands of accelerators connected by InfiniBand or NVLink-NVSwitch fabrics became the substrate of the field. The cost of a single training run for a frontier model rose into the hundreds of millions of dollars. We will return to the implications in §1.6.
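
A rough sanity check on these figures is possible with the rule of thumb, popularised by the scaling-law papers, that training a dense Transformer costs approximately six floating-point operations per parameter per training token:

$$C \;\approx\; 6\,N\,D \;\approx\; 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \;\approx\; 3 \times 10^{23}\ \text{FLOPs}$$

for GPT-3's roughly 175 billion parameters and roughly 300 billion training tokens, consistent with the figure of approximately $10^{23}$ quoted above. The rule is an approximation, not an exact accounting, and it ignores the substantial compute spent on preliminary and failed runs.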

Phase 7: reasoning, agents, and the frontier (2023–present)

The most recent phase of AI is still unfolding as this textbook is being written, which makes it harder to characterise from inside it; the picture sketched here will look partial in ten years' time. With that caveat, three threads stand out.

The first is reasoning. GPT-4 (OpenAI, March 2023), Claude 3 (Anthropic, March 2024) and Gemini 1.5 (Google DeepMind, February 2024) each represented a significant step beyond the GPT-3 generation in tasks requiring multi-step inference, mathematical reasoning and code synthesis. The more striking development, however, was the explicit separation of fast and slow thinking, an architectural and training choice that allowed models to spend variable amounts of compute on a problem depending on its difficulty. OpenAI's o1 (preview September 2024, full release December 2024) and o3 (announced December 2024) were trained to perform extensive chain-of-thought reasoning in a hidden scratchpad before producing a final answer, with longer reasoning traces buying accuracy at the price of additional compute. DeepSeek-R1 (released by the Chinese laboratory DeepSeek in December 2024 and January 2025) demonstrated that comparable reasoning capabilities could emerge from large-scale reinforcement learning applied to a base language model with a verifiable reward, typically correctness of mathematical or code outputs, without supervised reasoning data, and DeepSeek released the weights openly. The relationship between language modelling, reasoning and reinforcement learning, which had seemed cleanly separated as recently as 2022, is being rewritten in real time.

The second thread is agency. Tools that allow a language model to call functions, browse the web, execute code, query databases, edit files and interact with operating systems turn the model from an answer-emitter into a system component that can take actions over multiple steps. Anthropic's Claude introduced "computer use" in late 2024, allowing the model to take screenshots and manipulate a graphical desktop directly. Devin, announced by Cognition Labs in 2024, framed autonomous software engineering as the target task; open-source alternatives followed within weeks. On SWE-Bench, a benchmark that evaluates a system's ability to resolve real GitHub issues in open-source Python repositories, early-2024 systems scored under 5 per cent; by early 2026 the leading systems score over 85 per cent (Claude Opus 4.7 at 87.6 per cent, GPT-5.5 at roughly 83 per cent). Long-horizon task completion, taking a high-level user goal, decomposing it into subtasks, executing those subtasks across hours or days, recovering from errors, asking for clarification when needed, remains the central engineering problem of the agentic era and the central source of its current limitations.
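
The basic loop behind such systems is simple to sketch, even though making it reliable is not. The code below is deliberately vendor-neutral: the model interface (call_model) and the single registered tool are hypothetical placeholders rather than any particular provider's API, and a real deployment would add sandboxing, planning, memory and permission checks.

```python
# A minimal, vendor-neutral sketch of an agentic tool-use loop (all names hypothetical).
import json
import subprocess

def run_python(code: str) -> str:
    """A stand-in 'execute code' tool; real agents sandbox this carefully."""
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=10)
    return result.stdout + result.stderr

TOOLS = {"run_python": run_python}

def call_model(history):
    """Hypothetical placeholder for a language-model call. It should return either
    {"tool": name, "arguments": {...}} to act, or {"answer": text} to finish."""
    raise NotImplementedError("plug in a model client here")

def run_agent(task, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(history)
        if "answer" in decision:                  # the model decides it is done
            return decision["answer"]
        observation = TOOLS[decision["tool"]](**decision["arguments"])  # act
        history.append({"role": "assistant", "content": json.dumps(decision)})
        history.append({"role": "tool", "content": json.dumps({"result": observation})})
    return "step limit reached"
```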

The third thread is integration with the rest of science. AI is no longer an external discipline that is occasionally applied to scientific problems; it has been absorbed into the methodological toolkit of several scientific fields. AlphaFold 3 (Josh Abramson and colleagues 2024) extends the protein-structure programme to protein–ligand and protein–nucleic-acid complexes, putting structure prediction at the level of a routine laboratory technique. GNoME (Amil Merchant and colleagues at Google DeepMind 2023), using graph neural networks, catalogues 2.2 million previously unknown candidate crystal structures, of which roughly 380,000 are predicted stable: a roughly nine-fold expansion of the previously known stable inorganic-crystal catalogue. GraphCast (Remi Lam and colleagues 2023) and Pangu-Weather (Kaifeng Bi and colleagues 2023) match or exceed traditional numerical weather-prediction models at a small fraction of the compute cost; the world's leading meteorological agencies have integrated these models into operational forecasts. RFDiffusion (Joseph Watson and colleagues 2023) designs novel proteins from scratch with desired binding properties. The interaction between AI and biology, chemistry, materials science and meteorology has, by the mid-2020s, become reciprocal rather than one-directional.

What distinguishes this phase from its predecessors is partly its scale and partly its character. The frontier laboratories, OpenAI, Anthropic, Google DeepMind, Meta AI and xAI in the United States; DeepSeek, Alibaba, Baidu, Zhipu, Moonshot and Tencent in China; Mistral and Aleph Alpha in Europe, are now consuming compute and capital at a rate that places them in a different economic regime from anything academic AI research has known. A single training run can cost more than a major university department's annual budget. The frontier and the long tail of AI research have, in consequence, separated. Most of the visible capability advances now come from a small number of well-capitalised industrial laboratories; the academic community has reorganised itself around the questions, mostly methodological and theoretical, that can be addressed at smaller scale. §1.7 and Chapter 17 take up the implications of the split.

What you should take away

  1. AI history is seven phases, not one arc. Cybernetics (1943–1956), the symbolic era (1956–1973), knowledge-based systems (1973–1987), the connectionist revival and statistical turn (1987–2011), deep learning (2012–2017), foundation models (2018–2023), and the present era of reasoning, agents and frontier integration with science (2023–). Each phase has its own intellectual centre of gravity, characteristic systems and reasons for ending; no single thread runs continuously from then to now.

  2. The phases overlap, and good ideas almost never die. Lisp (1958), A* (1968), backpropagation (1986), convolutional networks (1989) and the Transformer (2017) are all still in active use. A historical map of AI is also a map of the working toolkit; the parts that look like archaeology often turn out to be load-bearing.

  3. There have been two AI winters, and both followed periods of overpromise. The 1973 winter followed the optimistic predictions of the 1956–1965 symbolic era; the 1987 winter followed the expert-systems boom and the collapse of the Lisp-machine market. The pattern, exuberant projection, partial delivery, retrenchment, is worth keeping in view as the present phase continues.

  4. The 2012 deep-learning transition was driven by data, compute and a few training tricks, not by a fundamentally new theoretical idea. ImageNet, CUDA-programmable GPUs and AlexNet's combination of ReLUs, dropout and aggressive data augmentation produced a discontinuous capability jump on top of an algorithm, backpropagation through convolutional layers, that had been known for over twenty years. The shape of that transition is a useful prior on future ones.

  5. The frontier and the long tail have separated. Since approximately 2020, the most visible capability advances have come from a small number of well-capitalised industrial laboratories. Academic AI continues to do essential work on foundations, theory, applications, robustness and societal questions, but it is no longer the locus of frontier capability development. Chapters 16 and 17 take up what this separation means for the practising engineer, the researcher, the student and the society in which AI is increasingly embedded.
