1.2 The Turing test and its successors

The Turing test is the most famous philosophical proposal in artificial intelligence. Its claim is simple: if you cannot tell a machine apart from a human in conversation, you ought to grant the machine intelligence. The proposal has shaped seventy-five years of public discussion of what AI is for and what would constitute success. It has also been misunderstood almost as often as cited, and modern evaluation has moved a long way beyond it.

The test is worth understanding for two reasons. It remains the conceptual touchstone of the field; almost every popular discussion of AI eventually circles back to it. And the way it has been outgrown, what newer evaluations measure that Turing's did not, and why, is itself a useful lesson in how AI is assessed in practice.

§1.1 gave the rational-agent definition: a system that perceives its environment and acts to maximise expected success. That is a normative picture, what an agent ought to do, not how we should recognise one when we meet it. The Turing test takes the behavioural alternative: judge intelligence by what the system does in conversation, not by what it computes inside.

Turing's 1950 paper

Alan Turing (1912–1954) was, by 1950, already among the most consequential mathematicians of the century. His 1936 paper "On Computable Numbers" had defined the abstract device we now call a Turing machine and, with it, the formal limits of what any computing device could ever do. During the Second World War he led the Hut 8 team at Bletchley Park whose work on the German Naval Enigma cipher is generally credited with shortening the war by several years. After the war he worked at the National Physical Laboratory and then at the University of Manchester, contributing to one of the world's first stored-program computers, the Manchester Mark 1. The 1950 paper, published in the philosophy journal Mind under the title "Computing Machinery and Intelligence", was written from that Manchester position.

The paper opens: "I propose to consider the question, 'Can machines think?'". Turing then argues the question, as posed, is too ill-defined to be useful. The words "machine" and "think" carry too much philosophical baggage; any answer depends on definitions of consciousness, of mind, of intentionality, on which there is no agreement and on which there is unlikely soon to be any. Rather than wade into that swamp, Turing proposes to replace the original question with one that admits an experimental answer.

The replacement question is built around what he calls the imitation game. In its original form, the game involves three players: a man (A), a woman (B), and an interrogator (C) of either sex. C is in a separate room and communicates with A and B only by typed messages, so that voice and appearance cannot give the game away. C's task is to determine which of A and B is the man and which the woman. A is instructed to deceive C; B is instructed to help. After describing this parlour amusement, Turing asks his celebrated question: "What will happen when a machine takes the part of A in this game? Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, 'Can machines think?'"

Several features of the proposal are routinely missed. First, the interrogation is open-ended: nothing prevents C from asking arbitrary questions about poetry, arithmetic, sport, the weather, or what one had for breakfast. Second, the test is mediated by typed text precisely so that physical embodiment, accent, and appearance do not contaminate the judgement; the test is of the mind, not of the body. Third, Turing did not claim the test as a necessary condition for thought, only as a sufficient one for legitimately attributing thought; a being might think without being able to pass the test, just as a small child might think without being able to win at chess. Fourth, the test deliberately sidesteps the philosophical question of consciousness; it offers an operational substitute, an empirical procedure that one can either pass or fail, rather than a metaphysical decision about inner experience. Turing was perfectly aware that this manoeuvre would dissatisfy some philosophers, and he devoted much of the paper's second half to addressing nine objections from theology, mathematics, and informal common sense.

Turing's predictions were generous but not extravagant. He famously wrote: "I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted." He also predicted that within fifty years, programs running on machines with around 10⁹ bits of storage would play the imitation game well enough that an average interrogator would have no more than a 70 per cent chance of correctly identifying them after five minutes. On a strict reading the second prediction was reached in the 2010s by chatbots engineered to exploit the test's social dynamics, and decisively by general-purpose large language models in the 2020s. The first prediction, about ordinary linguistic usage, has also clearly arrived: people speak today, without any sense of strangeness, of a chatbot that "thinks" the answer is such-and-such or "knows" who wrote Hamlet.

A worked dialogue

To make the test concrete, consider the following dialogue with an unspecified interlocutor. Try to form a judgement about whether the respondent is human or machine.

C: Good afternoon. May I begin?

X: Of course. Please go ahead.

C: I wrote a sonnet that begins, "Shall I compare thee to a summer's day." Would you say a winter's day would do as well?

X: It would scan, but it would not do as well. The whole point of the conceit is that the beloved is more lovely and more temperate than something already lovely; comparing the beloved to a winter's day reverses the contrast and breaks the argument of the poem. A spring day might work, though Shakespeare's particular line cannot easily be rewritten without losing its rhythm.

C: Add the following numbers: 34957, 70764. Take your time.

X: 105721.

C: Are you sure? Most people make at least one carrying error on a sum that size.

X: I checked it twice. The units, tens, and hundreds columns sum to 11, 12, 17 respectively (with carries of 1, 1, 1 propagating left); the thousands column then sums to 4 + 0 + 1 = 5, and the ten-thousands column to 3 + 7 = 10. The total is 105721.

C: Do you play chess?

X: A little, badly. I know the rules. I would not last long against a club player.

C: Here is a position: white king at e8, white rook at h1, black king at e4. White to play. What is the best move?

X: Rook to h4 check looks immediately threatening, but it lets the black king flee to d5 or f5 with no follow-up. The mate-in-one is not available. I would play Rh1–e1 with check; the black king must give ground to one side of the e-file, the rook then holds that file as a barrier, and White brings his own king up. With careful technique White wins, but slowly.

This is exactly the kind of mixed-mode interrogation Turing had in mind. There is a literary judgement that requires reading comprehension and a sense of poetic argument; an arithmetic problem that, with its follow-up, probes for the kind of error a hurried human is likely to make; a chess problem that requires spatial reasoning and an accurate appraisal of the position rather than a glib reply; and conversational adjustments of register throughout, with X's modesty about chess and willingness to show working. Before reading on: human or machine?

In 2026 you cannot tell from the dialogue alone. A capable contemporary large language model, Claude, GPT-4, Gemini, would produce something not very different. So would a competent human chess player who happened to remember some Shakespeare. The signals that older textbooks taught students to look for, the arithmetic mistake a tired human would make and a calculator would not, the failure to recognise the literary genre, the inability to give a sensible verbal account of a chess position, have been substantially erased. Not because machines have become human-like in their underlying cognition, but for a much subtler reason: they have absorbed a sufficiently large fraction of the human textual corpus that they can now mimic the surface signs of literary, mathematical, and conversational competence. They have read what humans have written about sonnets, about how to add columns of figures with carries, about chess endgames; and they can produce text that exhibits the patterns of someone who knows those things.

This is itself a striking finding. A test that for fifty years served as a useful, if rough, behavioural marker of intelligence has been made unreliable not by a breakthrough in cognition but by a breakthrough in coverage: the field has assembled enough text and enough compute to fit a model that reproduces the textual fingerprints of a competent human. Whether this constitutes intelligence, or merely a very good imitation of it, depends on how seriously one takes the difference between mechanism and behaviour, a question §1.3 takes up directly.

The dialogue also illustrates why the test has lost its discriminating power even when the questions are well-chosen. A confident assertion that the sum is 105721, with the working shown, is no longer evidence of a calculator hidden behind the curtain; today's language models can show their working in plausible English, and a modern LLM prompted to respond as a modest amateur will hedge in exactly the right places. The test has not become harder; the test pool has become much, much larger.

What still defeats modern systems

It would be misleading to say that the Turing test is now trivially passed in all conditions. There remain specific failure modes that, under sustained adversarial questioning, still tend to expose machines.

The first is multi-step reasoning that requires holding several mutually constraining facts in mind at once. A human can think about a logical puzzle whose solution requires keeping three or four interacting constraints simultaneously active and noticing that two of them conflict. Current models do this less reliably. They will often satisfy each constraint locally and miss the global inconsistency, especially when the constraints are introduced one at a time across a long conversation.
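
The distinction between local and global satisfaction is easy to make precise. The toy Python sketch below, an illustration of the failure pattern rather than anything a model actually computes, shows three ordering constraints of which any two can hold together while all three cannot; spotting the conflict requires considering the set as a whole, which is exactly the step that tends to go missing.

    from itertools import product

    # Three constraints over x, y, z. Any two of them can be satisfied
    # together, but all three form the impossible cycle x < y < z < x.
    constraints = [
        ("x < y", lambda v: v["x"] < v["y"]),
        ("y < z", lambda v: v["y"] < v["z"]),
        ("z < x", lambda v: v["z"] < v["x"]),
    ]

    def satisfiable(cs, domain=range(3)):
        """Brute force: does any assignment satisfy every constraint in cs?"""
        return any(
            all(check({"x": x, "y": y, "z": z}) for _, check in cs)
            for x, y, z in product(domain, repeat=3)
        )

    for i in range(len(constraints)):          # drop any one constraint...
        assert satisfiable(constraints[:i] + constraints[i + 1:])
    assert not satisfiable(constraints)        # ...but the full set conflicts
    print("every pair is satisfiable; the full set is not")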

The second is recognising that one's own confidently asserted previous answer was wrong. Humans, embarrassingly, are not perfect at this either; but they at least have the disposition to revise. A model that has just emitted a confident wrong answer often defends it on direct challenge, or constructs an elaborate but incorrect justification rather than concede. Calibration, the agreement between stated confidence and actual correctness, remains a weakness.

The third is sustained adversarial play, in which the interrogator deliberately introduces minor inconsistencies to see whether the respondent notices. A skilful interrogator plants a small false premise early in the conversation and checks, twenty turns later, whether the respondent has quietly absorbed it. Humans usually notice; current models often do not, particularly across long contexts.

The fourth, and perhaps the most important, is tasks requiring genuinely novel reasoning rather than recombination of training-corpus patterns. Where a problem is unlike anything the model has seen, or requires an inference whose pattern is not represented in the training data, modern systems can fail in ways that look strange to a human observer. The failure is not always obvious from a single answer, but it becomes visible when the same kind of novel task is repeated with small variations.

These failure modes still tend to expose machines under careful adversarial probing, though much less reliably than they did in 2020. The gap is narrowing year on year. In its classical form the Turing test can no longer be relied on as a clean separator between humans and machines, and the field has had to invent more discriminating evaluations.

The Loebner Prize and its lessons

The most ambitious attempt to operationalise the Turing test in practice was the Loebner Prize, established by the American inventor Hugh Loebner in 1991 in collaboration with the Cambridge Center for Behavioral Studies in Massachusetts. The prize ran an annual restricted-form Turing test. A bronze medal was awarded each year to the system most often judged human in the contest's preliminary rounds. A silver medal was reserved for a system that could pass an unrestricted text-only Turing test, that is, one in which judges could ask anything they liked and were unable to identify the machine reliably. A gold medal was reserved for a system that could pass an audiovisual Turing test, in which the system would also have to handle vision and speech. The silver and gold prizes were never awarded.

Through the contest's last instalment in 2019 the bronze was won, year after year, by systems whose creators, Joseph Weintraub, Robert Medeksza, Bruce Wilcox, Steve Worswick, used pattern-matching tricks descended from Joseph Weizenbaum's ELIZA (1966) and Kenneth Colby's PARRY (1972). ELIZA was a simple rule-based program that simulated a Rogerian psychotherapist by reflecting the user's statements back as questions ("My head hurts." / "Why do you say your head hurts?"); PARRY simulated a paranoid patient. Neither understood anything in any meaningful sense. The Loebner winners were sophisticated descendants of these ideas, chatbots tuned to deflect, change subject, claim youth (a sixteen-year-old can plausibly not know things), and produce humorously evasive answers when cornered.
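
To see how little machinery the trick requires, here is a minimal ELIZA-style responder, a sketch in the spirit of Weizenbaum's program rather than a reconstruction of it; his actual script had many more rules and handled pronoun swapping.

    import re

    # Each rule pairs a pattern with a question template; the respondent
    # never asserts anything, it only reflects the user's words back.
    RULES = [
        (re.compile(r"my (.+) hurts", re.I), "Why do you say your {} hurts?"),
        (re.compile(r"i feel (.+)", re.I),   "How long have you felt {}?"),
        (re.compile(r"i am (.+)", re.I),     "Why do you think you are {}?"),
    ]

    def respond(utterance: str) -> str:
        for pattern, template in RULES:
            match = pattern.search(utterance)
            if match:
                return template.format(match.group(1).rstrip(" .!?"))
        return "Please go on."  # the all-purpose deflection

    print(respond("My head hurts."))       # Why do you say your head hurts?
    print(respond("I feel tired today."))  # How long have you felt tired today?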

Two lessons emerge. The first is that the format was easy to game. A non-expert judge, given a five-minute typed conversation, can be fooled by a chatbot with no inner life whatever, provided it has been engineered to manipulate the social dynamics of the encounter. Pretend to be young; pretend to be tired; deflect difficult questions with a joke; turn the conversation back to the interrogator's interests; profess bafflement charmingly. None of this requires intelligence in any deep sense; it requires showmanship.

The second is more important. Gaming the Loebner Prize was largely orthogonal to building intelligent systems. The winners were ingenious pieces of social engineering, clever, often funny, illustrations of how human judgement of "humanness" can be exploited, but they were not advances in AI. The mainstream of the field paid the prize little attention, and the prize did not contribute to that mainstream. By 2019 the contest was wound down. Modern frontier models pass Loebner-style tests as a side effect of their general capabilities, with no pretence of being a sixteen-year-old. The deeper question, how to evaluate systems whose competence is broad, opaque, and prompt-sensitive, remains an active research problem.

The Loebner story is, in retrospect, a case study in Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Turning the Turing test into a contest with a cash prize created an incentive to optimise the contest, not to advance the field. This pattern recurs throughout the history of AI evaluation.

Modern evaluation: benchmarks, arenas, frontier evaluations

Three families of evaluation now dominate practical AI assessment, and a beginner is best served by understanding all three because they answer different questions.

Static benchmarks are fixed datasets of inputs paired with known correct answers. The model is given each input, its output is compared with the answer, and a score is computed. The earliest widely cited modern benchmarks were GLUE (2018) and SuperGLUE (2019), which covered natural-language understanding tasks such as paraphrase detection and reading comprehension. These were largely saturated by 2020. Later benchmarks raised the bar substantially. MMLU (Hendrycks et al. 2020) tests knowledge across 57 subjects, from elementary mathematics through professional law and medicine, with multiple-choice questions. HumanEval (Chen et al. 2021) measures Python programming by asking the model to complete function bodies given docstrings. GSM8K (Cobbe et al. 2021) tests grade-school arithmetic word problems; MATH (Hendrycks et al. 2021) tests competition-level mathematics.
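
Mechanically, a static benchmark is little more than a loop. The sketch below is a deliberately minimal illustration, with made-up items and a trivial stand-in model; real harnesses add prompt templates, few-shot examples, and more forgiving answer extraction.

    # A static benchmark: fixed inputs, known answers, exact-match scoring.
    ITEMS = [
        {"input": "What is 7 * 8?",     "answer": "56"},
        {"input": "Who wrote Hamlet?",  "answer": "Shakespeare"},
        {"input": "Capital of France?", "answer": "Paris"},
    ]

    def evaluate(model, items) -> float:
        """Accuracy of `model` (any callable str -> str) on `items`."""
        correct = sum(model(i["input"]).strip() == i["answer"] for i in items)
        return correct / len(items)

    # A canned lookup stands in for a real model, for demonstration only.
    canned = {"What is 7 * 8?": "56", "Who wrote Hamlet?": "Shakespeare"}
    print(f"accuracy: {evaluate(lambda q: canned.get(q, ''), ITEMS):.1%}")  # 66.7%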

The advantages of static benchmarks are that they are reproducible, quick, and cheap to run, and they have driven much of the visible progress of the past five years. MMLU rose from 43.9% for GPT-3 in 2020 (with a random-guessing baseline of 25%) to over 90% for the strongest systems in 2024. The drawbacks are real: benchmarks are saturable. Once a benchmark is widely used, the gradient of model improvement tilts toward it, sometimes through outright contamination (the test data leaking into training corpora), sometimes through subtler forms of overfitting (researchers selecting hyperparameters that happen to favour the benchmark). Static benchmarks also struggle to capture open-ended capabilities: there is no benchmark for "is this an insightful idea" or "would this code work in a production system used by a million people".
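
Contamination can at least be screened for. One common heuristic, used in various forms since the GPT-3 era, is to flag any test item that shares a long word n-gram with the training corpus; the sketch below is illustrative only, and the choice of n = 8 is an assumption rather than a standard.

    # Flag test items that share any long word n-gram with the training text.
    def ngrams(text: str, n: int = 8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def flag_contaminated(test_items, training_docs, n: int = 8):
        corpus = set()
        for doc in training_docs:
            corpus |= ngrams(doc, n)
        return [item for item in test_items if ngrams(item, n) & corpus]

    train = ["in 1950 turing proposed to replace the question can machines "
             "think with an experimental test"]
    tests = ["turing proposed to replace the question can machines think"]
    print(flag_contaminated(tests, train))  # the overlapping item is flagged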

Arenas are head-to-head human evaluations of model outputs. The dominant example is Chatbot Arena, run by LMSYS at the University of California, Berkeley. Users submit a prompt; two anonymised models produce responses; the user votes for the better one. Millions of votes are aggregated into Elo ratings, the same system used in chess. By early 2026 the leaderboard was crowded near the top, with Claude, GPT, and Gemini variants within a few tens of Elo points of each other and of strong open-weight models such as DeepSeek V3.2 / V4. Arenas are robust to specific benchmark contamination (there is no fixed test set to leak), but they reward responses humans prefer, not necessarily responses that are true. They suffer from style bias: verbose, well-formatted answers tend to win even when they are wrong. Arenas are best read alongside other evaluations rather than alone.
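
The rating arithmetic itself is straightforward. The sketch below applies the classic chess update rule to a stream of votes between two hypothetical models; Chatbot Arena's published leaderboards have used both online updates of this kind and statistically cleaner Bradley–Terry fits, but the idea is the same.

    # Elo-style aggregation of pairwise votes between two hypothetical models.
    K = 32  # step size; 32 is a common chess default, not Arena's exact choice

    def expected(r_a: float, r_b: float) -> float:
        """Modelled probability that A beats B."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings: dict, winner: str, loser: str) -> None:
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - e)
        ratings[loser] -= K * (1 - e)

    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    for winner in ["model_a", "model_a", "model_b", "model_a"]:  # four votes
        loser = "model_b" if winner == "model_a" else "model_a"
        update(ratings, winner, loser)
    print(ratings)  # model_a ends a few dozen points ahead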

Frontier evaluations are deliberately constructed to be hard enough that current models score low. The best known is ARC-AGI, a corpus of visual reasoning puzzles assembled by François Chollet from 2019 onward and designed to test fluid abstraction that pre-training on text and images was not expected to yield. Through 2023 the best machine score remained below 35%; in late 2024 OpenAI's o3 system, using chain-of-thought reasoning under heavy compute, reached 87.5% on the semi-private set, prompting Chollet to release ARC-AGI-2 with harder puzzles. Humanity's Last Exam (HLE), a 2024 collaboration led by the Center for AI Safety with thousands of subject experts, contains around 3,000 graduate-level questions across mathematics, physics, biology, history, and other domains. At launch the best models scored under 10%; by early 2026 top models score in the low-to-mid forties (Gemini 3.1 Pro Preview ~44.7 per cent, GPT-5.5 ~44 per cent, Claude Opus 4.6 Thinking ~34 per cent). SWE-Bench (Jimenez et al. 2023, extended in SWE-Bench Verified 2024) measures the ability of agents to close real GitHub issues in real Python repositories; scores rose from under 5 per cent in early 2024 to over 85 per cent by early 2026 (Claude Opus 4.7 87.6 per cent, GPT-5.5 ~83 per cent), though SWE-Bench Verified is now widely considered contaminated and SWE-Bench Pro has emerged as the harder reference.

What the three families together reveal is that capability and evaluation co-evolve. Each fresh benchmark, after a few years, becomes a target the field aims at and eventually saturates; the field then proposes a harder one. ARC-AGI was claimed in 2019 to be a multi-decade challenge and was substantially solved within five years. HLE was meant to be unsolvable for a decade and lost a quarter of its difficulty in under a year. Whether this Whig history of upward-scrolling scores will continue indefinitely is one of the central empirical questions of the next decade. A beginner should not be too quick to assume either that benchmarks measure intelligence in any deep sense or that they are a fraud; they are imperfect, fast-moving, and indispensable, and a serious reader looks at several at once.

Beyond imitation: what the Turing test does not test

What the Turing test does not probe is now larger than what it does. Modern AI evaluation has fragmented because a five-minute typed conversation, however cleverly constructed, leaves out most of what one wants to know about an intelligent system.

The test does not ask whether the machine understands what it is saying. This was the philosopher John Searle's complaint in his 1980 Chinese Room argument: a system might pass the Turing test by mechanically manipulating symbols according to rules it does not itself comprehend, and we should not therefore credit it with understanding. Whatever one thinks of Searle's argument, the test plainly cannot distinguish a system that understands from one that merely behaves as though it does.

The test does not measure calibration. A model that confidently asserts falsehoods will, on average, pass a five-minute test more easily than a model that admits uncertainty; confidence is a strong cue of plausibility for human judges. But in real applications, a confidently wrong answer is far more dangerous than a candid "I don't know".
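
Calibration, by contrast, is directly measurable once a model reports confidences. A standard summary is the expected calibration error, the gap between stated confidence and observed accuracy averaged over confidence bins; the sketch below uses made-up numbers purely to show the computation.

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Binned ECE: |mean confidence - accuracy| per bin, weighted by size."""
        n, ece = len(confidences), 0.0
        for b in range(n_bins):
            lo, hi = b / n_bins, (b + 1) / n_bins
            idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
            if idx:
                avg_conf = sum(confidences[i] for i in idx) / len(idx)
                accuracy = sum(correct[i] for i in idx) / len(idx)
                ece += (len(idx) / n) * abs(avg_conf - accuracy)
        return ece

    # Four answers asserted at 90% confidence, only two of them right:
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1]))  # ≈ 0.4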

The test does not measure long-horizon coherence. Passing a five-minute conversation is a different challenge from maintaining a consistent persona, project, or chain of reasoning over hours, days, or weeks.

The test does not measure multi-day task completion. Real-world utility, closing a GitHub issue, writing and revising a paper, conducting a literature review, depends on capabilities far beyond a single conversation.

The test does not measure appropriate refusal. A genuinely useful AI system must sometimes refuse instructions: requests for hazardous information, for content that would harm a user, for actions outside its competence. The Turing test rewards a system that produces plausible answers to anything; it has nothing to say about when a system ought to decline.

Modern AI evaluation, accordingly, has fragmented into a constellation of measures: accuracy on benchmarks, calibration (the agreement between stated probability and empirical frequency), robustness (performance under adversarial perturbations), alignment with declared values (refusing requests for hazardous information, declining to flatter, telling the truth even when it is unwelcome), and so on. The Turing test remains useful as a piece of conceptual hygiene, a clean answer to the lazy question "but what would it even mean for a machine to think?". It is no longer adequate as a working evaluation, and has not been for some years. Chapters 15 and 16 cover the modern evaluation landscape in detail.

What you should take away

  1. The Turing test was a piece of philosophical engineering, not a working benchmark. Turing replaced the unanswerable "Can machines think?" with the operational question of whether a machine could be reliably distinguished from a human in typed conversation.
  2. The test no longer cleanly separates humans from machines. Modern large language models, trained on the human textual corpus, can mimic the surface signs of literary, arithmetic, and conversational competence well enough to pass casual versions of the test as a side effect of their general capabilities.
  3. The Loebner Prize illustrated Goodhart's law. Turning the test into a contest produced ingenious pieces of social engineering, not advances in AI; the prize was wound down in 2019.
  4. Modern AI evaluation rests on three pillars: static benchmarks (MMLU, HumanEval, GSM8K, MATH) for reproducible comparison; arenas (Chatbot Arena) for human preference at scale; and frontier evaluations (ARC-AGI, HLE, SWE-Bench) deliberately designed to be hard. Each has known weaknesses, and serious assessment uses several together.
  5. The Turing test does not test what now matters most: understanding, calibration, long-horizon coherence, multi-day task completion, and appropriate refusal. These are the dimensions along which contemporary systems are judged, and they are the subjects of Chapters 15 and 16.
