17.10 Education

The dream of a personal tutor for every learner is older than the printing press. Quintilian, writing in the first century, argued that the wealthy Roman parents who hired private tutors did so because they believed individual attention was qualitatively different from the bench-and-blackboard schooling available to everyone else. In 1984 the educational psychologist Benjamin Bloom formalised this intuition with what he called the two-sigma problem: students taught one-to-one by a competent tutor, with mastery learning, performed roughly two standard deviations better than students in a conventional classroom. The size of the gap is staggering: the average tutored student outscored ninety-eight per cent of conventionally taught peers. Bloom posed the problem but could not solve it, because hiring human tutors at scale is unaffordable for most families and almost all education ministries. The hope, repeated across every generation of educational technology from teaching machines to MOOCs, is that some new tool will close the gap at acceptable cost. Large language models are the latest candidate. They can hold a conversation, mark essays, generate worked examples and answer questions at any hour, and they cost a few dollars per student per month rather than fifty dollars per hour. The question is whether they actually teach.
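The ninety-eight per cent figure follows directly from the normal distribution: a score two standard deviations above the mean sits at roughly the 97.7th percentile, conventionally rounded to 98. A one-line check, assuming normally distributed scores:

```python
from statistics import NormalDist

# Fraction of a standard normal distribution lying below a point
# two standard deviations above the mean: about 0.977, commonly
# rounded to "98 per cent" in summaries of Bloom's result.
percentile = NormalDist(mu=0, sigma=1).cdf(2.0)
print(round(percentile * 100, 1))  # 97.7
```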

Section 17.9 looked at how AI is reshaping decisions about money. This section turns to decisions about minds: what students learn, how teachers teach and how schools assess.

Khanmigo (Khan Academy 2023)

Khan Academy announced Khanmigo in March 2023, only four months after ChatGPT's public release, as part of a partnership with OpenAI built on early access to GPT-4. Sal Khan's framing was deliberately Bloomian: an AI tutor that approximated the patient, Socratic style of a good human tutor, paired with a teacher assistant that handled lesson planning, marking and parent communication. The pedagogy was the distinguishing choice. Khanmigo was prompted, fine-tuned and reinforced never to give the answer to a homework problem outright. It asks the student to read the question aloud, restate it in their own words and propose a first step. If the student is stuck it offers a hint at the level immediately below where they are stuck, then escalates only if necessary. If the student gets the answer wrong, it asks them to find the error themselves before pointing it out. It maintains the conversation across turns, so a student working through a long algebra problem does not have to reintroduce the context each time.
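The escalation pattern described above, offer the gentlest useful hint first and move up the ladder only if the student remains stuck, can be sketched as a simple hint ladder. This is an illustrative sketch only, not Khanmigo's implementation; the hint text and function names are invented for the example.

```python
# Illustrative hint ladder for a Socratic tutor working a linear
# equation. Not Khan Academy's code; the hint levels are invented.
HINT_LADDER = [
    "Restate the problem in your own words.",
    "What quantity are you being asked to find?",
    "Which operation would undo the one applied to x?",
    "Apply that operation to both sides of the equation.",
]

def next_hint(stuck_count: int) -> str:
    """Return the hint one level above the last one given,
    escalating only while the student is still stuck."""
    capped = min(stuck_count, len(HINT_LADDER) - 1)
    return HINT_LADDER[capped]

# A newly stuck student first sees the gentlest prompt:
print(next_hint(0))  # "Restate the problem in your own words."
```

The design choice worth noting is that the full answer is never in the ladder at all: even the last rung tells the student what to do, not what the result is.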

By 2025 Khanmigo had passed one million student users and had been deployed in a growing list of school districts, including Newark Public Schools, Hobbs Municipal Schools and a number of charter networks. Subscription pricing for individual families sits around forty-four dollars per year; districts pay per seat at a discount. The teacher-facing features are arguably the more consequential half of the product: Khanmigo will generate a lesson plan aligned to a state standard, draft a rubric, mark a stack of short-answer responses and flag students who appear to be struggling, freeing the teacher for the human work of motivation and pastoral care. Anecdotal reports from teachers have been positive. Rigorous randomised evaluation evidence is, as of 2026, still limited; Khan Academy has commissioned external evaluations but the publicly available data falls well short of the multi-site randomised controlled trials that would be needed to make a confident claim about effect size.

Duolingo Max

Duolingo Max, launched alongside Khanmigo in March 2023, is a higher subscription tier that adds two GPT-4-powered features to the existing app. Roleplay drops the learner into a scenario (ordering coffee in Paris, complaining about a hotel room in Madrid, attending a job interview in Tokyo) and lets them carry on a free-form conversation with the AI in the target language, with the AI gently steering and grading. Explain My Answer lets the learner click on a question they got wrong and ask why, with the AI generating a tailored explanation rather than the canned one written by a human content team.

Duolingo's strategic shift went further than these two features. In April 2025 chief executive Luis von Ahn announced that the company was "AI-first" and that contractor translators were being phased out as LLMs took over content generation. The announcement attracted criticism from the labour movement and from learners who valued the human touch in the curriculum, but the financial logic was hard to dispute: the cost of producing a new course had fallen by an order of magnitude, allowing Duolingo to add languages and content that would have been uneconomic before. The Max tier is priced at around thirty dollars per month, well above the standard Super tier, and the company reports that LLM features account for a meaningful and growing share of new subscriptions.

Issues

The cluster of concerns about AI in education is now well-rehearsed in the popular and professional press, and each deserves to be taken seriously rather than waved away.

Cheating. Take-home essays, problem sets and coding assignments are no longer trustworthy evidence of student learning. Detection tools (Turnitin's AI detector, GPTZero and a long list of competitors) have repeatedly been shown to produce both false positives and false negatives at rates that make them unsuitable for high-stakes use, and they discriminate against students who write in a second language. The institutional response has been a return to invigilated written examinations, oral examinations and process-based assessment, in which students must show drafts, version histories and conversations with the AI as part of the submission. This is a sensible adjustment but it shifts the cost back onto teachers, who must now design and mark fundamentally different assessments.

Dependence. A second concern is that students who outsource their thinking to a chatbot do not develop the underlying skills. The evidence on this is mixed and often confounded with the cheating worry. Studies in mathematics and writing suggest that learners who use an LLM to scaffold a difficult task (asking for a hint, an example, a critique) perform better than those who do not, while learners who use the LLM to substitute for the task perform worse on later unaided tests. The pedagogical lesson, familiar from Vygotsky's zone of proximal development, is that the productive use of a tutor is at the boundary of what a learner can do alone, not on the inside of it.

Bias and hallucination. LLMs produce confident-sounding factual errors, and they reproduce the biases of their training corpora. In a tutoring context this is more dangerous than in casual use because a student who does not yet know the material has no way to spot the mistake. A tutor that confidently teaches that the French Revolution began in 1788, or that mitochondria are found only in animal cells, has done active harm. The product response (retrieval-augmented generation against a curated curriculum, refusal to answer outside it, citation of sources) mitigates but does not eliminate the risk.
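The curriculum-grounding pattern can be illustrated with a toy sketch. A production system would use vector embeddings and an LLM to compose the answer from retrieved passages; here a keyword lookup and hand-written snippets (both invented for the example) stand in for the retriever and the curriculum.

```python
# Toy sketch of retrieval-grounded tutoring with refusal outside
# the curriculum. Real systems use embedding search and an LLM;
# the topics and passages below are illustrative only.
CURRICULUM = {
    "french revolution": "The French Revolution began in 1789.",
    "mitochondria": "Mitochondria occur in both plant and animal cells.",
}

def grounded_answer(question: str) -> str:
    """Answer only from curated passages, citing the source;
    refuse anything the curriculum does not cover."""
    q = question.lower()
    for topic, passage in CURRICULUM.items():
        if topic in q:
            # Quote the vetted passage rather than free-generating.
            return f"{passage} [source: curriculum/{topic}]"
    return "That topic is outside this course; please ask your teacher."

print(grounded_answer("When did the French Revolution start?"))
```

The refusal branch is the important part: a grounded tutor trades coverage for the guarantee that whatever it does assert has been vetted by a human curriculum author.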

Equity. The cost of an AI tutor sits well below the cost of a human tutor but well above zero, and the families that can afford the subscription are the ones whose children already have the most educational support. A widening gap is plausible, though the same pattern obtained for previous waves of educational technology (calculators, personal computers, internet access) and was eventually closed by public provision. Whether public provision arrives in time to prevent a generation-shaped gap is a policy question rather than a technical one.

Where it works

There are corners of education where AI tutors already earn their keep. Personalised drilling in mathematics, where the answers are unambiguous, the feedback loop is tight and the right number of similar problems is exactly what the student needs, is a strong fit; tools such as Khanmigo, Squirrel AI Learning in China and Photomath all show the pattern. Conversational language practice, speaking in the target language with an infinitely patient interlocutor who does not judge accent or hesitation, is one of the few places where an AI tutor is plausibly better than the available human alternatives, because the alternative for most learners is no conversation partner at all. Code learning on platforms such as Codecademy, Replit Ghostwriter and the various "explain this code" features built into editors gives the learner immediate feedback in a way that pre-LLM tools could not. And across all subjects, concept clarification, "explain it again, this time as if I am a child" or "give me three different analogies", works well precisely because the LLM is not bound by the single explanation in the textbook. The learner who has bounced off the printed exposition can probe a chatbot from multiple angles until something clicks.

Where it doesn't

The same technology is at best unhelpful, and at worst harmful, in other corners. Long-form writing instruction is the obvious case: the AI is so good at writing that it tempts the student into delegation, and delegation defeats the purpose. Subjects that require physical practice (sports, musical performance, surgery) cannot be tutored over a chat interface, although adjacent uses such as feedback on video or technique annotation are emerging. Subjects that depend on supervised collaboration (laboratory work, group critique in design education, ward rounds in clinical training) lose something essential when an individual student talks to an individual chatbot rather than working through the material with peers and a more experienced human. And early childhood, where the social and emotional content of teacher-pupil interaction matters as much as the cognitive content, is a setting where screen-based AI tutoring is at best premature.

What you should take away

  1. Bloom's two-sigma effect remains the benchmark: a competent one-to-one tutor with mastery learning lifts the average student to the ninety-eighth percentile, and AI tutors are now affordable enough for the comparison to matter.
  2. Khanmigo is the canonical pedagogically careful product (Socratic, hint-laddered, anti-answer-giving) and Duolingo Max is the canonical conversational-practice product, with each suggesting a viable shape for AI in education at scale.
  3. Cheating, dependence, hallucination and equity are real concerns, but the institutional responses (invigilated assessment, scaffold-not-substitute pedagogy, retrieval-grounded tutoring, public provision) are straightforward in principle if costly in practice.
  4. The applications that work best are those with unambiguous feedback loops (maths drills, code), low-stakes conversational practice (languages) and concept clarification on demand; the applications that work worst are those where the AI does for the student what the student needs to learn to do.
  5. Empirical evaluation lags far behind deployment; the closest evidence base, intelligent-tutoring-system research from the 1990s and 2000s, suggested effect sizes of 0.3–0.7, and there is as yet no convincing public evidence that LLM-based tutors do meaningfully better.

Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).