16.2 A short history of AI ethics
The notion that intelligent machines might be dangerous predates the field's name. Norbert Wiener's The Human Use of Human Beings (1950) was already worrying about feedback systems whose objectives diverged from those of the humans who built them; his 1960 Science paper "Some Moral and Technical Consequences of Automation" (Wiener, 1960) is uncannily contemporary. Wiener's specific worry was that "if we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively…we had better be quite sure that the purpose put into the machine is the purpose which we really desire". The sentence reads like a textbook definition of outer alignment, written before outer alignment was a phrase.
Joseph Weizenbaum's Computer Power and Human Reason (1976) came from inside MIT's AI Lab and was the first serious internal critique. Weizenbaum had built ELIZA, a 1966 chatbot that simulated a Rogerian psychotherapist with about 200 lines of pattern matching. He was horrified that his secretary asked to be left alone with it, and that practising therapists suggested ELIZA could be deployed clinically. The book argued not that AI was impossible but that some tasks should not be automated even if they could be; judgements involving care, dignity and respect were his canonical examples (Weizenbaum, 1976).
Hubert Dreyfus, a philosopher at Berkeley, wrote What Computers Can't Do (1972) and the updated What Computers Still Can't Do (1992). Drawing on Heidegger and Merleau-Ponty, Dreyfus argued that symbolic AI was bound to fail because intelligence is grounded in embodied skill that cannot be made explicit. He was right about symbolic AI, wrong about the impossibility of intelligence-via-statistics, and his critique is now a standard reference on the limits of explicit representation (Dreyfus, 1992).
The modern AI safety conversation began around 2000 with Eliezer Yudkowsky's writing on the Singularity Institute (later MIRI) website, and crystallised in Nick Bostrom's Superintelligence (2014). Bostrom's argument has three steps: a system substantially more capable than humans is plausible this century; such a system optimising an objective imperfectly aligned with human values would be catastrophic; and the technical problem of aligning it is hard and unsolved (Bostrom, 2014). Stuart Russell, in Human Compatible (2019), reframed Bostrom's worry as a research programme: the standard model of AI optimises a fixed objective; the next generation should be uncertain about that objective and learn it from human behaviour (Russell, 2019).
The empirical side started later. DeepMind's Victoria Krakovna began maintaining a public list of "specification gaming" examples in 2018 (Krakovna, 2020). Hubinger and co-authors' 2019 Risks from Learned Optimization introduced "mesa-optimisation" and "deceptive alignment" as technical concepts (Hubinger, 2019). Anthropic's 2022 paper on red-teaming language models (Ganguli, 2022) was one of the first to treat adversarial probing as a publishable empirical discipline rather than a security audit.
In the policy world, the timeline is shorter. The OECD AI Principles came in 2019, the EU's Ethics Guidelines for Trustworthy AI in the same year, the EU AI Act in 2024, the UK Bletchley summit in November 2023, the Seoul follow-up in May 2024, and the Paris summit in February 2025. The US executive order of October 2023 was rescinded and replaced in early 2025; California's SB 1047 was vetoed in 2024, and its successor SB 53 (the Transparency in Frontier Artificial Intelligence Act) was signed into law in September 2025.
Worth noting too is the sequence of empirical safety findings that turned the field's mood between 2020 and 2024. Brown et al.'s GPT-3 paper, released in May 2020, demonstrated emergent in-context learning (Brown, 2020); in 2021, Anthropic was founded by ex-OpenAI staff specifically to focus on safety. The 2022 Anthropic Predictability and Surprise paper (Ganguli, 2022) documented capabilities emerging at scale that were not predictable from smaller models. The 2022 Constitutional AI paper (Bai, 2022) showed that helpfulness and harmlessness could be partly decoupled. The 2023 GPT-4 system card (OpenAI, 2023) reported red-team results, including evaluations of power-seeking behaviour in agentic scaffolds. By the end of 2024, every major lab was publishing safety documentation alongside frontier model releases: Anthropic's Claude 3.5 Sonnet model card, OpenAI's o1 system card with preparedness evaluations, DeepMind's Gemini 1.5 technical report. The change from 2020 (no public safety reporting) to 2024 (safety reporting as a norm at every major lab) is the most concrete sign that the field's epistemic position has shifted.
Hinton's 2023 pivot
Worth noting separately because it changed the field's politics: Geoffrey Hinton, who had spent forty years building the technology behind modern neural networks, resigned from Google in May 2023 specifically so he could speak freely about AI risk (Metz, 2023). His arguments, made in interviews at MIT, in the New York Times and elsewhere, were that he had revised downward his estimate of how difficult artificial general intelligence is after seeing the capability jumps of GPT-4 and Bard, that he now thought existential-scale risks were plausible on shorter timelines than he had previously believed, and that the academic community had under-invested in safety. Hinton's pivot is why a chapter like this no longer sits comfortably in the "speculation" section of a textbook.