Glossary

AI Safety

AI Safety is the interdisciplinary field concerned with ensuring that artificial intelligence systems operate without causing unintended harm. Unlike fairness and privacy, which address specific categories of harm in deployed systems, AI safety takes a broader perspective, examining failure modes that range from mundane software bugs to catastrophic risks from highly capable future systems. Its central challenge: how do we build AI that does what we want, even as it becomes more powerful and autonomous?

Near-term safety concerns include specification gaming (optimisers finding unintended strategies satisfying the literal objective), distributional shift (models failing when deployed conditions differ from training), adversarial examples (crafted inputs fooling classifiers), and reward hacking. Robust deployment requires extensive testing, red-teaming (adversarial probing), monitoring, graceful failure handling, and the ability to shut down or correct systems when they malfunction.

Longer-term concerns focus on the alignment problem—ensuring AI objectives match human values as systems grow more capable—and related issues like corrigibility (can we still correct or shut down a sufficiently capable agent?), scalable oversight (how do we supervise AI in domains where humans cannot easily judge correctness?), and the possibility of recursive self-improvement leading to an intelligence explosion. While opinions differ on timelines and probabilities, the potential severity of worst-case outcomes makes AI safety research a high priority. Leading organisations including Anthropic, OpenAI, DeepMind, and MIRI have dedicated teams working on these problems, and governments increasingly recognise AI safety as a national priority.

Related terms: AI Alignment, Adversarial Example

Discussed in:

Also defined in: Textbook of AI