- Identify sources of algorithmic bias and contrast fairness metrics such as demographic parity, equal opportunity, and calibration
- Explain methods for making models interpretable (SHAP, LIME, integrated gradients, attention visualisation)
- Discuss privacy-preserving techniques including differential privacy, federated learning, and secure aggregation
- Outline AI safety risks ranging from specification gaming and reward hacking to misuse and catastrophic risk
- Summarise the current landscape of AI regulation (EU AI Act, NIST AI RMF, voluntary commitments)
A chapter on ethics in a technical textbook can read like a sermon tucked between the equations. This one tries not to. The aim is to give the working AI engineer the same tools for thinking about safety, fairness, privacy and alignment that the previous chapters gave for thinking about gradients, attention and decoding. That means definitions, mathematics where the mathematics is honest, and named papers you can read for yourself.
The chapter is opinionated only in one direction: it takes seriously that systems trained on next-token prediction at frontier scale do unexpected things, that the literature contains real disagreements about how worried to be, and that "we will figure it out later" is not a research programme. It also takes seriously the more prosaic harms that have already shipped: biased classifiers, leaked training data, deepfaked voices. These are the harms engineers ship code against today.
Two warnings about the chapter itself. First, parts of it will date faster than the rest of the textbook. The policy snapshot in section 16.18 is current as of April 2026; by the time you read this, individual fines, summit declarations and executive orders will have moved on. The conceptual structure (risk tiers, compute thresholds, AISI evaluations) is more stable, and that is what the section emphasises. Second, the chapter is written assuming you are an engineer building or deploying systems. A reader interested primarily in policy or philosophy will find the technical sections (especially 16.8, 16.11, 16.15) longer than they need to be; a reader interested only in the technical content will find the policy and history sections shorter than they would like. The intended audience is the working practitioner who needs both.
In this chapter
- 16.1 Why ethics in an AI textbook
- 16.2 A short history of AI ethics
- 16.3 Outer alignment
- 16.4 Inner alignment and mesa-optimisation
- 16.5 Goodhart's law and reward hacking
- 16.6 Specification gaming
- 16.7 RLHF failure modes
- 16.8 Adversarial attacks
- 16.9 Jailbreaks and prompt injection
- 16.10 Data poisoning and backdoors
- 16.11 Mechanistic interpretability
- 16.12 ELK: eliciting latent knowledge
- 16.13 Scalable oversight
- 16.14 Bias and fairness
- 16.15 Privacy and data protection
- 16.16 Deepfakes, watermarking, content provenance
- 16.17 Responsible Scaling Policies
- 16.18 AI policy as of April 2026
- 16.19 The case for urgency
- 16.20 The case for restraint
- 16.21 What an AI engineer can do