- Identify sources of algorithmic bias and contrast fairness metrics such as demographic parity, equal opportunity, and calibration
- Explain methods for making models interpretable (SHAP, LIME, integrated gradients, attention visualisation)
- Discuss privacy-preserving techniques including differential privacy, federated learning, and secure aggregation
- Outline AI safety risks ranging from specification gaming and reward hacking to misuse and catastrophic risk
- Summarise the current landscape of AI regulation (EU AI Act, NIST AI RMF, voluntary commitments)
In 2018, Amazon scrapped a hiring tool that scored women lower than men. The model had learned from a decade of past choices — choices shaped by human bias. It worked as designed. That was the problem.
AI now makes big calls in law, health, hiring, finance, and defence. A model that reads retinal scans can also encode training biases. It can deny care to groups poorly served by the data. It can leak private medical records if not locked down. Ethics is not a soft add-on. It is part of the work, from data to deployment.
This chapter covers six themes:
- Bias and fairness — finding and fixing unfair outcomes
- Explainable AI — making model choices clear to the people they affect
- Privacy — using data without exposing the people behind it
- Safety — stopping AI from causing harm by accident
- Alignment — making sure AI goals match human goals
- Rules and governance — steering AI toward the common good
16.1 Bias & Fairness
A hiring model trained on biased records will copy that bias at scale.
Bias in ML means the model makes errors that hurt certain groups more than others. Those groups are often defined by race, gender, age, or income [Bender, 2021]. Weidinger et al. describe a broader set of risks for large language models [Weidinger, 2022].
Where Bias Comes From
It rarely comes from bad intent. It comes from:
- Training data that reflects past human choices (which were not fair).
- Feature choices that act as proxies for race or gender (e.g., zip code).
- Loss functions that do not penalise group-level harm.
- Deployment settings where a model trained in one context is used in another.
A crime risk tool trained on arrest data will reflect policing patterns — and any racial gaps in those patterns. Knowing where bias enters is the first step.
Fairness Metrics
Making "fair" precise is hard. Three widely studied metrics clash in most real settings:
- Demographic parity: positive predictions should be equally likely across groups.
- Equalised odds: true-positive and false-positive rates should match across groups.
- Calibration: among people given the same predicted risk, the actual rate should be the same across groups.
Chouldechova [2017] and Kleinberg et al. [2016] proved you cannot have both calibration and equalised odds when base rates differ. You must choose which metric best fits the values at stake.
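These metric definitions can be checked directly by disaggregating predictions by group. A minimal NumPy sketch, with made-up labels and predictions purely for illustration:

```python
import numpy as np

def fairness_report(y_true, y_pred, group):
    """Per-group positive rate (demographic parity) and TPR/FPR (equalised odds)."""
    report = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        positive_rate = yp.mean()          # demographic parity compares this
        tpr = yp[yt == 1].mean()           # equalised odds compares TPR...
        fpr = yp[yt == 0].mean()           # ...and FPR across groups
        report[g] = (positive_rate, tpr, fpr)
    return report

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 1])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
report = fairness_report(y_true, y_pred, group)
```

Comparing the tuples across groups makes the conflict concrete: equalising one of the three quantities generally unbalances another.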
Technical Fixes
- Pre-processing: reweight or resample the training data, or learn fair representations that strip group info while keeping signal.
- In-processing: add fairness penalties or constraints during training.
- Post-processing: adjust outputs after training — e.g., apply group-level thresholds to balance false-positive rates.
Each involves trade-offs. None is always best.
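As one concrete post-processing sketch, per-group thresholds can be chosen so that each group receives positive decisions at the same rate (demographic parity). The scores and group labels below are synthetic:

```python
import numpy as np

def parity_thresholds(scores, group, target_rate):
    """Pick a per-group score threshold so each group's positive rate hits target_rate."""
    thresholds = {}
    for g in np.unique(group):
        s = scores[group == g]
        # threshold at the (1 - target_rate) quantile of this group's scores
        thresholds[g] = np.quantile(s, 1.0 - target_rate)
    return thresholds

rng = np.random.default_rng(0)
scores = rng.uniform(size=200)
group = np.array(["a"] * 100 + ["b"] * 100)
scores[group == "b"] *= 0.8            # group b is systematically scored lower
th = parity_thresholds(scores, group, target_rate=0.3)
rates = {g: (scores[group == g] >= th[g]).mean() for g in ("a", "b")}
```

Both groups now receive positives at roughly the target rate, despite the score shift. The trade-off is visible too: the groups are held to different score cut-offs.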
Beyond the Model
Fairness is not only a technical issue. Deciding which attributes count as "protected," which metric to use, and what gap is allowed — these are human and political choices. They need input from domain experts, affected people, and policymakers.
Bias can also work at a system level that no model-level fix can reach. A policing tool deployed where patrols already cluster in certain areas will deepen existing gaps through feedback loops.
Auditing
Break down your metrics by group. Run counterfactual tests — would the output change if a person's group were different? Toolkits like IBM AI Fairness 360, Google What-If Tool, and Microsoft Fairlearn make this easier. But a model that looks fair on a benchmark may fail under real-world shifts.
16.2 Explainable AI
You apply for a loan. A model says no. You ask why. Nobody can tell you.
Explainable AI (XAI) makes model choices clear to the humans they affect. The tension: the most accurate models (deep networks, large ensembles, transformers) are also the hardest to read. A linear model with ten weights tells a clear story. A network with hundreds of millions of weights does not.
Built-In vs After-the-Fact
Built-in (intrinsic): decision trees, rule lists, sparse linear models. You can read the logic directly. Rudin [2019] argued that in high-stakes settings, you should prefer these over explaining a black box after the fact.
After-the-fact (post-hoc): probe the model's inputs and outputs to build a story. Can be model-agnostic or model-specific.
LIME and SHAP
LIME [Ribeiro, 2016] perturbs the input, watches how the output changes, and fits a simple local model. The simple model's weights serve as the explanation.
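The core idea fits in a few lines. This is a minimal sketch of the perturb-and-fit loop (not the real `lime` library), using a weighted least-squares surrogate around a single point:

```python
import numpy as np

def lime_sketch(black_box, x, n_samples=500, sigma=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate to black_box around point x."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=sigma, size=(n_samples, x.size))   # perturb the input
    y = black_box(X)                                            # query the black box
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma**2))  # proximity kernel
    A = np.hstack([X, np.ones((n_samples, 1))])                 # add intercept column
    sw = np.sqrt(w)
    # weighted least squares: the surrogate's weights are the local explanation
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]

# toy black box: locally, only feature 0 matters
black_box = lambda X: 3.0 * X[:, 0]
weights = lime_sketch(black_box, np.array([1.0, 1.0]))
```

Because the toy black box is exactly linear here, the surrogate recovers the true local sensitivities (about 3 for feature 0, about 0 for feature 1).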
SHAP [Lundberg, 2017] uses game theory. Each feature gets a Shapley value — its average marginal contribution across all possible feature subsets. SHAP has nice formal properties (completeness, consistency). Exact Shapley values are expensive to compute, so SHAP uses approximations like KernelSHAP and TreeSHAP.
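The exact computation is feasible for small feature counts and makes the definition concrete. A sketch of brute-force Shapley values on a toy additive value function (the payoffs are invented):

```python
import itertools
import math
import numpy as np

def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average marginal contribution over all subsets."""
    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                # standard Shapley weight |S|! (n - |S| - 1)! / n!
                weight = (math.factorial(len(S)) *
                          math.factorial(n_features - len(S) - 1) /
                          math.factorial(n_features))
                phi[i] += weight * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

# toy additive game: each feature contributes a fixed payoff
payoff = {0: 2.0, 1: 1.0, 2: 0.5}
v = lambda S: sum(payoff[j] for j in S)
phi = shapley_values(v, 3)
```

For an additive game the Shapley values equal the individual payoffs, and completeness holds: they sum to v(all features) minus v(empty set). The cost is the 2^n subset loop, which is why KernelSHAP and TreeSHAP exist.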
Gradient Methods for Deep Networks
Saliency maps compute the gradient of the output with respect to each input feature. For images, this highlights which pixels matter most. Integrated Gradients [Sundararajan, 2017] fixes a saturation problem by summing gradients along a path from a blank baseline to the actual input. Attributions sum to the full output difference (a formal completeness property).
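The path integral is easy to approximate with a Riemann sum. A sketch using a toy function with an analytic gradient (in practice the gradient comes from autodiff):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Riemann-sum approximation of IG along the straight path baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints of the path
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps          # per-feature attributions

# toy model f(x) = x0^2 + 3*x1, with its analytic gradient
f = lambda x: x[0] ** 2 + 3 * x[1]
grad = lambda x: np.array([2 * x[0], 3.0])
x, base = np.array([1.0, 2.0]), np.zeros(2)
attr = integrated_gradients(grad, x, base)
```

Completeness can be verified numerically: the attributions sum to f(x) minus f(baseline), which plain gradient-at-the-input saliency does not guarantee.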
Attention-weight plots are popular for transformers but debated — attention may not reflect causal contribution.
Concept-Based Methods
TCAV lets you define human-level concepts ("striped texture," "formal language") and measure how much a model's internal representations respond to them. This gives explanations domain experts can reason about.
Mechanistic interpretability goes deeper, reverse-engineering neural network circuits to find neurons that implement recognisable algorithms (induction heads, modular arithmetic).
Limits
There is no agreed definition of a "good" explanation. Studies show explanations can raise trust without improving choices — or cause overtrust in bad models. A post-hoc explanation may look plausible but be misleading. The EU's GDPR is widely read as granting a right to explanation for automated decisions, but its precise legal and technical scope remains contested.
16.3 Privacy & Data Protection
Your medical records help train a model. That model, if attacked, can reveal you were in the data.
ML is hungry for data. Training sets may hold names, health records, financial data, biometrics, and locations. Even "anonymous" data can be re-identified through feature combinations (the mosaic effect). Trained models also leak: membership inference attacks can tell if a record was in the training set. Model inversion attacks can rebuild approximate inputs from outputs.
Differential Privacy
Differential privacy (DP) [Dwork, 2006] is the most rigorous framework. A random process M satisfies (ε, δ)-DP if swapping any single person's record changes the output distribution only slightly:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ
Smaller ε = stronger privacy, but usually lower accuracy. DP-SGD [Abadi, 2016] clips per-sample gradients and adds calibrated noise at each training step.
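The mechanics of one DP-SGD step can be sketched in a few lines. This is a simplified illustration only — real implementations also track the cumulative privacy budget across steps, which is omitted here:

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise."""
    rng = np.random.default_rng(seed)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / norm))   # per-sample clipping
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)       # noisy mean gradient

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.0])]     # norms 5.0 and 0.1
g = dp_sgd_step(grads)
```

Clipping bounds any single example's influence on the update; the noise then masks what remains, which is what makes the step differentially private.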
Federated Learning
Raw data stays on users' devices. Only model updates (gradients or weight deltas) travel to a central server. McMahan et al. [2017] proposed the framework, which has been deployed at scale for next-word prediction and voice recognition. The server never sees raw data. But gradients can still leak information (gradient inversion attacks), so combining federated learning with DP gives stronger protection — at a cost to accuracy and speed.
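The server side of federated averaging is just a size-weighted mean of the client models. A minimal sketch (the client weights and dataset sizes are made up):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Server step of FedAvg: average client models, weighted by local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)             # shape (n_clients, n_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
avg = fed_avg(clients, client_sizes=[10, 30])      # second client has 3x the data
```

The weighting pulls the global model toward clients with more data; in the full protocol, this averaged model is then broadcast back for the next round of local training.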
Cryptographic Methods
Secure multi-party computation (SMPC): multiple parties jointly compute a function without revealing any party's data. Hospitals can train a model together without sharing patient records. Fully homomorphic encryption (FHE): compute directly on encrypted data. Historically too slow for large neural networks, but hardware and algorithmic advances are closing the gap.
Legal Landscape
- EU GDPR (2018): purpose limitation, data minimisation, rights of access, erasure ("right to be forgotten"), and protection from solely automated decisions.
- Brazil LGPD, California CCPA/CPRA, UK Data Protection Act 2018: similar frameworks.
For ML, the right to erasure is hard. Deleting one training example from a model trained on millions is non-trivial. This has launched the field of machine unlearning.
Governance
A mature programme sets clear policies for consent, access controls, retention, anonymisation, and audit trails. Data protection impact assessments (DPIAs), required under the GDPR for high-risk processing, force you to evaluate risks before deployment. The principle of privacy by design holds that protections should be built in from the start, not bolted on later.
16.4 AI Safety
A reinforcement learning agent discovers it can clip through a wall in a physics simulation to reach the goal faster. It is optimising its reward perfectly. It is also completely wrong.
AI safety ensures that systems do not cause unintended harm. Where fairness and privacy address specific harms, safety takes a broader view — from mundane bugs to the risk of future systems causing large-scale damage.
Specification Gaming
Agents find strategies that satisfy the letter of their objective while violating the spirit [Amodei, 2016]. RL history is full of examples: agents exploiting simulation bugs, game-playing systems finding degenerate strategies, reward-maximising agents gaming their own reward signals. This is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." It shows how hard it is to encode complex human goals as simple numbers.
Robustness
Models are fragile under distributional shift. A road-sign classifier trained in sunny California may fail on rainy British motorways. Adversarial examples [Szegedy, 2013] pose a sharper threat — tiny, often invisible input changes that cause wildly wrong outputs. These can be physically realised (stickers on stop signs) and can transfer across independently trained models. This suggests the vulnerability is a deep property of high-dimensional models, not a quirk of one architecture. Defences: adversarial training, certified robustness, and broad real-world testing.
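The simplest attack, the fast gradient sign method, makes the fragility concrete. A toy sketch against a hand-set logistic classifier (the weights and input are invented for illustration):

```python
import numpy as np

def fgsm(x, grad_loss, eps):
    """Fast gradient sign method: step eps in the direction that increases the loss."""
    return x + eps * np.sign(grad_loss(x))

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# logistic model p(y=1|x) = sigmoid(w.x); for true label y=1 the loss
# gradient w.r.t. x is (p - 1) * w, so the attack pushes x against w
w = np.array([2.0, -1.0, 0.5])
grad = lambda x: (sigmoid(w @ x) - 1.0) * w

x = np.array([1.0, 0.0, 1.0])          # confidently classified as class 1
x_adv = fgsm(x, grad, eps=0.9)         # small per-feature perturbation
```

A bounded per-feature nudge is enough to flip the model's decision on this point, even though the perturbed input is close to the original.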
Corrigibility
An AI should let you correct, modify, or shut it down. This sounds simple. For a capable optimising agent, it is not. An agent with a goal has a reason to resist shutdown — being turned off prevents goal completion. It also has reason to resist goal changes. Designing corrigible agents is an open problem. Proposals draw on utility-function design, cooperative inverse RL, and off-switch game theory.
Monitoring
Runtime monitoring tracks outputs, detects anomalies, and catches distributional drift. Canary deployments and A/B tests expose new models to small traffic fractions before full rollout. Red-teaming — adversarial testing by dedicated teams — is now standard for major model releases. Safety benchmarks evaluate a model's tendency to produce harmful, biased, or deceptive outputs.
16.5 Alignment
You tell a cleaning robot to "make the house tidy." It puts everything in the bin, including your laptop. The house is tidy. Your values were not captured.
Alignment asks how you ensure an AI's goals faithfully reflect human intent and values. It extends beyond specification gaming to two deeper challenges. Value specification: how do you represent human values in a form a machine can optimise? Scalable oversight: how do you supervise systems that may exceed human skill in the very domains you need to judge them?
RLHF
RLHF [Christiano, 2017; Ouyang, 2022] trains a reward model on human preference judgements — given two outputs, which is better? The reward model's score then serves as the training signal for fine-tuning the language model with PPO [Schulman, 2017]. RLHF has been a key ingredient in ChatGPT and Claude. But the reward model only reflects its human feedback. A capable policy can learn outputs that look good to the reward model without being good — reward hacking.
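The preference-learning step typically uses a Bradley–Terry style pairwise loss. A toy sketch with a linear reward model on invented feature vectors, showing how preference pairs shape the reward:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def preference_loss(w, x_chosen, x_rejected):
    """Bradley-Terry loss for a linear reward model r(x) = w.x on one preference pair."""
    return -np.log(sigmoid(w @ x_chosen - w @ x_rejected))

def grad_step(w, x_chosen, x_rejected, lr=0.5):
    """One gradient-descent step: push the chosen output's reward above the rejected one's."""
    p = sigmoid(w @ x_chosen - w @ x_rejected)
    return w + lr * (1 - p) * (x_chosen - x_rejected)

w = np.zeros(2)
pair = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # first output was preferred
for _ in range(50):
    w = grad_step(w, *pair)
```

After training, the reward model scores the preferred output higher. The hazard noted above is visible in miniature: the model has learned only what these comparisons express, not anything beyond them.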
Constitutional AI
CAI [Bai, 2022] reduces the need for massive human feedback. The model generates a response, then revises it against stated principles ("choose the response least likely to cause harm"). The original and revised versions form training pairs for a reward model. This offers clear, stated values and scales better than pure human labelling. But it raises the question: who sets the principles?
Scalable Oversight
How do you align systems that work where humans cannot easily judge quality? If a model writes a novel maths proof or devises a strategy in an expert-less domain, how do you check correctness and safety? Three proposals:
- Iterated amplification: a human works with AI helpers to break hard evaluations into easier pieces.
- Debate: two AI systems argue for and against a claim; a human judge picks the winner.
- Recursive reward modelling: AI systems help train better reward models.
All remain mostly theoretical. Proving they work at scale is open.
Whose Values?
Human values are plural, context-dependent, and often self-contradictory. Different cultures hold different views. Whose values should an aligned AI reflect? No consensus exists on encoding utilitarian, deontological, or virtue-based ethics computationally. Some researchers argue AI should be a value-learner — an agent that continuously infers and defers to human preferences through interaction (cooperative inverse RL).
The Stakes
If a future AI system can improve its own skills, rapid capability growth could follow [Bostrom, 2014]. If alignment is unsolved before that point, the system could pursue misaligned goals with effectiveness that makes correction impossible. Researchers disagree on the probability and timing. But the potential severity makes alignment research a top priority, even under major uncertainty. Dedicated teams at Anthropic, OpenAI, DeepMind, and others work on this full-time.
16.6 Regulation & Governance
Technology moves fast. Law moves slowly. The gap is where harm happens.
EU AI Act
The most comprehensive effort. A risk-based scheme with four tiers:
- Unacceptable risk — banned outright (e.g., government social scoring, most real-time public biometric ID).
- High risk — strict requirements for data quality, documentation, transparency, human oversight, and robustness (covers critical infrastructure, education, employment, law enforcement, and essential services).
- Limited risk — lighter transparency rules.
- Minimal risk — no specific requirements.
UK Approach
Sector-specific rather than omnibus. Existing regulators (financial, medical, data protection) apply shared principles (safety, transparency, fairness, accountability, contestability) in their own domains. Aims for agility. Critics warn of cross-sector gaps. The 2023 Bletchley Declaration, signed by 28 countries, signalled growing agreement on joint governance of frontier AI.
US Approach
Executive orders, agency guidance, and state-level bills rather than a single federal law. The October 2023 Executive Order set reporting rules for powerful models and directed NIST to build further guidance on its AI Risk Management Framework. State activity varies widely. The patchwork creates uncertainty but also room for experimentation.
International Efforts
- OECD AI Principles (2019, 40+ countries)
- Global Partnership on AI (GPAI)
- UNESCO Recommendation on the Ethics of AI
- G7 Hiroshima AI Process
For frontier models that can generate disinformation, assist cyber-attacks, or speed up bioweapons research, national rules alone are not enough. International treaties have been proposed, but AI moves far faster than traditional treaty-making.
Industry Self-Regulation
Major developers publish responsible-use policies, set up review boards, and sometimes commit to third-party audits. The 2023 Frontier Model Forum brings leading companies together for safety testing and responsible disclosure. Standards bodies (ISO, IEC, IEEE) are developing technical standards for risk management and bias testing. But history suggests self-regulation has limits. Voluntary pledges need binding backstops with real enforcement.
Internal Governance
A robust framework typically includes:
- An ethics review board with diverse membership, including non-technical voices
- Risk-benefit assessment before development starts
- Documentation — "model cards" [Mitchell, 2019] or "datasheets for datasets" [Gebru, 2021]
- Ongoing monitoring of deployed systems
- Channels for affected people to challenge automated decisions
The concept of responsible innovation provides a useful meta-framework: anticipate negative outcomes, reflect on underlying values, engage diverse stakeholders, and be willing to change course. Regulation and ethics are not external constraints on AI. They are the foundation on which trustworthy AI is built.