15.21 Safety, interpretability, and the open questions
We should not finish a chapter on modern AI without acknowledging that the deployment of these systems has run substantially ahead of our understanding of them. Three areas stand out where the open questions are most acute.
Mechanistic interpretability
The agenda of reverse-engineering the circuits inside a trained Transformer has produced concrete results since 2022. Induction heads and other small circuits have been characterised in detail. Sparse autoencoders (Cunningham et al., 2023; Anthropic's 2024 sparse-autoencoder work on Claude 3 Sonnet) have decomposed model activations into interpretable features, with millions of human-meaningful directions identified in production models. Activation steering, in which a learned vector is added to the residual stream to control behaviour, has emerged as a practical control technique.
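To make the steering idea concrete, here is a minimal sketch assuming a PyTorch decoder-only Transformer with a GPT-2-style module layout; the layer path and the provenance of `steering_vector` are illustrative assumptions, not a specific published implementation.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 4.0):
    """Return a forward hook that adds a fixed direction to a block's
    residual-stream output. `steering_vector` has shape (d_model,)."""
    def hook(module, inputs, output):
        # Many Transformer blocks return a tuple; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: `model` is a GPT-2-style Transformer and `steering_vector`
# was learned, or extracted as a difference of mean activations between
# contrastive prompts, for some target behaviour.
# layer = model.transformer.h[12]          # assumption: GPT-2-style module path
# handle = layer.register_forward_hook(make_steering_hook(steering_vector))
# ... generate text; the added direction biases the model's behaviour ...
# handle.remove()
```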
But the gap between "we have found a thousand interpretable circuits" and "we understand what the model is doing on a typical query" remains vast. The frontier of mechanistic interpretability research in 2026 is measured in millions of extracted features, not the trillions of parameters that give a frontier model its full capability.
Scalable oversight
How do you supervise a model that may exceed human capability in some domain? The proposed approaches, among them debate, recursive reward modelling, and weak-to-strong generalisation (Burns et al., 2023), all rely on some structural property of the supervision problem: that it is easier to verify than to generate, that disagreement reveals truth, or that strong students can extract a supervision signal even from weak teachers. None of these properties is universally robust.
Empirically, weak-to-strong generalisation (training a strong model on labels generated by a weaker model) recovers most, but not all, of the capability the strong model achieves when trained with full ground-truth supervision. Closing that gap, the "alignment tax" of weak supervision, is the open problem.
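A minimal sketch of the training objective involved, assuming simple classification models in PyTorch; the function name and the hard-pseudo-label loss are illustrative simplifications of the setup studied by Burns et al., who also explore confidence-based auxiliary losses.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong_model, weak_model, batch, optimizer):
    """One step of weak-to-strong training: the strong model is fit to labels
    produced by the frozen weak model, not to ground truth."""
    with torch.no_grad():
        weak_logits = weak_model(batch["inputs"])    # weak supervisor's predictions
        weak_labels = weak_logits.argmax(dim=-1)     # hard pseudo-labels

    strong_logits = strong_model(batch["inputs"])
    loss = F.cross_entropy(strong_logits, weak_labels)  # naive objective; a sketch only

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The gap discussed above is measured by comparing the accuracy of a strong
# model trained this way against the same model trained on ground-truth labels.
```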
What models believe
A practical question that turns out to be deep: when a model says "I think X", what does that mean? Models can be calibrated (their stated probabilities match empirical frequencies), can have introspective access to some of their own computations, and can be deceptive (Park et al., 2023). They do not have beliefs in any obvious sense, but they do have something: a set of consistent dispositions, maintained across contexts, that responds predictably to prompts and pressures.
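For the calibration claim specifically, here is a small sketch of how one might check it, assuming we have a model's stated probabilities and the correctness of its answers; the function and binning scheme are illustrative, not drawn from a particular evaluation suite.

```python
import numpy as np

def expected_calibration_error(stated_probs, correct, n_bins=10):
    """Crude calibration check: bin the model's stated confidences and compare
    each bin's average confidence with its empirical accuracy."""
    stated_probs = np.asarray(stated_probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each probability to a bin index in [0, n_bins - 1]; p = 1.0 falls in the last bin.
    idx = np.minimum((stated_probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(stated_probs[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of the samples
    return ece

# "Calibrated" in the sense used above means this number is small: answers the
# model states with 70% confidence are right roughly 70% of the time.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.7], [1, 1, 0, 1]))
```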
Whether this something is enough for the model to be a moral agent, a moral patient, or merely a complex tool is among the largest open questions of the 2020s, and one we will not settle in this book.