Also known as: XAI, interpretable AI
Explainable AI (XAI) is concerned with making the internal workings and outputs of machine learning models comprehensible to human stakeholders. The need arises from a fundamental tension: the models that tend to be most accurate on complex tasks (deep neural networks, large ensembles, transformers) are often also the most opaque. A linear regression with ten coefficients is directly inspectable; a neural network with hundreds of millions of parameters is not. When such a model denies a loan, flags a patient for disease, or recommends parole, affected individuals and decision-makers have a legitimate interest in understanding why.
Approaches divide into intrinsic and post-hoc explainability. Intrinsically interpretable models (decision trees, rule lists, generalised additive models, sparse linear classifiers) are designed so their logic is directly readable. Cynthia Rudin argues that in high-stakes domains one should prefer interpretable models outright rather than explaining opaque ones after the fact. Post-hoc methods provide explanations for black-box models. LIME perturbs the input around a point and fits a simple local surrogate model, such as a sparse linear model, to the black box's responses. SHAP uses Shapley values from cooperative game theory to assign each feature its marginal contribution to the prediction, averaged over all subsets of the other features. Both can be applied model-agnostically, although SHAP also has faster model-specific variants such as TreeSHAP.
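The Shapley attribution described above can be sketched in a few lines for a small model. This is a minimal illustration, not the SHAP library's implementation: `toy_model`, the input `x`, and the all-zero `baseline` are hypothetical, and "removing" a feature is approximated by replacing it with its baseline value, which is one common convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for `model` at input `x`.

    Assumption: a feature that is "absent" from a coalition takes its value
    from `baseline`. Cost grows as 2^n, so this only suits small n.
    """
    n = len(x)

    def value(subset):
        # Model output when only the features in `subset` take their real values.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical scoring function with an interaction between features 0 and 2.
    return 2.0 * z[0] + 1.0 * z[1] + 3.0 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [3.5, 1.0, 1.5]
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is split equally between the two features involved. Practical tools approximate this sum by sampling or exploit model structure, because exact enumeration is exponential in the number of features.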
For deep networks, saliency maps compute the gradient of the output with respect to each input feature, highlighting the pixels to which the prediction is most sensitive. Integrated Gradients refines this by accumulating gradients along a straight-line path from a baseline input (for example, an all-black image) to the actual input, so that the attributions sum to the difference between the two outputs. Attention visualisation shows which tokens a transformer attends to, though attention weights do not necessarily reflect causal contribution. Concept-based explanations such as TCAV (Testing with Concept Activation Vectors) operate at a higher level of abstraction, measuring sensitivity to human-defined concepts rather than to individual input features. Despite progress, XAI remains a contested field: explanations can be misleading, user studies show they sometimes increase trust without improving decisions, and what counts as a "good explanation" depends heavily on the user and context.
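The difference between a plain gradient saliency and Integrated Gradients can be illustrated on a tiny differentiable model. The sketch below assumes a hypothetical logistic model with hand-picked weights `W` and bias `B`, an all-zero baseline, and a midpoint Riemann sum in place of the exact path integral; a real deep network would obtain the gradients by automatic differentiation instead of the closed form used here.

```python
import math

# Hypothetical parameters of a small logistic model (illustration only).
W = [1.5, -2.0, 0.5]
B = 0.1


def model(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    # Closed-form gradient of the sigmoid output with respect to each input.
    p = model(x)
    return [p * (1.0 - p) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = gradient(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the path-averaged gradient by the input's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]

saliency = gradient(x)                           # local sensitivity at x only
attributions = integrated_gradients(x, baseline)

print("saliency:", saliency)
print("integrated gradients:", attributions)
print("sum of attributions:", sum(attributions))
print("output difference:", model(x) - model(baseline))
```

The last two printed values should agree closely, reflecting the completeness property of Integrated Gradients. The plain gradient carries no such guarantee, since it describes sensitivity only at the input itself.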
Discussed in:
- Chapter 16: Ethics & Safety — Explainable AI
Also defined in: Textbook of AI