11.4 Visualising filters and feature maps
A persistent question is: what do the layers of a CNN actually compute? The first systematic answer came from Matthew Zeiler and Rob Fergus's Visualizing and Understanding Convolutional Networks (ECCV 2014), which used a deconvolutional network (a "deconvnet") to project activations back into pixel space. Their headline observation was that the layers of an AlexNet-like network organised themselves by abstraction level. Layer 1 filters were oriented edge and colour-blob detectors, strikingly similar to the receptive fields of simple cells in V1. Layer 2 captured corners, junctions and contour fragments. Layer 3 captured texture combinations: mesh patterns, animal skin, text-like structures. Layers 4 and 5 captured object parts (dog faces, wheels) and whole objects.
Chris Olah and colleagues at OpenAI and Google Brain extended these ideas in a series of Distill articles between 2017 and 2020 (Feature Visualization, The Building Blocks of Interpretability, An Overview of Early Vision in InceptionV1). Their work produced activation atlases, maps of what the network "sees", and showed that individual neurons in InceptionV1 could be rigorously characterised: this neuron fires for tabby cat fur, that one for left-pointing arrows, this one for water surfaces seen from above. The pictures, often striking, made the case that CNNs build interpretable concept hierarchies whether we ask them to or not.
The standard techniques are:
- Direct filter visualisation: simply plot the kernel weights. Useful for the first layer (where the input is RGB pixels, so each kernel renders as a small colour image) and almost useless thereafter, because deeper kernels operate on feature channels rather than pixels (first sketch after this list).
- Activation maximisation: optimise an input image by gradient ascent to maximise a chosen unit's activation, regularised to look like a natural image. The result is a synthetic "ideal stimulus" for that unit (second sketch after this list).
- Class activation mapping (CAM) and its successor Grad-CAM (Selvaraju et al., 2017): produce a heatmap over the input showing which regions drive the prediction of a particular class. Grad-CAM weights each feature-map channel of the last convolutional layer by the spatially averaged gradient of the target class score with respect to that channel, then visualises the ReLU of the weighted sum (third sketch after this list).
- Maximally activating images: search the training set for the images that most strongly activate a chosen unit (fourth sketch after this list).
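To make the first technique concrete, here is a minimal sketch in PyTorch using a pretrained torchvision ResNet-18; the model choice, the per-kernel normalisation and the 8×8 grid layout are illustrative assumptions, not anything the technique mandates.

```python
import torchvision
import matplotlib.pyplot as plt

# Load a pretrained network; any torchvision CNN whose first layer sees RGB works.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

# First-layer kernels: shape (64, 3, 7, 7) for ResNet-18.
kernels = model.conv1.weight.detach().clone()

# Normalise each kernel to [0, 1] independently so its colours are visible.
k_min = kernels.amin(dim=(1, 2, 3), keepdim=True)
k_max = kernels.amax(dim=(1, 2, 3), keepdim=True)
kernels = (kernels - k_min) / (k_max - k_min)

# Plot as an 8x8 grid; the three input channels render directly as RGB.
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0).numpy())  # (C, H, W) -> (H, W, C)
    ax.axis("off")
plt.show()
```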
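Activation maximisation is a few lines of gradient ascent. The sketch below assumes the same ResNet-18; the target layer (layer3), the channel index, the step count and the crude L2 regulariser are arbitrary choices standing in for the much stronger natural-image priors used in the Distill work.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)  # we optimise the image, not the weights

# Capture the activations of a chosen layer with a forward hook.
acts = {}
model.layer3.register_forward_hook(lambda m, i, o: acts.update(value=o))

channel = 17  # which feature-map channel to maximise; an arbitrary choice
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    model(img)
    # Ascend the mean activation of the chosen channel; the small L2
    # penalty is a crude stand-in for a natural-image prior.
    loss = -acts["value"][0, channel].mean() + 1e-4 * img.square().sum()
    loss.backward()
    opt.step()

ideal_stimulus = img.detach().squeeze(0)  # the synthetic "ideal stimulus"
```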
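And a minimal Grad-CAM following the weighting rule described above; the hooked layer (layer4, the last convolutional block of ResNet-18) and the bilinear upsampling are conventional but still assumptions of this sketch.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

feats, grads = {}, {}
def fwd_hook(module, inputs, output):
    feats["a"] = output
    output.register_hook(lambda g: grads.update(a=g))  # catch the backward pass
model.layer4.register_forward_hook(fwd_hook)

def grad_cam(x, target_class):
    """x: preprocessed input of shape (1, 3, 224, 224)."""
    logits = model(x)
    model.zero_grad(set_to_none=True)
    logits[0, target_class].backward()
    # Channel weights: gradients averaged over spatial positions.
    w = grads["a"].mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))  # (1, 1, H', W')
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max()).detach().squeeze()  # heatmap in [0, 1]

# e.g. heatmap = grad_cam(x, target_class=281)  # 281 = "tabby cat" in ImageNet
```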
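Finally, finding maximally activating images is a brute-force search. This sketch assumes a torchvision-style dataset yielding preprocessed (image, label) pairs, and scores each image by the mean activation of one channel; both the layer and the scoring rule are illustrative choices.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
acts = {}
model.layer3.register_forward_hook(lambda m, i, o: acts.update(value=o))

@torch.no_grad()
def top_activating(dataset, channel, k=9):
    """Score every image by one channel's mean activation; keep the top k."""
    scores = []
    for idx in range(len(dataset)):
        x, _ = dataset[idx]
        model(x.unsqueeze(0))
        scores.append(acts["value"][0, channel].mean().item())
    return torch.tensor(scores).topk(k).indices  # indices of the top-k images
```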
These tools have practical value beyond curiosity: they help debug models that have learned to rely on spurious correlations (the camera-watermark detector that pretended to be a horse classifier; the possibly apocryphal tank detector that learned to recognise sunny weather), and they form the basis of much modern interpretability research, recently extended to Transformers via the same lineage of work.
Saliency, integrated gradients, and adversarial fragility
A natural alternative to filter visualisation is saliency: compute $\partial \log p_y / \partial x$, the gradient of the predicted class log-probability with respect to the input pixels. Pixels with large gradient magnitude are those the model considers most relevant to the prediction. The plain saliency map (Simonyan, Vedaldi and Zisserman, 2013) is noisy because the gradient at a single point captures a local linearisation that may be unrepresentative. Integrated gradients (Sundararajan, Taly and Yan, ICML 2017) integrates the gradient along a straight-line path from a baseline $x^0$ (often a black image) to the actual input,
$$ \mathrm{IG}_i(x) = (x_i - x_i^0) \cdot \int_0^1 \frac{\partial f(x^0 + \alpha (x - x^0))}{\partial x_i} \, d\alpha, $$
which produces sharper, theoretically grounded attributions; here $f$ is the model's output for the class of interest. SmoothGrad (Smilkov et al., 2017) instead averages gradients over noisy copies of the input, achieving similar smoothing without the path integral. Sketches of both follow.
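First, plain saliency and SmoothGrad, again assuming a torchvision ResNet-18; the noise scale and sample count for SmoothGrad are illustrative defaults, not values from the paper.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

def saliency(x, target_class):
    """Gradient of the class log-probability with respect to input pixels."""
    x = x.clone().requires_grad_(True)
    log_probs = torch.log_softmax(model(x), dim=1)
    log_probs[0, target_class].backward()
    return x.grad.abs().amax(dim=1).squeeze(0)  # max over colour channels

def smoothgrad(x, target_class, n=25, sigma=0.1):
    """Average the saliency maps of n noisy copies of the input."""
    maps = [saliency(x + sigma * torch.randn_like(x), target_class)
            for _ in range(n)]
    return torch.stack(maps).mean(dim=0)
```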
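Second, the IG formula above, approximated by a Riemann sum over the path; the step count and the black-image baseline are the usual, but still arbitrary, defaults.

```python
import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Riemann-sum approximation of the IG integral along the straight
    path from the baseline x0 to the input x."""
    x0 = torch.zeros_like(x) if baseline is None else baseline  # black image
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (x0 + alpha * (x - x0)).requires_grad_(True)
        model(xi)[0, target_class].backward()  # df/dx_i at this path point
        total += xi.grad
    return (x - x0) * total / steps  # (x_i - x0_i) times the average gradient
```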
These tools also reveal a darker side of CNN representations: they are fragile. A perturbation imperceptible to a human, constructed by taking a single step in the direction of the input gradient of the loss, can flip a high-confidence "panda" into a high-confidence "gibbon" (Goodfellow, Shlens and Szegedy, 2014). The susceptibility persists in modern CNNs and motivates a substantial sub-field of adversarial robustness, beyond the scope of this chapter.
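The attack in question is the fast gradient sign method (FGSM). A minimal sketch, assuming un-normalised pixel values in [0, 1] (with ImageNet-style normalisation the clamp bounds would differ):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps=0.007):
    """One-step attack: nudge every pixel by eps in the direction that
    increases the loss for the true label (Goodfellow et al., 2014)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Clamp assumes pixel values in [0, 1]; adjust for normalised inputs.
    return (x + eps * x.grad.sign()).detach().clamp(0.0, 1.0)

# e.g. adv = fgsm(model, x, torch.tensor([388]))  # 388 = "giant panda"
```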