11.7 Transfer learning
Imagine you have been asked to train a convolutional network to spot pneumonia in chest X-rays. You have five thousand labelled images, a single GPU and a fortnight to deliver something useful. Training a modern CNN from scratch in this situation is hopeless. Networks like ResNet-50 were trained on ImageNet, over a million images across a thousand categories, using clusters of GPUs running for several days. Five thousand X-rays will not get you anywhere near the same place. The network has tens of millions of parameters; with so few examples it will simply memorise the training set and fail on anything new.
Transfer learning is the idea that rescues you. Instead of starting from random weights, you start from a network somebody else has already trained on a much larger dataset. You keep most of what it has learned and retrain only the parts that need to know about your particular problem. The result is that you get most of the performance of a fully trained network for a tiny fraction of the cost and a tiny fraction of the data. This is not a niche trick; it is the dominant deep-learning workflow in production today. Almost nobody trains an image model from scratch any more. They take a pretrained one, adapt it, and ship.
Where §11.3 was about how to design a network, this section is about how to reuse one somebody else has already designed and trained — which is what you will almost always do in practice. Transfer learning reappears in Chapter 15 at the scale of foundation models such as CLIP and the GPT family, pretrained on enormous corpora and adapted to thousands of downstream tasks. The principles are the same; only the scale changes.
The basic recipe
The mechanics of transfer learning are simple. There are three steps, and after you have done them once you can do it on autopilot.
Step 1: take a pretrained model. Every major deep-learning framework ships with model zoos containing networks pretrained on ImageNet. In PyTorch, torchvision.models.resnet50(weights="IMAGENET1K_V2") gives you a ResNet-50 with parameters that have already converged on the ImageNet classification task. The download is a few hundred megabytes; after that it is a one-line operation.
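As a concrete illustration, here is a minimal sketch using torchvision (assuming a recent version that accepts the weights argument):

```python
import torchvision.models as models

# Downloads the weights on first use, then instantiates a ResNet-50 whose
# parameters have already converged on the ImageNet classification task.
model = models.resnet50(weights="IMAGENET1K_V2")
```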
Step 2: replace the final classifier. A pretrained ImageNet model has a final fully connected layer that maps from the last hidden representation (2048 dimensions for ResNet-50) to one thousand class scores. You almost certainly do not want one thousand classes; you want however many classes your task has. Throw the old final layer away and bolt on a new one with the right output size. For binary pneumonia detection, that is two outputs, or one output with a sigmoid. The new layer's weights are initialised randomly; it is the only part of the network that is.
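In PyTorch the swap is two lines. The sketch below continues from the model loaded above and assumes a two-class problem:

```python
import torch.nn as nn

# ResNet-50's classifier is the attribute `fc`: a Linear layer mapping
# 2048 features to 1000 ImageNet classes. Replace it with a fresh,
# randomly initialised head sized for our task.
num_features = model.fc.in_features   # 2048 for ResNet-50
model.fc = nn.Linear(num_features, 2)
```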
Step 3: train. Now you have a choice, and this is where the two main flavours of transfer learning diverge.
The first flavour is feature extraction. You freeze all the layers of the pretrained network, so that their weights do not update during training, and train only the new final layer. The pretrained network becomes a fixed feature extractor: an image goes in, a 2048-dimensional vector comes out, and a small classifier on top learns to map those vectors to your classes. This is fast, since you only update a few thousand parameters, and it works well when your task is similar to the source task. It also works with very little data (sometimes a few dozen examples per class are enough), because the heavy lifting has already been done.
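A minimal sketch of the feature-extraction flavour, assuming a freshly loaded ResNet-50 as in step 1:

```python
import torch
import torch.nn as nn

# Freeze every pretrained parameter so that gradients are neither
# computed for them nor applied to them.
for param in model.parameters():
    param.requires_grad = False

# The replacement head is created after freezing, so it stays trainable.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the head's few thousand parameters are handed to the optimiser.
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
```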
The second flavour is fine-tuning. You leave the pretrained weights unfrozen and train every layer, but with a much smaller learning rate than you would use from scratch, typically $10^{-4}$ rather than $10^{-3}$. The intuition is that the pretrained weights are already nearly right; large updates would destroy what makes them useful. Fine-tuning is slower and uses more memory, but it usually delivers better final accuracy than feature extraction, especially when your task differs noticeably from ImageNet. A common compromise is to start with feature extraction for a few epochs (so the new head settles down before its random gradients can disturb the rest of the network), then unfreeze and fine-tune the whole thing at a small learning rate for many more.
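Fine-tuning differs only in what the optimiser is allowed to touch; a sketch, with the learning rate being the number you would most often adjust:

```python
import torch

# Fine-tuning: every parameter is trainable, but the learning rate is an
# order of magnitude smaller than you would use from random
# initialisation, so the pretrained weights move only gently.
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```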
That is the entire recipe. Three steps, two flavours, and you are off.
Why it works
Why should weights trained on photographs of cats and trucks be any use for chest X-rays or satellite imagery? The answer lies in what different layers of a CNN actually learn.
If you visualise the filters in the first convolutional layer of a trained network (there is a famous figure of this for AlexNet), you see oriented edges at various angles, simple colour blobs and small frequency patterns. These are the building blocks of vision itself. Any natural image, whether a portrait, an X-ray or a galaxy, is made of edges at various orientations and local patches of colour or intensity. The first layer is a visual alphabet, and the alphabet is the same regardless of what is written with it.
The middle layers learn combinations of those primitives: textures, contour fragments, simple shapes, patterns. These are still fairly generic. The texture of fur is not a million miles from the texture of bone trabeculae on an X-ray. The contour of a wheel rim is geometrically similar to the edge of a tumour. By the time you get into the late layers, the features become specific to the source task: entire object parts like dog faces, car bumpers and bird beaks. These are the features least likely to transfer, because they are tied to the categories the network was originally asked to discriminate.
This gives a clean intuition for what to do. Early layers contain general-purpose vision features and should mostly be left alone. Late layers contain task-specific features and should be replaced or retrained. The deeper into the network you go, the more aggressive your changes should be. Transfer learning works because the early-layer features really are general, a property that researchers have verified by training networks on completely unrelated source datasets and finding that the early layers still look more or less the same.
A worked example: ImageNet to a small medical dataset
Let us return to the pneumonia example and work through it concretely. You have five thousand labelled chest X-rays, split eighty-twenty into training and validation. The labels are binary: pneumonia present or not.
Approach one is to train a ResNet-50 from scratch with random initialisation. With only four thousand training images and twenty-five million parameters, the network overfits within a few epochs. You add data augmentation (random crops, rotations, contrast jitter) and tune everything you can. You end up with an area under the ROC curve of about 0.70: well above chance, but nowhere near clinically useful.
Approach two is transfer learning. You load resnet50(weights="IMAGENET1K_V2"). You replace the final 1000-way classifier with a single linear layer of two outputs (or one, with binary cross-entropy). You move the model to the GPU, set the optimiser to AdamW with a learning rate of $10^{-4}$, and train for twenty epochs with the same augmentation as before. The network converges in a few hours. Validation AUC ends up at 0.92.
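A condensed sketch of that training loop follows. The train_loader is assumed to be a standard DataLoader yielding batches of preprocessed X-ray tensors and binary labels; everything else is as described above.

```python
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 1)   # one logit, binary cross-entropy
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

model.train()
for epoch in range(20):
    for images, labels in train_loader:   # train_loader: assumed DataLoader of (image, label) batches
        images = images.to(device)
        labels = labels.float().to(device)

        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```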
Why such a vast gap? Because the pretrained network already knows what edges, textures and shapes look like; it does not need to relearn that from your four thousand X-rays. It only needs to learn what those features mean in the context of pneumonia, which is a far easier problem with far fewer parameters effectively at stake. The pretrained backbone has done ninety per cent of the work for you; your task is to fine-tune the last ten per cent.
This pattern, a large jump in AUC from transfer learning on small medical datasets, is so reliable that it has become the default. Almost every published medical-imaging paper of the last decade starts from an ImageNet-pretrained backbone, even though X-rays look nothing like ImageNet photographs. The features really are that general.
Catastrophic forgetting
Fine-tuning is not without hazards. The chief one is catastrophic forgetting: if you train too aggressively, you can erase the very pretrained features you wanted to keep. This happens when the learning rate is too large, or when the task is too narrow, or when you train for too long on too little target data. The network drifts away from its useful starting point and ends up worse than if you had used feature extraction.
Several mitigations are standard. Discriminative learning rates assign different learning rates to different parts of the network: a higher rate for the new head, an intermediate rate for the late convolutional layers, and a much smaller rate for the early ones. The reasoning is that the early layers are the most general, the most reusable and the most expensive to repair if you damage them, so they should change least.
A related approach is layer-wise learning rate decay, where the learning rate decreases geometrically as you go deeper into the network from the head. This is now a near-universal default for fine-tuning large vision and language models, often with a decay factor around 0.9 per layer.
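Both ideas are expressed through optimiser parameter groups. The sketch below uses torchvision's ResNet block names (layer1 to layer4); the rates shown are illustrative.

```python
import torch

# Discriminative learning rates: the new head learns fastest, late blocks
# more slowly, and anything not listed here receives no updates at all
# (i.e. it stays frozen). Add further groups, each with a smaller rate,
# to adapt the earlier layers too; a per-layer decay factor of about 0.9
# recovers the layer-wise schedule described above.
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(),     "lr": 1e-3},
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.layer3.parameters(), "lr": 1e-5},
])
```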
A third mitigation is gradual unfreezing. You freeze the entire backbone, train only the head for a few epochs until it has stabilised, then unfreeze the last block, train for a few more, then unfreeze the next-to-last block, and so on. The network is allowed to adapt one layer at a time, never letting the random gradients of an untrained head propagate destructively through pretrained weights.
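A sketch of gradual unfreezing, again using torchvision's ResNet block names; the training passes between stages are elided.

```python
def set_trainable(module, flag):
    """Freeze or unfreeze every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = flag

# Stage 1: everything frozen except the new head.
set_trainable(model, False)
set_trainable(model.fc, True)
# ... train for a few epochs ...

# Stage 2: release the last residual block as well.
set_trainable(model.layer4, True)
# ... train for a few more epochs ...

# Stage 3: release the next block, and so on towards the input.
set_trainable(model.layer3, True)
# Rebuild the optimiser (or its parameter groups) at each stage so the
# newly unfrozen weights are actually updated.
```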
Used together, these techniques make fine-tuning almost foolproof. You will rarely see catastrophic forgetting in practice unless you are doing something egregious with your learning rate.
Domain adaptation
Sometimes the source and target distributions differ in a structural way. Natural photographs and medical scans are one obvious example; computer-rendered synthetic images and real-world photographs are another; daytime driving footage and night-time footage a third. The pretrained features still help, but there is a residual gap that ordinary fine-tuning cannot close, especially if your target labels are scarce.
Domain adaptation is the family of techniques that try to bridge that gap. Domain-adversarial training adds a second classifier that tries to predict whether a feature vector came from the source or the target distribution; the main network is trained to fool it. The result is a feature extractor that produces representations indistinguishable across domains, which the task classifier can then use without worrying about which side of the gap any given example came from.
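The heart of domain-adversarial training is a gradient-reversal layer: an identity on the forward pass that flips (and scales) the gradient on the backward pass, so the feature extractor is pushed to make the domain classifier's job impossible. A minimal sketch, assuming the backbone and task classifier are defined elsewhere:

```python
import torch
from torch import nn
from torch.autograd import Function

class GradientReversal(Function):
    """Identity forward; multiplies the incoming gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """Tries to predict whether a feature vector came from source or target."""
    def __init__(self, feature_dim=2048, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),   # one logit: source vs. target
        )

    def forward(self, features):
        # Because of the reversal, minimising the domain loss here pushes
        # the upstream feature extractor to *maximise* it, i.e. to produce
        # features the discriminator cannot tell apart.
        reversed_features = GradientReversal.apply(features, self.lam)
        return self.classifier(reversed_features)
```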
Cycle-consistent translation, introduced by CycleGAN, learns to convert images from one domain to another and back without paired examples, turning summer photos into winter photos, or synthetic driving scenes into realistic ones, so that you can pretrain on the translated source images. These methods matter most in industrial settings, where labelled target data is genuinely expensive and the domain gap is large.
Where transfer learning shows up
Transfer learning is no longer one technique among many. It is the structural foundation of modern AI.
In computer vision, ImageNet pretraining underwrites almost every applied model: medical imaging, satellite analysis, microscopy, agricultural monitoring, manufacturing inspection. The dominant workflow is download a backbone, replace the head, fine-tune.
In natural language processing, the equivalent is BERT and GPT pretraining. A model is pretrained on a massive corpus of unlabelled text using a self-supervised objective (masked language modelling for BERT, next-token prediction for GPT), and the resulting representations are fine-tuned on downstream tasks like sentiment analysis, named-entity recognition, question answering or summarisation. This pattern, established around 2018, replaced an entire generation of bespoke task-specific NLP architectures within about two years.
CLIP, published by OpenAI in 2021, took the idea further. It pretrains an image encoder and a text encoder jointly on four hundred million image–text pairs scraped from the web, so that the embedding of an image lies close to the embedding of its caption. The image encoder transfers well, and the text encoder enables zero-shot classification: you describe your categories in words ("a photo of a tabby cat", "a photo of a chest X-ray with pneumonia") and pick whichever description has the highest similarity to the image. No fine-tuning required, no labelled examples required.
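A zero-shot classification sketch using the Hugging Face transformers implementation of CLIP; the image path and prompts are placeholders, and nothing is trained:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")   # placeholder image path
prompts = [
    "a photo of a normal chest X-ray",
    "a photo of a chest X-ray with pneumonia",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; a softmax over
# the prompts turns them into a probability for each written description.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```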
Foundation models (GPT-4, Claude, Gemini, Llama, DeepSeek) are transfer learning at the largest scale yet attempted. A single base model is pretrained at vast expense, then adapted via fine-tuning, instruction tuning, RLHF, LoRA, prompting and retrieval to thousands of downstream tasks. The pretraining cost is amortised over millions of users. This is the same idea you applied to your five thousand chest X-rays, scaled up by six orders of magnitude.
What you should take away
- Almost nobody trains an image model from scratch any more; they fine-tune a pretrained backbone, and you should too unless you have ImageNet-scale data and ImageNet-scale compute.
- The basic recipe has three steps: take a pretrained model, replace its final classifier, then either freeze the backbone (feature extraction) or train the whole thing at a small learning rate (fine-tuning).
- Transfer learning works because early CNN layers learn generic features (edges, textures, shapes) that are shared across nearly all visual tasks; only the late, task-specific layers need replacement.
- To avoid catastrophic forgetting, use small learning rates, discriminative or layer-wise decayed learning rates, and consider gradual unfreezing.
- The same idea (pretrain once on a vast dataset, adapt cheaply to many downstream tasks) underlies BERT, GPT, CLIP and the foundation models of Chapter 15. Transfer learning is no longer a trick; it is the architecture of modern AI.