Glossary

Continual Learning

Continual learning (also lifelong learning or incremental learning) is the problem of training a single model on a stream of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K$ presented sequentially, with the requirement that performance on earlier tasks be preserved as new ones are learned. Standard SGD assumes i.i.d. mini-batches; a sequential task stream breaks that assumption, so training on $\mathcal{T}_k$ overwrites the weights that solved $\mathcal{T}_{k-1}$, a phenomenon called catastrophic forgetting (McCloskey & Cohen, 1989).

Setting. Three canonical scenarios (van de Ven & Tolias, 2019):

  • Task-incremental. A task identifier is given at test time; the model can use a per-task output head.
  • Domain-incremental. The label space stays the same across tasks but the input distribution shifts (e.g. a driving model moving from clear-weather to night-time scenes).
  • Class-incremental. New classes arrive in each task and must be distinguished from old ones without identifiers, the hardest case.

Method families.

(1) Regularisation-based. Add a penalty that protects parameters important to earlier tasks. Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) is canonical:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_k(\theta) + \frac{\lambda}{2} \sum_i F_{i}\, (\theta_i - \theta_{k-1, i}^*)^2,$$

where $\theta_{k-1}^*$ is the converged solution after task $k-1$ and $F_i$ is the diagonal Fisher information of the previous task's likelihood, evaluated at $\theta_{k-1}^*$. Parameters with high Fisher contribute more to the penalty, reflecting their importance. Variants include Synaptic Intelligence (SI), which accumulates an online importance estimate along the training trajectory, and Memory Aware Synapses (MAS), which measures the sensitivity of the model's outputs to each parameter.
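
A minimal PyTorch sketch of the two EWC ingredients. The function names are illustrative, and the Fisher estimate here is the empirical Fisher (mean squared gradient of the loss on the previous task's data), a common approximation rather than a reference implementation:

```python
import torch

def estimate_fisher(model, loader, criterion, device="cpu"):
    """Diagonal empirical Fisher: mean squared gradient of the task loss
    over the previous task's data, evaluated at the converged weights."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_batches = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        criterion(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, theta_star, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2, as in the loss above."""
    penalty = sum((fisher[n] * (p - theta_star[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# Usage after converging on task k-1 (both dicts are kept for the next task):
# fisher = estimate_fisher(model, prev_task_loader, criterion)
# theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = criterion(model(x), y) + ewc_penalty(model, theta_star, fisher, lam=100.0)
```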

(2) Replay-based. Store or generate samples from past tasks and interleave them with new-task data. Experience Replay with a small reservoir buffer is a strong baseline. iCaRL (Rebuffi et al., 2017) keeps exemplars chosen by herding to match class means; Generative Replay (Shin et al., 2017) uses a learned generator instead. Replay tackles all incremental scenarios but raises privacy concerns when raw data must be retained.
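
The buffer behind the experience-replay baseline is usually filled by reservoir sampling; a hedged sketch (class name and interface are illustrative):

```python
import random

class ReservoirBuffer:
    """Fixed-size replay memory filled by reservoir sampling, so every example
    seen so far has equal probability of being retained."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, example):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```

During training on the current task, each mini-batch is augmented with a small sample from the buffer, so every update's gradient reflects both old and new data.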

(3) Parameter-isolation / architectural. Allocate task-specific parameters. Progressive Networks (Rusu et al., 2016) freeze old columns and add new ones with lateral connections. PackNet (Mallya & Lazebnik, 2018) iteratively prunes and re-uses unused weights. Dynamically Expandable Networks (Yoon et al., 2018) grow capacity only as needed. These methods avoid forgetting by construction but inflate model size with the number of tasks.
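
The simplest instance of this idea is a shared backbone with one output head per task, usable only when the task identity is known at test time; a minimal sketch (class and argument names are hypothetical, and this is far simpler than Progressive Networks or PackNet):

```python
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Shared feature extractor with one output head per task: a basic form of
    parameter isolation for the task-incremental setting."""
    def __init__(self, backbone, feat_dim, classes_per_task, n_tasks):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, classes_per_task) for _ in range(n_tasks)]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.backbone(x))
```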

(4) Optimisation-based / meta-learning. Meta-train an initialisation that is robust to forgetting (OML, ANML), or constrain each gradient step so that it does not increase the loss on a small episodic memory of past-task examples (Gradient Episodic Memory and its faster variant A-GEM).
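
A-GEM's projection step is small enough to show directly. The sketch below assumes the current-batch and memory-batch gradients have already been flattened into single vectors:

```python
import torch

def agem_project(grad, grad_ref):
    """A-GEM projection: if the proposed update conflicts with the memory
    gradient (negative dot product), remove the conflicting component.

    grad     : flattened gradient of the loss on the current mini-batch
    grad_ref : flattened gradient of the loss on a batch drawn from memory
    """
    dot = torch.dot(grad, grad_ref)
    if dot < 0:
        grad = grad - (dot / torch.dot(grad_ref, grad_ref)) * grad_ref
    return grad
```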

Evaluation metrics (Lopez-Paz & Ranzato, 2017). Define $a_{k,j}$ as accuracy on task $j$ after training on tasks $1, \ldots, k$. Then

  • Average accuracy: $\bar{A} = \tfrac{1}{K} \sum_{j=1}^K a_{K,j}$.
  • Backward transfer: $\mathrm{BWT} = \tfrac{1}{K-1}\sum_{j=1}^{K-1} (a_{K,j} - a_{j,j})$. Negative BWT indicates forgetting.
  • Forward transfer: $\mathrm{FWT} = \tfrac{1}{K-1} \sum_{j=2}^{K} (a_{j-1, j} - b_j)$, where $b_j$ is the accuracy of a randomly initialised model on task $j$.
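
Given the full accuracy matrix $a_{k,j}$ and the random-initialisation baseline $b_j$, all three metrics reduce to simple averages; a sketch in NumPy with 0-indexed arrays:

```python
import numpy as np

def cl_metrics(acc, baseline):
    """acc[k, j]: accuracy on task j after training through task k (K x K).
    baseline[j]: accuracy of a randomly initialised model on task j."""
    K = acc.shape[0]
    avg_acc = acc[K - 1].mean()
    bwt = np.mean([acc[K - 1, j] - acc[j, j] for j in range(K - 1)])
    fwt = np.mean([acc[j - 1, j] - baseline[j] for j in range(1, K)])
    return avg_acc, bwt, fwt
```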

Why it matters. Real-world AI systems (clinical decision support, robotic agents, recommenders) face non-stationary distributions and cannot be retrained from scratch each time the data shift. Continual learning is also a touchstone for biological plausibility, since brains learn sequentially without catastrophic interference.

State of the art. No single method dominates all scenarios. Class-incremental learning remains the hardest setting; even with replay, gaps to joint-training baselines persist. Recent work combines large pretrained backbones with prompt-tuning (L2P, DualPrompt), exploiting the rich representations of foundation models to reduce forgetting at the cost of increased memory.

Related terms: Federated Learning, MAML, Gradient Descent

