Model-Agnostic Meta-Learning (MAML), introduced by Finn, Abbeel, and Levine (2017), is a meta-learning algorithm that learns model parameters $\theta$ such that a small number of gradient steps from $\theta$ on a new task produces a high-performing task-specific model. The "model-agnostic" qualifier reflects that MAML is a generic procedure compatible with any model trained by gradient descent: supervised classification, regression, or reinforcement learning.
Bilevel optimisation. Assume a distribution $p(\mathcal{T})$ over tasks; each task $\mathcal{T}_i$ has a loss $\mathcal{L}_i$, a small support set $\mathcal{D}_i^{\text{tr}}$ for fine-tuning, and a query set $\mathcal{D}_i^{\text{val}}$ for evaluation. The inner loop adapts $\theta$ to task $i$ by one or more gradient steps on the support set:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_i(\theta;\, \mathcal{D}_i^{\text{tr}}).$$
The outer loop updates $\theta$ to minimise the post-adaptation loss on the query set, summed over a batch of tasks:
$$\theta^* = \arg\min_\theta\; \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_i\!\big(\theta_i';\, \mathcal{D}_i^{\text{val}}\big) \;=\; \arg\min_\theta \sum_i \mathcal{L}_i\!\big(\theta - \alpha \nabla_\theta \mathcal{L}_i(\theta);\, \mathcal{D}_i^{\text{val}}\big).$$
The outer gradient is
$$\nabla_\theta \mathcal{L}_i(\theta_i') = (\mathbf{I} - \alpha \nabla^2_\theta \mathcal{L}_i(\theta;\, \mathcal{D}_i^{\text{tr}}))^\top\, \nabla_{\theta_i'} \mathcal{L}_i(\theta_i';\, \mathcal{D}_i^{\text{val}}).$$
The Hessian-vector product is computed automatically by reverse-mode autodiff through the inner update.
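The outer-gradient formula can be verified numerically on a toy quadratic loss, where the Hessian is a constant matrix and everything is computable in closed form. The following numpy sketch is purely illustrative (the matrices $A$ and vectors $b$ stand in for per-task losses; none of these names come from the original paper): it compares the closed-form expression $(\mathbf{I} - \alpha \nabla^2 \mathcal{L}_i^{\text{tr}})^\top \nabla \mathcal{L}_i^{\text{val}}(\theta_i')$ against a finite-difference estimate of the meta-gradient.

```python
# Numerically check the MAML outer gradient on a toy quadratic loss
# L(theta; A, b) = 0.5 theta^T A theta - b^T theta, whose gradient is
# A theta - b and whose Hessian is the constant matrix A. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 3, 0.1
A_tr = np.eye(d) + 0.1 * rng.standard_normal((d, d)); A_tr = A_tr @ A_tr.T
A_val = np.eye(d) + 0.1 * rng.standard_normal((d, d)); A_val = A_val @ A_val.T
b_tr, b_val = rng.standard_normal(d), rng.standard_normal(d)
theta = rng.standard_normal(d)

def grad(theta, A, b):
    """Gradient of the quadratic loss."""
    return A @ theta - b

# Inner step on the support loss, then the closed-form outer gradient:
# (I - alpha * Hessian_tr)^T @ grad_val(theta')
theta_p = theta - alpha * grad(theta, A_tr, b_tr)
outer = (np.eye(d) - alpha * A_tr).T @ grad(theta_p, A_val, b_val)

# Finite-difference check of d/dtheta [ L_val(theta - alpha * grad_tr(theta)) ]
def meta_loss(theta):
    tp = theta - alpha * grad(theta, A_tr, b_tr)
    return 0.5 * tp @ A_val @ tp - b_val @ tp

eps, fd = 1e-6, np.zeros(d)
for j in range(d):
    e = np.zeros(d); e[j] = eps
    fd[j] = (meta_loss(theta + e) - meta_loss(theta - e)) / (2 * eps)

assert np.allclose(outer, fd, atol=1e-5)  # both compute the same meta-gradient
```

In a deep-learning framework the same quantity comes out of reverse-mode autodiff through the inner update, without ever materialising the Hessian.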
Algorithm.
- Sample a batch of $B$ tasks $\{\mathcal{T}_i\}$ from $p(\mathcal{T})$.
- For each $\mathcal{T}_i$:
  - sample support $\mathcal{D}_i^{\text{tr}}$ and query $\mathcal{D}_i^{\text{val}}$;
  - compute adapted parameters $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_i(\theta;\, \mathcal{D}_i^{\text{tr}})$.
- Update meta-parameters: $\theta \leftarrow \theta - \beta\, \nabla_\theta\, \tfrac{1}{B} \sum_i \mathcal{L}_i(\theta_i';\, \mathcal{D}_i^{\text{val}})$.
- Repeat until convergence.
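The loop above can be sketched end to end on a deliberately tiny task family (an illustrative setup, not from the original paper): each task $i$ draws samples from $\mathcal{N}(\mu_i, 0.1^2)$ with $\mu_i \sim \mathcal{N}(0,1)$, and a scalar model $\theta$ is trained with loss $\mathcal{L}_i(\theta) = \text{mean}((\theta - x)^2)$. Because this loss is quadratic with constant Hessian $2$, the second-order outer gradient is just $(1 - 2\alpha)$ times the query-set gradient at $\theta_i'$.

```python
# A minimal MAML loop on a toy scalar task family. All task and model
# choices here are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, B = 0.1, 0.05, 8      # inner lr, outer lr, tasks per batch
theta = 5.0                         # start far from every task optimum

def loss_grad(theta, xs):
    """dL/dtheta for L(theta) = mean((theta - x)^2)."""
    return 2.0 * (theta - xs.mean())

for step in range(500):
    meta_grad = 0.0
    for _ in range(B):
        mu = rng.standard_normal()                    # sample a task
        x_tr = mu + 0.1 * rng.standard_normal(5)      # support set
        x_val = mu + 0.1 * rng.standard_normal(5)     # query set
        theta_p = theta - alpha * loss_grad(theta, x_tr)   # inner step
        # Outer gradient (1 - alpha * Hessian) * grad_val(theta');
        # the Hessian of this quadratic loss is the constant 2.
        meta_grad += (1.0 - 2.0 * alpha) * loss_grad(theta_p, x_val)
    theta -= beta * meta_grad / B                     # outer update
```

Since the task means are centred at zero, the learned initialisation drifts toward $\theta \approx 0$: the point from which a single inner step lands near any sampled task's optimum.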
Variants.
- First-Order MAML (FOMAML). Drops the second-order term in the outer gradient, treating $\theta_i'$ as if its dependence on $\theta$ were the identity. Far cheaper, often nearly as effective.
- Reptile (Nichol et al., 2018). An even simpler alternative: do $K$ inner SGD steps, then update $\theta \leftarrow \theta + \beta(\theta_i^{(K)} - \theta)$. No second-order computation, no separate query split.
- MAML++ (Antoniou et al., 2019). A bag of stabilisation tricks: per-step learnable inner learning rates, batch-norm running statistics handled per task, gradient clipping.
- iMAML (Rajeswaran et al., 2019). Replaces inner SGD with implicit differentiation through a regularised inner optimum, decoupling the outer gradient from the inner trajectory length.
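Reptile's simplicity is easy to see in code. The sketch below reuses the same toy scalar task family as above (an illustrative assumption, not Nichol et al.'s experimental setup): run $K$ inner SGD steps on one sampled task, then move $\theta$ a fraction $\beta$ toward the adapted parameters, with no query split and no second-order terms.

```python
# Reptile on a toy scalar task family: samples from N(mu_i, 0.1^2),
# loss L(theta) = mean((theta - x)^2). Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, K = 0.1, 0.1, 5
theta = 5.0

for step in range(1000):
    mu = rng.standard_normal()                  # sample a task
    xs = mu + 0.1 * rng.standard_normal(10)     # its training samples
    theta_k = theta
    for _ in range(K):                          # K inner SGD steps
        theta_k -= alpha * 2.0 * (theta_k - xs.mean())
    theta += beta * (theta_k - theta)           # Reptile outer update
```

As with MAML on this family, $\theta$ drifts toward the centre of the task distribution, from which a few inner steps reach any task's optimum.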
Empirical results. On Omniglot 5-way 1-shot classification MAML reaches $\sim 99\%$ accuracy after one gradient step at test time. On miniImageNet 5-way 5-shot it reaches $\sim 63\%$. In meta-RL it learns control policies that adapt to new MuJoCo tasks within a handful of episodes.
Interpretation. MAML can be viewed as learning an initialisation lying near a manifold of good task-specific optima, so that one gradient step jumps from $\theta$ onto the right region. Connections have been drawn to multi-task learning, transfer learning, and pre-trained language models, including the view that in-context learning in LLMs acts as a kind of implicit MAML at the activation level.
Limitations.
- Second-order computation is memory-intensive; FOMAML is the practical default.
- Sensitive to inner step size and number of inner steps.
- Performance gains shrink relative to simple pretraining once large amounts of meta-training data are available.
Related terms: Gradient Descent, Continual Learning
Discussed in:
- Chapter 13: Attention & Transformers, Meta-Learning