A mixture of experts (MoE) layer replaces a single dense feed-forward layer with $E$ expert sub-networks $\{f_e\}_{e=1}^E$ (each typically an FFN) and a router $g$ that, for each input token, selects a small subset of experts to evaluate.
Top-$k$ routing: for input token $x$, compute routing logits $h = W_g x$, take the top-$k$ entries, and apply softmax over them:
$$\mathrm{TopK}(h)_e = \begin{cases} h_e & \text{if } h_e \in \text{top-}k \text{ entries} \\ -\infty & \text{otherwise} \end{cases}$$
$$g_e(x) = \mathrm{softmax}(\mathrm{TopK}(h))_e$$
MoE output:
$$y = \sum_{e: g_e(x) > 0} g_e(x) \cdot f_e(x)$$
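The routing and combination equations above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not any particular library's API; the function names and the toy `experts` list are assumptions for the example:

```python
import numpy as np

def top_k_routing(h, k):
    """Softmax over only the top-k routing logits; all other experts get weight 0."""
    topk_idx = np.argsort(h)[-k:]                # indices of the k largest logits
    masked = np.full_like(h, -np.inf)            # TopK(h): keep top-k, others -> -inf
    masked[topk_idx] = h[topk_idx]
    z = np.exp(masked - masked[topk_idx].max())  # stable softmax; exp(-inf) = 0
    return z / z.sum()

def moe_forward(x, W_g, experts, k=2):
    """y = sum over selected experts of g_e(x) * f_e(x); only k experts are run."""
    g = top_k_routing(W_g @ x, k)
    y = np.zeros_like(x)
    for e in np.nonzero(g)[0]:                   # iterate only over the k selected experts
        y += g[e] * experts[e](x)
    return y
```

Because the gate weights are a softmax over the selected logits, they sum to 1; with identity experts the layer reduces to the identity, which is a quick sanity check.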
With top-$k$ routing, only $k$ of the $E$ experts are evaluated per token (typically $k = 1$ or $k = 2$), giving sparse computation. The model holds roughly $E$ times as many feed-forward parameters as an equivalent dense model, but activates only about $k/E$ of them per token.
A load-balancing loss prevents the router from collapsing onto a handful of experts. For a batch, let $f_i$ be the fraction of tokens routed to expert $i$ and $P_i$ the router probability for expert $i$ averaged over the batch:
$$\mathcal{L}_{\mathrm{aux}} = \alpha E \sum_i f_i \cdot P_i$$
with $\alpha$ a small balancing coefficient. This term is minimised when load is uniform across experts.
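A small NumPy sketch of this auxiliary loss (the function name and argument layout are illustrative, modelled on the Switch Transformer formulation):

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_choice, E, alpha=0.01):
    """Auxiliary loss alpha * E * sum_i f_i * P_i.

    gate_probs:    [T, E] router softmax probabilities for each token
    expert_choice: [T] index of the expert each token was actually routed to
    """
    T = gate_probs.shape[0]
    f = np.bincount(expert_choice, minlength=E) / T  # fraction of tokens per expert
    P = gate_probs.mean(axis=0)                      # mean router probability per expert
    return alpha * E * float(np.sum(f * P))
```

Under perfectly uniform routing, $f_i = P_i = 1/E$, so the loss evaluates to exactly $\alpha$; a collapsed router, where both the assignments and the probabilities concentrate on one expert, scores roughly $\alpha E$ times higher.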
Expert capacity: each expert processes at most $C = (\text{tokens per batch}/E) \times \text{capacity factor}$ tokens. Tokens routed to an expert that is already full are dropped: the MoE layer contributes zero output for them, and the residual connection passes them through unchanged. The capacity factor is typically 1.0 to 1.25.
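A sketch of capacity-limited routing, assuming a simple first-come-first-served dropping policy within the batch (real systems vary in how they prioritise tokens):

```python
import numpy as np

def route_with_capacity(expert_choice, E, capacity_factor=1.25):
    """Mark which tokens fit within per-expert capacity C = (T/E) * capacity_factor."""
    T = len(expert_choice)
    C = int(T / E * capacity_factor)
    counts = np.zeros(E, dtype=int)          # tokens accepted so far per expert
    kept = np.zeros(T, dtype=bool)
    for t, e in enumerate(expert_choice):    # first-come first-served within the batch
        if counts[e] < C:
            counts[e] += 1
            kept[t] = True                   # token t is processed by expert e
        # else: token dropped -> zero MoE output, residual passes it through
    return kept, C
```

For example, with 8 tokens, 2 experts, and capacity factor 1.0, each expert can accept 4 tokens; if 6 tokens pick expert 0, the last 2 of them are dropped.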
Modern MoE models:
- Switch Transformer (Fedus et al., 2022): top-1 routing, scales to 1.6 trillion parameters.
- Mixtral 8×7B (Mistral 2023): 8 experts per layer, top-2 routing, 47B total / 13B active parameters.
- DeepSeek-V3 (2024): 256 experts per MoE layer, top-8 routing with shared experts, 671B total / 37B active.
- GPT-4 (rumoured): ~16 experts, top-2 routing.
Computational benefits: at matched quality, MoE inference is several-fold faster than a dense model's, since only the active $k/E$ fraction of parameters incurs FLOPs per token. The trade-off is memory: the full parameter count must fit in (distributed) GPU memory even though only a fraction is used per token.
Engineering challenges: efficient MoE inference requires careful expert placement across GPUs (each token's selected expert may live on a different device, requiring all-to-all communication), overlapping communication with computation, and routing-strategy choices that minimise tail latency. Production MoE inference stacks (e.g. vLLM's MoE support, SGLang) represent substantial engineering investment.
Related terms: Mixture of Experts, Noam Shazeer, Transformer
Discussed in:
- Chapter 15: Modern AI