A mixture of experts (MoE) layer replaces a single dense feed-forward layer with $E$ expert sub-networks $\{f_e\}_{e=1}^E$ (each typically an FFN) and a router $g$ that, for each input token, selects a small subset of experts to evaluate.
Top-$k$ routing: for input token $x$, compute routing logits $h = W_g x$, take the top-$k$ entries, and apply softmax over them:
$$\mathrm{TopK}(h)_e = \begin{cases} h_e & \text{if } h_e \in \text{top-}k \text{ entries} \\ -\infty & \text{otherwise} \end{cases}$$
$$g_e(x) = \mathrm{softmax}(\mathrm{TopK}(h))_e$$
MoE output:
$$y = \sum_{e: g_e(x) > 0} g_e(x) \cdot f_e(x)$$
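The routing and combination equations above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not any particular library's API; the function names and the toy `experts` list are assumptions for the example:

```python
import numpy as np

def top_k_routing(h, k):
    """Softmax over only the top-k routing logits; all other experts get weight 0."""
    topk_idx = np.argsort(h)[-k:]                # indices of the k largest logits
    masked = np.full_like(h, -np.inf)            # TopK(h): keep top-k, others -> -inf
    masked[topk_idx] = h[topk_idx]
    z = np.exp(masked - masked[topk_idx].max())  # stable softmax; exp(-inf) = 0
    return z / z.sum()

def moe_forward(x, W_g, experts, k=2):
    """y = sum over selected experts of g_e(x) * f_e(x); only k experts are run."""
    g = top_k_routing(W_g @ x, k)
    y = np.zeros_like(x)
    for e in np.nonzero(g)[0]:                   # iterate only over the k selected experts
        y += g[e] * experts[e](x)
    return y
```

Because the gate weights are a softmax over the selected logits, they sum to 1; with identity experts the layer reduces to the identity, which is a quick sanity check.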
With top-$k$ routing, only $k$ of the $E$ experts are evaluated per token (typically $k = 1$ or $k = 2$), giving sparse computation. The model holds roughly $E$ times as many feed-forward parameters as an equivalent dense model, but activates only about $k/E$ of them per token.
A load-balancing loss prevents the router from collapsing onto a handful of experts. For a batch, let $f_i$ be the fraction of tokens routed to expert $i$ and $P_i$ the router probability for expert $i$ averaged over the batch:
$$\mathcal{L}_{\mathrm{aux}} = \alpha E \sum_i f_i \cdot P_i$$
with $\alpha$ a small balancing coefficient. This term is minimised when load is uniform across experts.
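A small NumPy sketch of this auxiliary loss (the function name and argument layout are illustrative, modelled on the Switch Transformer formulation):

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_choice, E, alpha=0.01):
    """Auxiliary loss alpha * E * sum_i f_i * P_i.

    gate_probs:    [T, E] router softmax probabilities for each token
    expert_choice: [T] index of the expert each token was actually routed to
    """
    T = gate_probs.shape[0]
    f = np.bincount(expert_choice, minlength=E) / T  # fraction of tokens per expert
    P = gate_probs.mean(axis=0)                      # mean router probability per expert
    return alpha * E * float(np.sum(f * P))
```

Under perfectly uniform routing, $f_i = P_i = 1/E$, so the loss evaluates to exactly $\alpha$; a collapsed router, where both the assignments and the probabilities concentrate on one expert, scores roughly $\alpha E$ times higher.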
Expert capacity: each expert processes at most $C = (\text{tokens per batch}/E) \times \text{capacity factor}$ tokens. Tokens routed to an expert that is already full are dropped: the MoE layer contributes zero output for them, and the residual connection passes them through unchanged. The capacity factor is typically 1.0 to 1.25.
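A sketch of capacity-limited routing, assuming a simple first-come-first-served dropping policy within the batch (real systems vary in how they prioritise tokens):

```python
import numpy as np

def route_with_capacity(expert_choice, E, capacity_factor=1.25):
    """Mark which tokens fit within per-expert capacity C = (T/E) * capacity_factor."""
    T = len(expert_choice)
    C = int(T / E * capacity_factor)
    counts = np.zeros(E, dtype=int)          # tokens accepted so far per expert
    kept = np.zeros(T, dtype=bool)
    for t, e in enumerate(expert_choice):    # first-come first-served within the batch
        if counts[e] < C:
            counts[e] += 1
            kept[t] = True                   # token t is processed by expert e
        # else: token dropped -> zero MoE output, residual passes it through
    return kept, C
```

For example, with 8 tokens, 2 experts, and capacity factor 1.0, each expert can accept 4 tokens; if 6 tokens pick expert 0, the last 2 of them are dropped.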
Modern MoE models:
- Switch Transformer (Fedus et al., 2022): top-1 routing, scales to 1.6 trillion parameters.
- Mixtral 8×7B (Mistral 2023): 8 experts per layer, top-2 routing, 47B total / 13B active parameters.
- DeepSeek-V3 (2024): 256 experts per MoE layer, top-8 routing with shared experts, 671B total / 37B active.
- GPT-4 (rumoured): ~16 experts, top-2 routing.
Computational benefits: at matched quality, MoE inference is several-fold faster than a dense model's, since only the active $k/E$ fraction of parameters incurs FLOPs per token. The trade-off is memory: the full parameter count must fit in (distributed) GPU memory even though only a fraction is used per token.
Engineering challenges: efficient MoE inference requires careful expert placement across GPUs (each token's selected expert may live on a different device, requiring all-to-all communication), overlapping communication with computation, and routing-strategy choices that minimise tail latency. Production MoE inference stacks (e.g. vLLM's MoE support, SGLang) represent substantial engineering investment.
Related terms: Mixture of Experts, Noam Shazeer, Transformer
Discussed in:
- Chapter 15: Modern AI