Each token activates only a few experts; the network grows in capacity without growing in compute per token.
From the chapter: Chapter 15, Modern AI
Glossary: mixture of experts
Transcript
A normal transformer feed-forward block: every parameter is used for every token.
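A minimal sketch of that dense baseline, in PyTorch. The class name and sizes are illustrative, not from the transcript: one feed-forward block, and every token touches every weight in it.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Standard transformer feed-forward block: all parameters used for every token."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)     # every token passes through
        self.down = nn.Linear(d_hidden, d_model)   # every weight is used

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))
```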
Mixture of experts replaces it with many parallel feed-forward networks, called experts. Eight, sixty-four, two hundred and fifty-six.
For each token, a small router network outputs scores over experts. Pick the top k, often two. Run only those experts on this token. Combine their outputs, weighted by the router scores.
Each token uses k expert blocks instead of all of them. Compute per token stays fixed as experts are added. Total parameter count grows with the number of experts.
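A minimal sketch of that routing, reusing the DenseFFN block above as the expert. Top-k routing with renormalised softmax gate weights is one common choice; real implementations add capacity limits and fused kernels rather than a Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Mixture-of-experts feed-forward layer with top-k routing (illustrative)."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d_model, d_hidden) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)   # scores over experts
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)              # router probabilities
        weights, indices = scores.topk(self.top_k, dim=-1)      # pick the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalise the kept scores

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```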
A model with sixty-four experts and top-two routing has roughly thirty-two times the feed-forward parameters of a dense baseline whose block matches the two active experts, at the same FLOPs per token.
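The back-of-the-envelope arithmetic behind that thirty-two, with the dense baseline sized to match the compute of the two active experts. Units are arbitrary; only the ratios matter.

```python
# Parameter and compute ratios for 64 experts, top-2 routing.
num_experts, top_k = 64, 2
expert_params = 1.0                        # one expert's FFN parameters, arbitrary units

moe_params = num_experts * expert_params   # 64 units stored
moe_active = top_k * expert_params         # 2 units of compute per token
dense_params = dense_active = moe_active   # compute-matched dense FFN

print(moe_params / dense_params)   # 32.0 -> parameter ratio in the feed-forward layers
print(moe_active / dense_active)   # 1.0  -> same FLOPs per token
```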
Training challenges. Load balancing: each expert must see a similar share of tokens, or some starve while others saturate. Auxiliary losses encourage even routing.
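One published form of such a loss is the Switch Transformer auxiliary term, num_experts times the sum over experts of f_i times P_i, where f_i is the fraction of tokens whose top choice is expert i and P_i is the mean router probability for expert i. A sketch under those definitions, not a drop-in for any specific codebase:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss; minimised when routing is uniform."""
    # router_logits: (num_tokens, num_experts)
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)                             # hard assignment per token
    f = F.one_hot(top1, num_experts).float().mean(dim=0)    # fraction of tokens per expert
    p = probs.mean(dim=0)                                   # mean gate probability per expert
    return num_experts * torch.sum(f * p)
```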
Mixture of experts powers Switch Transformer, GLaM, Mixtral, DeepSeek, and recent Llama variants. The key insight: scale model capacity through specialisation, not through dense compute.
At inference time, the router decides per token. Two experts process a token; the others are idle. This makes MoE cheap in compute per token at deployment, especially with kernels that handle expert sparsity efficiently, though every expert still has to be held in memory.
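An illustrative check using the MoELayer sketch above: per token, the router names only top_k expert indices, and only those experts do any work for that token.

```python
# Inspect which experts the router would activate per token (hypothetical sizes).
moe = MoELayer(d_model=16, d_hidden=64, num_experts=8, top_k=2)
tokens = torch.randn(5, 16)                      # five tokens

with torch.no_grad():
    scores = F.softmax(moe.router(tokens), dim=-1)
    _, chosen = scores.topk(moe.top_k, dim=-1)
print(chosen)   # two expert indices per token; the other six experts are skipped
```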
Capacity without proportional compute. The recipe behind much of recent scaling.