Also known as: MoE
A Mixture of Experts (MoE) replaces standard dense layers with a collection of expert networks and a gating mechanism that routes each input to only a small subset of experts. In a typical MoE transformer, the feed-forward sub-layer is replaced by $N$ expert networks (each a standard MLP), and a learned gate routes each token to the top-$k$ (typically 1 or 2) experts based on the token's representation. The outputs of the chosen experts are combined, and the unused experts contribute nothing.
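The routing described above can be sketched in a few lines. This is a minimal single-token illustration in NumPy, not a production implementation: the expert networks are stand-in linear maps, and the gate is a plain learned matrix; real systems batch tokens and run experts in parallel.

```python
import numpy as np

def moe_forward(x, W_gate, experts, k=2):
    """Route one token x to its top-k experts and combine their outputs.

    x       : (d,) token representation
    W_gate  : (d, N) learned gating weights (N = number of experts)
    experts : list of N callables, each mapping (d,) -> (d,)
    """
    logits = x @ W_gate                # (N,) gating logits, one per expert
    top = np.argsort(logits)[-k:]      # indices of the top-k experts
    # Softmax over the selected logits only; unselected experts get zero weight
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    # Weighted combination of the chosen experts; the rest contribute nothing
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy usage: 4 "experts" that are simple random linear maps (hypothetical)
rng = np.random.default_rng(0)
d, N = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)) / np.sqrt(d): x @ W
           for _ in range(N)]
W_gate = rng.standard_normal((d, N))
y = moe_forward(rng.standard_normal(d), W_gate, experts, k=2)
print(y.shape)  # (8,)
```

Note that only the `k` selected experts are ever evaluated; this is the source of the compute savings discussed next.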
The key advantage of MoE is that total parameter count can scale far beyond what a dense model could accommodate, while computational cost per token stays roughly constant. A model with 64 experts has roughly 64× more feed-forward parameters than an equivalent dense model, yet with top-2 routing each token activates only 2 of those experts, so per-token compute grows only modestly. This allows very large sparse models, including the trillion-parameter-scale Switch Transformer and GLaM and the smaller open-weight Mixtral, to be trained and served at favourable efficiency.
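The scaling arithmetic can be made explicit. A back-of-envelope sketch, under the simplifying assumptions that each expert is the same size as the dense model's feed-forward layer and that attention and gating costs are ignored:

```python
# Back-of-envelope MoE scaling, normalised so the dense FFN = 1 unit.
# Assumptions (simplifications, not exact accounting): every expert matches
# the dense FFN in size; attention and gating overhead are ignored.
dense_ffn_params = 1.0
n_experts, top_k = 64, 2

moe_ffn_params = n_experts * dense_ffn_params        # parameters: ~64x dense
moe_ffn_compute_per_token = top_k * dense_ffn_params # compute: only ~2x dense

print(moe_ffn_params, moe_ffn_compute_per_token)  # 64.0 2.0
```

The gap between those two numbers, 64× the parameters for roughly 2× the compute, is the efficiency argument for MoE.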
The main challenges of MoE are training stability and load balancing. Without constraints, the gate tends to route most tokens to a few favoured experts, wasting capacity and producing imbalanced computation; auxiliary load-balancing losses and per-expert capacity limits address this. The top-$k$ selection itself is non-differentiable, so gradients flow only through the chosen experts and their gate weights; some approaches additionally use tricks such as Gumbel-softmax or straight-through estimators. Modern MoE implementations are increasingly efficient and robust, and the approach has become an important tool for scaling models beyond what dense architectures can achieve economically. Mixtral-8x7B from Mistral AI demonstrated that high-quality open-weight MoE models are practical.
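A common form of the load-balancing loss mentioned above can be sketched as follows. This follows the Switch-Transformer-style formulation (fraction of tokens dispatched to each expert times mean gate probability for that expert); exact details vary between implementations.

```python
import numpy as np

def load_balance_loss(gate_probs, expert_assignment, n_experts):
    """Auxiliary load-balancing loss (Switch-Transformer-style sketch).

    gate_probs        : (T, N) softmax gate probabilities per token
    expert_assignment : (T,) index of the expert each token was routed to
    """
    T = gate_probs.shape[0]
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / T
    # P_i: mean gate probability assigned to expert i
    P = gate_probs.mean(axis=0)
    # Minimised when both dispatch and probability mass are uniform
    return n_experts * np.sum(f * P)

# Toy check: perfectly uniform routing over 4 experts gives a loss of 1.0,
# the minimum; skewed routing pushes the value above 1.
T, N = 8, 4
probs = np.full((T, N), 1.0 / N)
assign = np.arange(T) % N
loss = load_balance_loss(probs, assign, N)
print(loss)  # 1.0
```

Adding a small multiple of this term to the training objective penalises the gate for collapsing onto a few experts.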
Related terms: Transformer, Large Language Model
Discussed in:
- Chapter 13: Attention & Transformers — Transformer Variants
- Chapter 15: Modern AI — Efficient AI
Also defined in: Textbook of AI