13.18 Sparse attention and mixture of experts

A different way to scale: instead of making every parameter active for every input, route each input through a small subset of parameters. Mixture of Experts (MoE) is the modern instantiation.

The idea

Replace the FFN of each Transformer block with a bank of $E$ FFNs (each called an expert). For each input token, a router network produces a probability distribution over the experts and selects the top $k$, typically $k = 1$ or $k = 2$ out of $E = 8$ to $E = 256$. Only the selected experts process the token; the others are skipped.

If $k = 2$ and $E = 8$, the FFN parameter count grows by roughly $8 \times$ but the FFN FLOPs per token grow by only roughly $2 \times$ (the router itself adds negligible cost). You get a much larger model for not much more compute. This decouples capacity from inference cost.
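
To make the arithmetic concrete, here is a back-of-the-envelope sketch; the dimensions are illustrative (loosely Mixtral-sized), not taken from any particular model.

```python
# FFN-only arithmetic for the k = 2, E = 8 example; attention and embeddings are unchanged.
d_model, d_ff = 4096, 14336      # illustrative sizes, not any specific model's
E, k = 8, 2

expert_params = 2 * d_model * d_ff       # up- and down-projection of one FFN expert
dense_params  = expert_params            # a dense block has a single FFN
moe_params    = E * expert_params        # the MoE block stores all E experts

dense_flops = 2 * expert_params          # ~2 FLOPs per weight per token
moe_flops   = k * 2 * expert_params      # only the k selected experts run

print(f"params ~{moe_params // dense_params}x, FLOPs per token ~{moe_flops // dense_flops}x")
# -> params ~8x, FLOPs per token ~2x (the router's cost is negligible and omitted)
```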

Routing

The router is a small linear layer: $\mathbf{r} = \mathbf{W}_r \mathbf{x} \in \mathbb{R}^E$. Take the top-$k$ entries, softmax them, and weight the selected experts by the resulting probabilities. The router is trained jointly with the rest of the network (a minimal code sketch follows the list below). Two pathologies have to be controlled:

  1. Load imbalance: the router collapses to always picking the same one or two experts. Mitigated with load-balancing losses (penalise uneven expert usage) and capacity factors (cap the number of tokens any one expert can receive per batch).
  2. Routing instability: tokens flip between experts during training. Mitigated with router z-losses, soft routing, expert dropout, and similar techniques.
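
A minimal sketch of top-$k$ routing with a Switch-style load-balancing auxiliary loss; the class name, defaults, and loss weighting are illustrative rather than any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Top-k expert routing with a Switch-style load-balancing auxiliary loss (sketch)."""

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_r = nn.Linear(d_model, n_experts, bias=False)   # r = W_r x
        self.n_experts, self.k = n_experts, k

    def forward(self, x):                                 # x: (n_tokens, d_model)
        probs = F.softmax(self.w_r(x), dim=-1)            # distribution over all E experts
        top_p, top_idx = probs.topk(self.k, dim=-1)
        gates = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalise over the selected k

        # Load-balancing loss: E * sum_e (fraction of tokens sent to e) * (mean prob of e),
        # minimised when tokens and probability mass are spread evenly over the experts.
        sel = F.one_hot(top_idx, self.n_experts).sum(dim=1).float()   # (n_tokens, E)
        frac_tokens = sel.mean(dim=0) / self.k
        mean_prob = probs.mean(dim=0)
        aux_loss = self.n_experts * (frac_tokens * mean_prob).sum()

        return top_idx, gates, aux_loss   # selected experts, mixing weights, auxiliary loss
```

In training, `aux_loss` is added to the language-modelling loss with a small coefficient, so balanced routing is encouraged without overriding the main objective.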

Switch Transformer, GLaM, Mixtral, DeepSeek-V3

The first large-scale demonstration was Google's Switch Transformer (Fedus et al., 2021), which used $k = 1$ routing across hundreds of experts. GLaM (Du et al., 2021) trained a 1.2-trillion-parameter MoE that activated only a fraction of those parameters per token. Mixtral 8×7B (Mistral AI, 2023) was a 47B-parameter MoE that ran at roughly the speed of a 13B dense model and matched the quality of a 70B dense model. DeepSeek-V3 (December 2024) trained a 671-billion-total / 37-billion-active-parameter MoE on 14.8 trillion tokens in 2.788 million H800 GPU-hours under FP8 mixed precision, with a reported training cost of around $5.6 million; the design uses a fine-grained 256-expert layout with shared experts, and the model matched GPT-4-class performance at a fraction of the training and inference cost. GPT-4 itself is widely believed (though never officially confirmed) to be an MoE.

The MoE story is one of the strongest current trends: capacity through sparsity, not density. The price is engineering complexity: load balancing, expert parallelism across many GPUs, and all-to-all communication patterns. At frontier scale, the rewards are worth it.

Sparse attention patterns

Beyond MoE, which sparsifies the FFN, there is also work on sparsifying the attention pattern itself. Longformer (Beltagy et al., 2020) uses sliding-window attention plus a few global tokens. BigBird (Zaheer et al., 2020) combines sliding-window, random, and global patterns. Both show that, with the right mixture of patterns, full attention's representational power can be approximated at close to linear cost.
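
A sketch of what such a pattern looks like as an attention mask, in the spirit of Longformer's sliding window plus global tokens. The function name and defaults are illustrative, and a real implementation computes only the unmasked entries rather than materialising a dense matrix.

```python
import torch

def longformer_style_mask(seq_len: int, window: int, global_idx):
    """Boolean mask (True = may attend): sliding window plus a few global tokens."""
    pos = torch.arange(seq_len)
    mask = (pos[:, None] - pos[None, :]).abs() <= window // 2   # local sliding window
    mask[:, global_idx] = True    # every token attends to the global tokens
    mask[global_idx, :] = True    # global tokens attend to every token
    return mask

mask = longformer_style_mask(seq_len=4096, window=512, global_idx=[0])
print(mask.float().mean())        # fraction of allowed entries, roughly window / seq_len
```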

In practice, sparse-attention patterns have not displaced full attention except in domains with naturally local structure (e.g. genomics, very long documents). The reason is partly that sparse attention is awkward to implement efficiently on GPUs (irregular memory access patterns), and partly that FlashAttention pushed the practical wall for full attention out far enough that for most workloads, full attention up to 100K-1M tokens is now feasible. The combination of FlashAttention + hybrid architectures has, for now, taken the pressure off pure sparse-attention solutions.

MoE training at frontier scale

The engineering of an MoE model at frontier scale is its own discipline. Each expert is a full FFN in its own right; with 64 to 256 experts per MoE layer, the model can reach hundreds of billions to trillions of parameters in total. To fit it, experts are sharded across many GPUs (expert parallelism); to route tokens, an all-to-all communication step is needed at each MoE layer, because a token's selected experts may live on different GPUs. The all-to-all becomes the dominant cost in some configurations.
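
A single-device sketch of the dispatch step, shown with $k = 1$ and a capacity factor for brevity; in a real expert-parallel setup each expert lives on its own GPU, so this grouping becomes an all-to-all exchange of token activations. The names and defaults here are illustrative.

```python
import torch

def dispatch_with_capacity(expert_idx, n_experts, capacity_factor=1.25):
    """Group tokens by their selected expert, capping each expert's load (sketch)."""
    n_tokens = expert_idx.numel()
    capacity = int(capacity_factor * n_tokens / n_experts)   # max tokens per expert
    counts = torch.zeros(n_experts, dtype=torch.long)
    kept, dropped = [], []
    for t, e in enumerate(expert_idx.tolist()):
        if counts[e] < capacity:
            counts[e] += 1
            kept.append((t, e))    # token t is sent to the device hosting expert e
        else:
            dropped.append(t)      # overflow: dropped, or carried only by the residual path
    return kept, dropped, counts
```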

DeepSeek-V3's design uses fine-grained expert decomposition (lots of small experts, a few of which are activated per token, plus a few "shared" experts that are always active). This reduces variance in expert activation and pushes more useful work through the same FLOP budget. The training of DeepSeek-V3, reportedly under $6M of compute for a frontier-quality model, demonstrated how aggressive MoE design can change the cost equation of frontier AI.
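
A sketch of that shared-plus-routed layout; the dimensions, expert counts, and the plain top-$k$ router below are illustrative, not the published DeepSeek-V3 architecture.

```python
import torch
import torch.nn as nn

class SharedPlusRoutedFFN(nn.Module):
    """One always-active shared expert plus many small routed experts (sketch)."""

    def __init__(self, d_model=1024, d_expert=256, n_routed=64, k=6):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                                 nn.Linear(d_expert, d_model))
        self.shared = ffn()                                    # always active
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.k = k

    def forward(self, x):                                      # x: (n_tokens, d_model)
        gates, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        gates = gates / gates.sum(-1, keepdim=True)
        out = self.shared(x).clone()                           # shared expert sees every token
        for j in range(self.k):                                # loops for clarity; real kernels batch this
            for e in idx[:, j].unique().tolist():
                sel = idx[:, j] == e
                out[sel] += gates[sel, j].unsqueeze(-1) * self.routed[e](x[sel])
        return out
```

The always-active shared experts can absorb patterns common to all tokens, which frees the routed experts to specialise and makes expert load easier to balance.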
