References

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, & Noam Shazeer (2021)

arXiv.

DOI: https://doi.org/10.48550/arxiv.2101.03961

Abstract. Introduces the Switch Transformer, a sparse mixture-of-experts (MoE) architecture in which each token is routed to a single expert. The paper demonstrates that MoE models with over a trillion parameters can be trained efficiently, decoupling total parameter count from per-token compute.
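To make the top-1 ("switch") routing idea concrete, below is a minimal NumPy sketch. The function name switch_route, the shapes, and the experts list are illustrative assumptions for this entry, not the paper's actual implementation; load-balancing losses, capacity factors, and expert parallelism are omitted.

```python
import numpy as np

def switch_route(tokens, router_weights, experts):
    """Top-1 ("switch") routing sketch: each token is sent to exactly one expert.

    tokens:         (num_tokens, d_model) float array of token representations
    router_weights: (d_model, num_experts) router projection (assumed shape)
    experts:        list of callables mapping (k, d_model) -> (k, d_model)
    """
    # Router logits and softmax probabilities for every token/expert pair.
    logits = tokens @ router_weights                      # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Each token goes to the single highest-probability expert.
    expert_index = probs.argmax(axis=-1)                  # (num_tokens,)
    gate = probs[np.arange(len(tokens)), expert_index]    # gating value per token

    # Apply only the selected expert to each token and scale by its gate value.
    outputs = np.empty_like(tokens)
    for e, expert_fn in enumerate(experts):
        mask = expert_index == e
        if mask.any():
            outputs[mask] = gate[mask, None] * expert_fn(tokens[mask])
    return outputs

# Illustrative usage with random dense layers as stand-in experts.
rng = np.random.default_rng(0)
d_model, num_experts = 8, 4
experts = [lambda x, W=rng.standard_normal((d_model, d_model)): x @ W
           for _ in range(num_experts)]
router_weights = rng.standard_normal((d_model, num_experts))
tokens = rng.standard_normal((16, d_model))
out = switch_route(tokens, router_weights, experts)       # (16, 8)
```

Because each token passes through exactly one expert, adding more experts grows the total parameter count without increasing per-token FLOPs, which is the decoupling the abstract refers to.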

Tags: transformer mixture-of-experts scaling

Cited in:
