References
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, & Noam Shazeer (2021)
arXiv:2101.03961.
DOI: https://doi.org/10.48550/arXiv.2101.03961
Abstract. Introduces the Switch Transformer, a mixture-of-experts model that routes each token to a single expert. The paper demonstrates that MoE models with over a trillion parameters can be trained efficiently, decoupling total parameter count from per-token compute.
Tags: transformer mixture-of-experts scaling
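To make the routing idea concrete, here is a minimal sketch (not the paper's Mesh-TensorFlow implementation) of top-1 "switch" routing as described in the abstract: a learned router picks exactly one expert per token, so adding experts grows the parameter count without increasing per-token compute. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def switch_route(tokens, router_weights, experts):
    """Top-1 (switch) routing sketch: each token is sent to exactly one expert.

    tokens:         (num_tokens, d_model) activations
    router_weights: (d_model, num_experts) router projection (assumed name)
    experts:        list of callables, one per expert feed-forward block
    """
    logits = tokens @ router_weights                       # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over experts
    chosen = probs.argmax(axis=-1)                          # top-1 expert per token

    out = np.empty_like(tokens)
    for e, expert in enumerate(experts):
        mask = chosen == e
        if mask.any():
            # Scale the expert output by its router probability; in an autodiff
            # framework this keeps the gate differentiable.
            out[mask] = expert(tokens[mask]) * probs[mask, e:e + 1]
    return out

# Toy usage: 8 tokens, d_model=4, 2 "experts" (here just random linear maps).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
router = rng.normal(size=(4, 2))
experts = [lambda x, W=rng.normal(size=(4, 4)): x @ W for _ in range(2)]
print(switch_route(tokens, router, experts).shape)  # (8, 4)
```

Because each token activates only one expert's feed-forward block, the compute per token stays roughly constant as the number of experts (and hence total parameters) grows.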