Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, & Adrian Weller (2021)
International Conference on Learning Representations.
URL: https://arxiv.org/abs/2009.14794
Abstract. Introduces Performers, an efficient Transformer variant that approximates softmax attention with linear time and space complexity. Replaces softmax attention with FAVOR+ (Fast Attention Via positive Orthogonal Random features), which uses random feature maps that give an unbiased estimator of the softmax kernel. With $r$ random features the resulting attention costs $O(n r d)$ time instead of $O(n^2 d)$, so it is linear in sequence length and enables Transformers on sequences of tens of thousands of tokens. Performers were among the first competitive linear-attention variants and informed the long-context model families that followed.
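A minimal NumPy sketch of the idea, under stated assumptions: function names are illustrative, a plain Gaussian projection stands in for the orthogonal random blocks used in the paper, and no causal masking or batching is handled.

```python
import numpy as np

def positive_random_features(x, projection):
    """FAVOR+-style positive feature map phi(x) for the softmax kernel.

    x:          (n, d) queries or keys
    projection: (m, d) random projection rows (orthogonal blocks in the paper;
                plain Gaussian here for brevity)
    """
    x = x / x.shape[-1] ** 0.25                    # d^{-1/4} scaling, so phi(q) . phi(k) ~ exp(q.k / sqrt(d))
    wx = x @ projection.T                          # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(wx - sq_norm) / np.sqrt(projection.shape[0])  # strictly positive features

def linear_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention in time linear in sequence length."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_features, q.shape[-1]))
    q_prime = positive_random_features(q, w)       # (n, m)
    k_prime = positive_random_features(k, w)       # (n, m)
    kv = k_prime.T @ v                             # (m, d_v): O(n * m * d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)     # (n,): approximates the softmax denominator
    return (q_prime @ kv) / normalizer[:, None]    # (n, d_v); the n x n matrix is never materialized
```

The key design point is associativity: computing phi(K)^T V first gives an (m, d_v) summary, so the quadratic attention matrix never appears.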
Tags: transformers efficient-attention
Cited in: