Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, & Adrian Weller (2021)
International Conference on Learning Representations.
URL: https://arxiv.org/abs/2009.14794
Abstract. Introduces Performers, an efficient Transformer variant that approximates softmax attention with linear time and space complexity. Replaces softmax attention with FAVOR+ (Fast Attention Via positive Orthogonal Random features), which uses random feature maps that give an unbiased estimator of the softmax kernel. With $r$ random features the resulting attention costs $O(n r d)$ time instead of $O(n^2 d)$, so it is linear in sequence length and enables Transformers on sequences of tens of thousands of tokens. Performers were among the first competitive linear-attention variants and informed the long-context model families that followed.
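A minimal NumPy sketch of the idea, under stated assumptions: function names are illustrative, a plain Gaussian projection stands in for the orthogonal random blocks used in the paper, and no causal masking or batching is handled.

```python
import numpy as np

def positive_random_features(x, projection):
    """FAVOR+-style positive feature map phi(x) for the softmax kernel.

    x:          (n, d) queries or keys
    projection: (m, d) random projection rows (orthogonal blocks in the paper;
                plain Gaussian here for brevity)
    """
    x = x / x.shape[-1] ** 0.25                    # d^{-1/4} scaling, so phi(q) . phi(k) ~ exp(q.k / sqrt(d))
    wx = x @ projection.T                          # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(wx - sq_norm) / np.sqrt(projection.shape[0])  # strictly positive features

def linear_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention in time linear in sequence length."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_features, q.shape[-1]))
    q_prime = positive_random_features(q, w)       # (n, m)
    k_prime = positive_random_features(k, w)       # (n, m)
    kv = k_prime.T @ v                             # (m, d_v): O(n * m * d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)     # (n,): approximates the softmax denominator
    return (q_prime @ kv) / normalizer[:, None]    # (n, d_v); the n x n matrix is never materialized
```

The key design point is associativity: computing phi(K)^T V first gives an (m, d_v) summary, so the quadratic attention matrix never appears.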
Tags: transformers efficient-attention
Cited in: