References

Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, & Adrian Weller (2021)

International Conference on Learning Representations.

URL: https://arxiv.org/abs/2009.14794

Abstract. Introduces Performers, an efficient Transformer variant that approximates softmax attention with linear time and space complexity. Softmax attention is replaced by FAVOR+ (Fast Attention Via positive Orthogonal Random features), which uses random feature maps to form an unbiased estimator of the softmax kernel. For sequence length $n$ and feature dimension $d$, the resulting attention is computable in $O(n d^2)$ rather than $O(n^2 d)$, enabling Transformers on sequences of tens of thousands of tokens. Performers were among the first competitive linear-attention variants and informed the long-context model families that followed.
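The kernel trick behind FAVOR+ can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: it uses i.i.d. Gaussian projections rather than the orthogonal random features the paper advocates, and all function names are invented for the sketch. The key point is that the $n \times n$ attention matrix is never materialized; keys and values are contracted first, giving linear cost in sequence length.

```python
import numpy as np

def positive_random_features(x, proj):
    """phi(x) with E[phi(q) @ phi(k)] approximating exp(q @ k), the softmax kernel.

    Positive features (an exp of a projection) keep the estimated attention
    weights non-negative, which is the '+' in FAVOR+.
    """
    m = proj.shape[0]
    sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ proj.T - sq_norm) / np.sqrt(m)

def favor_attention(Q, K, V, m=512, seed=0):
    """Linear-time approximation of softmax(Q K^T / sqrt(d)) V.

    Simplified sketch: i.i.d. Gaussian projections; the paper uses
    orthogonal rows to reduce estimator variance.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((m, d))
    scale = d ** -0.25                  # split the 1/sqrt(d) between Q and K
    q_f = positive_random_features(Q * scale, proj)   # (n, m)
    k_f = positive_random_features(K * scale, proj)   # (n, m)
    kv = k_f.T @ V                      # (m, d): contract over n first
    z = k_f.sum(axis=0)                 # (m,) normalizer term
    return (q_f @ kv) / (q_f @ z)[:, None]
```

Because the feature maps are positive, the implied attention weights are non-negative and normalized, so each output row remains a convex combination of value rows, just as in exact softmax attention.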

Tags: transformers efficient-attention
