Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, & Christopher Ré (2022)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv.
DOI: https://doi.org/10.48550/arXiv.2205.14135
Abstract. Introduces FlashAttention, which restructures attention computation to minimise data movement between GPU high-bandwidth memory (HBM) and on-chip SRAM. Without changing the mathematical form of attention, it computes exact attention by tiling the inputs and maintaining a running (online) softmax, avoiding materialisation of the full score matrix. FlashAttention achieves substantial wall-clock speedups and reduced memory use, and is now a de facto standard in transformer implementations.
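The core idea can be illustrated in plain NumPy: process the key/value matrix in blocks while keeping a running row-wise maximum and softmax normaliser, so the full N×N score matrix is never formed. This is a minimal single-head sketch of the tiling/online-softmax scheme for intuition only, not the paper's kernel; the block size and function names are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V, materialising all scores.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Sketch of the FlashAttention idea: stream K/V in blocks, maintaining a
    # running max `m` and normaliser `l` (online softmax) plus an unnormalised
    # output accumulator `O`. Only one block of scores exists at a time.
    n, d = Q.shape
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row-wise max of scores
    l = np.zeros(n)                  # running softmax normaliser
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)                # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])           # unnormalised block probabilities
        scale = np.exp(m - m_new)                # rescale previous accumulators
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions compute identical outputs (up to floating-point error); the tiled version simply trades one large intermediate for per-block work, which on a GPU keeps each block resident in SRAM.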
Tags: transformer attention efficiency flash-attention
Cited in: