Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, & Christopher Ré (2022)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv.
DOI: https://doi.org/10.48550/arXiv.2205.14135
Abstract. Introduces FlashAttention, which restructures attention computation to minimise data movement between GPU high-bandwidth memory (HBM) and on-chip SRAM. Without changing the mathematical form of attention, it computes exact attention by tiling the inputs and maintaining a running (online) softmax, avoiding materialisation of the full score matrix. FlashAttention achieves substantial wall-clock speedups and reduced memory use, and is now a de facto standard in transformer implementations.
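The core idea can be illustrated in plain NumPy: process the key/value matrix in blocks while keeping a running row-wise maximum and softmax normaliser, so the full N×N score matrix is never formed. This is a minimal single-head sketch of the tiling/online-softmax scheme for intuition only, not the paper's kernel; the block size and function names are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V, materialising all scores.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Sketch of the FlashAttention idea: stream K/V in blocks, maintaining a
    # running max `m` and normaliser `l` (online softmax) plus an unnormalised
    # output accumulator `O`. Only one block of scores exists at a time.
    n, d = Q.shape
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row-wise max of scores
    l = np.zeros(n)                  # running softmax normaliser
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)                # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])           # unnormalised block probabilities
        scale = np.exp(m - m_new)                # rescale previous accumulators
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions compute identical outputs (up to floating-point error); the tiled version simply trades one large intermediate for per-block work, which on a GPU keeps each block resident in SRAM.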
Tags: transformer attention efficiency flash-attention
Cited in: