References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, & Christopher Ré (2022)

arXiv preprint arXiv:2205.14135.

DOI: https://doi.org/10.48550/arXiv.2205.14135

Abstract. Introduces FlashAttention, which restructures the attention computation to minimise data movement between GPU high-bandwidth memory (HBM) and on-chip SRAM. Without changing the mathematical form of attention, FlashAttention achieves substantial wall-clock speedups and memory savings, and has become a de facto standard for attention computation on GPUs.
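
The IO saving comes from never materialising the full attention score matrix: queries, keys, and values are processed one tile at a time, with an online-softmax recurrence carrying the running row-wise maximum and normaliser across key/value tiles. The following is a minimal NumPy sketch of that recurrence for a single attention head, with no masking or dropout; the function name and block size are illustrative choices, and the real FlashAttention is a fused CUDA kernel, not Python.

```python
import numpy as np

def flash_attention_reference(Q, K, V, block_size=64):
    """Tiled attention with the online-softmax recurrence used by
    FlashAttention, in NumPy for illustration only (hypothetical helper)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)

    for q0 in range(0, n, block_size):          # one query tile at a time
        q = Q[q0:q0 + block_size]
        m = np.full(q.shape[0], -np.inf)        # running row-wise max
        l = np.zeros(q.shape[0])                # running softmax denominator
        acc = np.zeros_like(q)                  # unnormalised output accumulator

        for k0 in range(0, n, block_size):      # stream key/value tiles
            k, v = K[k0:k0 + block_size], V[k0:k0 + block_size]
            s = (q @ k.T) * scale               # scores for this tile only
            m_new = np.maximum(m, s.max(axis=-1))
            alpha = np.exp(m - m_new)           # rescale earlier statistics
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=-1)
            acc = acc * alpha[:, None] + p @ v
            m = m_new

        out[q0:q0 + block_size] = acc / l[:, None]
    return out

if __name__ == "__main__":
    # Sanity check against the naive O(n^2)-memory computation.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((200, 32)) for _ in range(3))
    s = (Q @ K.T) / np.sqrt(32)
    p = np.exp(s - s.max(-1, keepdims=True))
    naive = (p / p.sum(-1, keepdims=True)) @ V
    assert np.allclose(flash_attention_reference(Q, K, V), naive)
```

Because each tile's contribution is rescaled by `alpha` whenever a new maximum appears, the loop produces exactly the same softmax-weighted output as the naive computation while only ever holding one `block_size × block_size` score tile in fast memory.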

Tags: transformer, attention, efficiency, flash-attention
