Abstract. Introduces multi-query attention (MQA), which shares a single set of key and value projections across attention heads. MQA dramatically reduces the memory required for the KV cache during autoregressive generation.
Tags: transformer, attention, efficiency, mqa
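To make the mechanism concrete, here is a minimal sketch of an MQA layer, assuming a PyTorch-style module; the class and parameter names (MultiQueryAttention, d_model, n_heads) are illustrative, not taken from the paper. Queries keep one projection per head, while keys and values share a single head-sized projection, so the KV cache shrinks by a factor of n_heads.

```python
# Minimal multi-query attention sketch (illustrative names, PyTorch assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries: one projection per head, as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        # Keys and values: a single shared head-sized projection,
        # so only one K and one V per position need to be cached.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head)  # (b, t, h, d)
        k = self.k_proj(x)                                         # (b, t, d) shared
        v = self.v_proj(x)                                         # (b, t, d) shared
        # Every query head attends over the same shared keys and values.
        scores = torch.einsum("bqhd,bkd->bhqk", q, k) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = torch.einsum("bhqk,bkd->bqhd", attn, v).reshape(b, t, -1)
        return self.out_proj(out)
```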