Abstract. Introduces multi-query attention (MQA), which shares a single set of key and value projections across attention heads. MQA dramatically reduces the memory required for the KV cache during autoregressive generation.
Tags: transformer, attention, efficiency, mqa
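To make the mechanism concrete, here is a minimal sketch of an MQA layer, assuming a PyTorch-style module; the class and parameter names (MultiQueryAttention, d_model, n_heads) are illustrative, not taken from the paper. Queries keep one projection per head, while keys and values share a single head-sized projection, so the KV cache shrinks by a factor of n_heads.

```python
# Minimal multi-query attention sketch (illustrative names, PyTorch assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries: one projection per head, as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        # Keys and values: a single shared head-sized projection,
        # so only one K and one V per position need to be cached.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head)  # (b, t, h, d)
        k = self.k_proj(x)                                         # (b, t, d) shared
        v = self.v_proj(x)                                         # (b, t, d) shared
        # Every query head attends over the same shared keys and values.
        scores = torch.einsum("bqhd,bkd->bhqk", q, k) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = torch.einsum("bhqk,bkd->bqhd", attn, v).reshape(b, t, -1)
        return self.out_proj(out)
```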