References

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu & Tri Dao (2023)

arXiv:2312.00752.

URL: https://arxiv.org/abs/2312.00752

Abstract. Introduces Mamba, the first attention-free sequence model to be competitive with Transformers at scale. Builds on structured state-space models (S4) by making the state-space parameters input-dependent ("selective"), which lets the model focus on or ignore information in a content-aware way. Evaluates the resulting recurrence with a hardware-aware parallel scan, so training retains the linear-time cost of the underlying recurrence. Matches Transformer perplexity on language modelling at the 1B-parameter scale and outperforms attention-based baselines on long-sequence tasks such as genomics and audio. Has since become the leading architectural alternative to attention (2024-2026).
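The core mechanism the abstract describes is a first-order linear recurrence, h_t = a_t ⊙ h_{t-1} + b_t, whose coefficients are functions of the current input, evaluated in parallel as an associative scan. Below is a minimal JAX sketch under stated assumptions: a diagonal state matrix, softplus-discretized step sizes, and hypothetical projection weights (W_delta, W_B, W_C are illustrative names, not the paper's). The actual implementation is a fused CUDA kernel; this reference form only shows the algorithmic shape.

```python
import jax
import jax.numpy as jnp

def selective_scan(x, A_log, W_delta, W_B, W_C):
    """Reference (unfused) sketch of a selective-SSM scan.

    x: (L, D) input sequence; A_log: (D, N) log of the negated diagonal
    state matrix; W_delta: (D, D), W_B / W_C: (D, N) hypothetical
    input projections (names are illustrative, not from the paper).
    """
    delta = jax.nn.softplus(x @ W_delta)   # (L, D) input-dependent step size
    B = x @ W_B                            # (L, N) input-dependent input map
    C = x @ W_C                            # (L, N) input-dependent output map
    A = -jnp.exp(A_log)                    # (D, N) fixed diagonal dynamics

    # Zero-order-hold discretization, per time step and channel.
    a = jnp.exp(delta[:, :, None] * A)                     # (L, D, N)
    b = delta[:, :, None] * B[:, None, :] * x[:, :, None]  # (L, D, N)

    # h_t = a_t * h_{t-1} + b_t expressed as an associative combine,
    # so the whole recurrence evaluates as a parallel prefix scan.
    def combine(e1, e2):
        a1, b1 = e1
        a2, b2 = e2
        return a2 * a1, a2 * b1 + b2

    _, h = jax.lax.associative_scan(combine, (a, b))  # h: (L, D, N) states
    return jnp.einsum('ldn,ln->ld', h, C)             # (L, D) outputs
```

Because the state matrix is diagonal, the combine is purely elementwise, so the scan runs in O(log L) depth on parallel hardware while sequential cost stays linear in L; this is what lets training keep the recurrence's linear-time complexity.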

Tags: sequence-models state-space-models transformers
