Mamba, introduced by Albert Gu and Tri Dao in late 2023, is a state-space model (SSM) architecture for sequence modelling that achieves Transformer-quality language modelling at linear time complexity in sequence length, versus the Transformer's quadratic. It is the most prominent in a recent line of post-Transformer architectures aimed at the long-context regime.
State-space models
A continuous SSM evolves a hidden state $h(t) \in \mathbb{R}^N$ according to a linear ordinary differential equation:
$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t).$$
Discretising in time yields a recurrence $h_t = \bar A h_{t-1} + \bar B x_t,\ y_t = C h_t$, equivalent to a long convolution against a structured kernel $K = (C\bar B,\ C\bar A \bar B,\ C\bar A^2 \bar B,\ \dots)$. The structured state space (S4) family that Gu had developed since 2021 used carefully parameterised $A$ matrices (HiPPO initialisation, diagonal-plus-low-rank structure) so that this convolution can be computed efficiently in the frequency domain.
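To make the two views concrete, here is a minimal sketch in JAX that checks the recurrent and convolutional computations agree for a time-invariant SSM. The toy sizes, diagonal $A$, and zero-order-hold discretisation are illustrative assumptions, not the S4 parameterisation:

```python
import jax.numpy as jnp

# A toy time-invariant SSM (sizes and values are illustrative, not from the paper).
N, L, dt = 4, 16, 0.1
a = jnp.linspace(-1.0, -0.1, N)          # diagonal of a stable A
A = jnp.diag(a)
B = jnp.ones((N, 1))
C = jnp.ones((1, N))

# Zero-order-hold discretisation: A_bar = exp(dt A), B_bar = A^{-1}(A_bar - I) B.
A_bar = jnp.diag(jnp.exp(dt * a))
B_bar = jnp.linalg.inv(A) @ (A_bar - jnp.eye(N)) @ B

x = jnp.sin(0.3 * jnp.arange(L))         # input sequence, shape (L,)

# Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t.
def recurrence(x):
    h, ys = jnp.zeros((N, 1)), []
    for t in range(L):
        h = A_bar @ h + B_bar * x[t]
        ys.append((C @ h)[0, 0])
    return jnp.array(ys)

# Convolutional view: y_t = sum_k K_k x_{t-k} with kernel K_k = C A_bar^k B_bar.
K = jnp.array([(C @ jnp.linalg.matrix_power(A_bar, k) @ B_bar)[0, 0] for k in range(L)])
y_conv = jnp.array([jnp.sum(K[:t + 1] * x[:t + 1][::-1]) for t in range(L)])

print(jnp.allclose(recurrence(x), y_conv, atol=1e-5))   # True: the two views agree
```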
The selection mechanism
The key innovation of Mamba is the selection mechanism: the step size $\Delta$ and the matrices $B$ and $C$ are computed from the current input, so the discretised parameters $\bar A, \bar B$ (and the output map $C$) vary from token to token, allowing the model to selectively propagate or forget information. Earlier SSMs were time-invariant: fast, but expressively limited, and they struggled on tasks requiring content-based reasoning. Selective SSMs trade away the convolutional view (the kernel is no longer fixed) for the expressiveness needed to match Transformers on language tasks.
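As a rough illustration of what "input-dependent" means in practice, $\Delta$, $B$ and $C$ can be produced by linear projections of the current token. The JAX fragment below sketches one channel of such a selective recurrence; the projection names, shapes, and simplified discretisation of $B$ are assumptions for illustration, not the exact Mamba block, which adds a local convolution, gating, and per-channel structure:

```python
import jax
import jax.numpy as jnp

# A minimal single-channel sketch of the selection mechanism (illustrative only).
D, N, L = 8, 4, 16                            # embedding width, state size, sequence length
key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
W_delta = 0.1 * jax.random.normal(k1, (D,))   # projections that make Delta, B, C input-dependent
W_B = 0.1 * jax.random.normal(k2, (N, D))
W_C = 0.1 * jax.random.normal(k3, (N, D))
A = -jnp.exp(jax.random.normal(k4, (N,)))     # fixed diagonal A, kept negative (stable)
xs = jax.random.normal(k5, (L, D))            # token embeddings
us = xs[:, 0]                                 # the scalar stream this SSM channel reads

def step(h, inputs):
    x_t, u_t = inputs
    delta = jax.nn.softplus(W_delta @ x_t)    # per-token step size > 0
    B_t, C_t = W_B @ x_t, W_C @ x_t           # per-token input/output matrices
    A_bar = jnp.exp(delta * A)                # discretised A now varies with the token
    h = A_bar * h + delta * B_t * u_t         # selectively forget old state, write new input
    return h, C_t @ h                         # output y_t

_, ys = jax.lax.scan(step, jnp.zeros(N), (xs, us))
print(ys.shape)                               # (L,)
```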
Because the recurrence is now time-varying, it can no longer be evaluated as a fixed convolution via the FFT. Instead Mamba uses a selective-scan kernel that fuses the recurrent computation in a parallel-friendly way on the GPU, exploiting the associative structure of the scan and avoiding materialisation of the full hidden-state sequence in HBM. The result is high throughput and linear scaling: a context of one million tokens costs no more per token than a context of a thousand.
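The reason a parallel scan applies is that each step $h_t = \bar A_t h_{t-1} + \bar B_t x_t$ is an affine map, and composition of affine maps is associative. The sketch below demonstrates the idea with `jax.lax.associative_scan` on a scalar recurrence; it shows only the mathematics, not Mamba's fused CUDA kernel:

```python
import jax
import jax.numpy as jnp

# h_t = a_t * h_{t-1} + b_t composes affine maps, so it admits a log-depth parallel scan.
def combine(left, right):
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r        # compose h -> a_r * (a_l * h + b_l) + b_r

L = 512
a = jax.random.uniform(jax.random.PRNGKey(0), (L,), minval=0.5, maxval=0.95)  # per-step decay
b = jax.random.normal(jax.random.PRNGKey(1), (L,))                            # per-step input term

# Parallel scan: the second component is h_t for a zero initial state.
_, h_parallel = jax.lax.associative_scan(combine, (a, b))

# Sequential reference for comparison.
def step(h, ab):
    a_t, b_t = ab
    h = a_t * h + b_t
    return h, h
_, h_seq = jax.lax.scan(step, jnp.zeros(()), (a, b))

print(jnp.allclose(h_parallel, h_seq, atol=1e-3))    # True: parallel and sequential agree
```

In Mamba, the per-step decay and input terms are themselves input-dependent (the selection mechanism above), and the production kernel additionally recomputes intermediate states in the backward pass rather than storing them.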
Mamba-2 and successors
Mamba-2 (Dao and Gu, 2024) connects SSMs more closely to attention via a structured state-space duality (SSD), recasting the model as a structured matrix transformation that admits both recurrent and matrix-multiplication views. Mamba-2 is faster to train at scale and unifies a large family of efficient sequence-mixing primitives.
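A hedged sketch of the duality in the scalar-decay case SSD considers: the same selective recurrence can be written as multiplication by a lower-triangular semiseparable matrix, which is the attention-like matrix view. The dense $L \times L$ matrix here is materialised only for illustration; Mamba-2 works block-wise and never forms it explicitly:

```python
import jax
import jax.numpy as jnp

# Recurrent vs. matrix view of a scalar-decay selective SSM (toy sizes, illustrative only).
L, N = 6, 4
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
a = jax.random.uniform(k1, (L,), minval=0.5, maxval=1.0)   # per-token scalar decay a_t
B = jax.random.normal(k2, (L, N))                           # per-token B_t
C = jax.random.normal(k3, (L, N))                           # per-token C_t
x = jax.random.normal(k4, (L,))                             # scalar input stream

# Recurrent view: h_t = a_t h_{t-1} + B_t x_t,  y_t = C_t . h_t.
def step(h, inp):
    a_t, B_t, C_t, x_t = inp
    h = a_t * h + B_t * x_t
    return h, C_t @ h
_, y_recur = jax.lax.scan(step, jnp.zeros(N), (a, B, C, x))

# Matrix view: M[t, s] = (C_t . B_s) * prod_{k=s+1..t} a_k for t >= s, else 0.
log_cum = jnp.cumsum(jnp.log(a))
decay = jnp.exp(log_cum[:, None] - log_cum[None, :])        # decay[t, s] = prod_{k=s+1..t} a_k
M = (C @ B.T) * decay * jnp.tril(jnp.ones((L, L)))
y_matrix = M @ x

print(jnp.allclose(y_recur, y_matrix, atol=1e-4))           # True: the two views agree
```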
The wider field
Mamba sits within a broader family of sub-quadratic sequence models, alongside RWKV (Bo Peng et al.), RetNet (Microsoft Research), Hyena (Stanford), and gated linear attention (GLA); together they represent the most serious challenge to the Transformer's dominance to emerge since 2017. Each replaces softmax attention with a sub-quadratic primitive while attempting to retain in-context-learning ability.
As of 2025, hybrid architectures that interleave attention and SSM blocks (Jamba from AI21, Zamba, Samba, and Nvidia's hybrid models) are the most promising practical direction. Pure Transformers remain dominant in frontier closed models for now, but Mamba-style blocks have appeared in production code models and edge-deployable models, where long context and memory bandwidth dominate cost.
Related terms: Transformer, State-Space Model, Attention Mechanism
Discussed in:
- Chapter 11: CNNs, Beyond the Transformer