Mamba is built on structured state-space models (SSMs). A continuous-time SSM with input $x(t)$, state $h(t) \in \mathbb{R}^N$ and output $y(t)$ is
$$h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)$$
Discretised with timestep $\Delta$ (zero-order hold):
$$\bar A = e^{\Delta A}, \quad \bar B = (\Delta A)^{-1}(e^{\Delta A} - I) \cdot \Delta B$$
$$h_t = \bar A h_{t-1} + \bar B x_t, \quad y_t = C h_t$$
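As a concrete illustration, here is a minimal NumPy sketch of the zero-order-hold discretisation and the resulting recurrence, assuming a diagonal $A$ (as practical SSM implementations use); the names and sizes are illustrative, not the reference implementation:

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretisation for a diagonal continuous SSM.

    A:     (N,) diagonal of the state matrix (negative for stability)
    B:     (N,) input vector (one input channel)
    delta: scalar timestep
    Returns (A_bar, B_bar) for h_t = A_bar * h_{t-1} + B_bar * x_t.
    """
    A_bar = np.exp(delta * A)
    # (Delta A)^{-1} (exp(Delta A) - I) * Delta B, elementwise since A is diagonal
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

# toy run: state size N = 4, fixed parameters (the non-selective case)
N = 4
A = -np.arange(1.0, N + 1)          # stable diagonal
B = np.ones(N)
C = np.ones(N) / N
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)

h = np.zeros(N)
for x_t in [1.0, 0.5, -0.2]:        # a short scalar input sequence
    h = A_bar * h + B_bar * x_t
    y_t = C @ h                     # per-step output
```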
For standard SSMs the parameters $\bar A, \bar B, C$ and the timestep $\Delta$ are fixed across time (the system is linear time-invariant). Mamba's innovation is selectivity: $B, C, \Delta$ depend on the current input $x_t$ via small projections:
$$B_t = W_B x_t, \quad C_t = W_C x_t, \quad \Delta_t = \mathrm{softplus}(W_\Delta x_t + b_\Delta)$$
This input-dependence allows the model to selectively propagate or forget information based on the content of the current token, recovering the expressiveness needed to match Transformers on language modelling tasks. (The earlier S4 family was input-independent, i.e. linear time-invariant, and consequently weaker on content-dependent tasks such as selective copying.)
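A minimal sketch of these projections for a single token, assuming a token embedding $x_t \in \mathbb{R}^D$; the weight names are hypothetical, and the real model broadcasts $\Delta$ per channel through a low-rank projection:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_params(x_t, W_B, W_C, W_delta, b_delta):
    """Input-dependent SSM parameters for one token.

    x_t:      (D,) token embedding
    W_B, W_C: (N, D) projection matrices
    W_delta:  (D,) projection vector; b_delta: scalar bias
    Returns B_t, C_t (each (N,)) and a positive scalar timestep delta_t.
    """
    B_t = W_B @ x_t
    C_t = W_C @ x_t
    delta_t = softplus(W_delta @ x_t + b_delta)
    return B_t, C_t, delta_t
```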
Selective scan: the recurrence
$$h_t = \bar A_t h_{t-1} + \bar B_t x_t$$
is computed by an associative scan (a parallel prefix sum over an associative operator) using a custom hardware-aware GPU kernel, in time and memory linear in the sequence length $T$. This gives Mamba $O(T)$ complexity in sequence length, compared to the Transformer's quadratic $O(T^2)$ attention.
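A sequential NumPy reference for the recurrence, with the per-step parameters assumed precomputed (this loop only defines the semantics that the fused parallel-scan kernel reproduces; the names are illustrative):

```python
import numpy as np

def selective_scan_ref(u, A_bar, B_bar, C):
    """Sequential reference for h_t = A_bar[t] * h_{t-1} + B_bar[t] * u[t],
    y_t = C[t] @ h_t, for one channel with diagonal state transitions.

    u:     (T,)   scalar inputs
    A_bar: (T, N) discretised, input-dependent transitions (diagonal entries)
    B_bar: (T, N) discretised, input-dependent input vectors
    C:     (T, N) input-dependent output vectors
    """
    T, N = A_bar.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        h = A_bar[t] * h + B_bar[t] * u[t]
        y[t] = C[t] @ h
    return y

# toy shapes: T = 6 steps, N = 4 state dims
T, N = 6, 4
y = selective_scan_ref(np.random.randn(T),
                       np.exp(-np.random.rand(T, N)),   # stable transitions in (0, 1)
                       np.random.randn(T, N),
                       np.random.randn(T, N))
```

The parallelisation works because the pairs $(\bar A_t, \bar B_t x_t)$ combine under the associative operator $(a, b) \circ (a', b') = (a'a,\; a'b + b')$, so a prefix scan over these pairs yields every $h_t$ at once.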
Mamba block: input projection → 1D convolution (small kernel, for local mixing) → SiLU activation → selective SSM → gated output (elementwise multiplication with a parallel SiLU-activated projection of the input) → output projection. Layer normalisation and residual connections wrap the block as in a Transformer.
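Putting the pieces together, a sketch of one block's forward pass under these assumptions (the parameter names and the `selective_ssm` callable are hypothetical stand-ins, and layer normalisation is omitted for brevity):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(X, W_in, W_gate, conv_k, W_out, selective_ssm):
    """Sketch of one Mamba block's forward pass (names and shapes illustrative).

    X:             (T, D) token embeddings
    W_in, W_gate:  (D, D_inner) input and gate projections
    conv_k:        (K, D_inner) depthwise causal convolution kernel
    W_out:         (D_inner, D) output projection
    selective_ssm: callable mapping (T, D_inner) -> (T, D_inner),
                   e.g. the selective scan sketched above applied per channel
    """
    Xp = X @ W_in                                # input projection
    Z = X @ W_gate                               # parallel gate path

    # depthwise causal 1D convolution over time for local mixing
    K = conv_k.shape[0]
    Xc = np.zeros_like(Xp)
    for t in range(Xp.shape[0]):
        window = Xp[max(0, t - K + 1): t + 1]    # last <= K timesteps
        Xc[t] = (window * conv_k[-window.shape[0]:]).sum(axis=0)

    Y = selective_ssm(silu(Xc))                  # selective SSM over the sequence
    Y = Y * silu(Z)                              # gated output
    return X + Y @ W_out                         # residual connection
```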
Empirical performance: Mamba matches or exceeds Transformer language-modelling quality at the same parameter count up to mid-sized models (1-3B parameters), with linear rather than quadratic scaling in context length. Above this scale, hybrid architectures (e.g. Jamba, Zamba) interleaving Mamba and attention blocks have been the most successful, and pure Mamba has not yet displaced the Transformer at frontier scale.
Mamba-2 (Dao & Gu, 2024) connects SSMs to attention via a structured state-space duality, showing that SSMs and attention are two views of the same underlying computation. The result has been a productive intellectual unification and has motivated continued architectural exploration at the SSM/attention boundary (RetNet, RWKV, Hyena, S5, and others).
Related terms: Mamba, albert-gu, tri-dao, Transformer, State-Space Model
Discussed in:
- Chapter 13: Attention & Transformers