Glossary

State-Space Model

A state-space model (SSM) describes a sequence via a continuous-time linear dynamical system

$$h'(t) = A\, h(t) + B\, x(t), \qquad y(t) = C\, h(t) + D\, x(t),$$

with hidden state $h(t) \in \mathbb{R}^N$, input $x(t)$, output $y(t)$, and learned (or structured) matrices $A, B, C, D$. Discretised with timestep $\Delta$ (zero-order hold or bilinear),

$$h_t = \bar A\, h_{t-1} + \bar B\, x_t, \qquad y_t = C\, h_t,$$

where $\bar A = \exp(\Delta A)$ and $\bar B = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$.
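
As a concrete illustration, here is a minimal NumPy/SciPy sketch of the zero-order-hold discretisation and the resulting recurrence. The matrices, step size and input sequence are arbitrary placeholders, not values from any published model.

```python
import numpy as np
from scipy.linalg import expm

# Toy dimensions and parameters (arbitrary, for illustration only).
N = 4                                   # state dimension
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # stable continuous-time A
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
dt = 0.01                               # timestep Delta

# Zero-order-hold discretisation:
#   A_bar = exp(dt * A)
#   B_bar = (dt * A)^{-1} (exp(dt * A) - I) * dt * B
A_bar = expm(dt * A)
B_bar = np.linalg.solve(dt * A, A_bar - np.eye(N)) @ (dt * B)

# Discrete recurrence: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
x = rng.standard_normal(100)            # scalar input sequence
h = np.zeros((N, 1))
ys = []
for x_t in x:
    h = A_bar @ h + B_bar * x_t
    ys.append((C @ h).item())
```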

State-space models are the canonical formalism of classical control theory (Kalman 1960), signal processing (linear time-invariant systems) and time-series analysis (ARIMA models can be written in state-space form, and hidden Markov models are SSMs with a discrete state). Their re-emergence as a competitor to the Transformer for sequence modelling is one of the major architectural stories of the 2020s.

Neural state-space models

  • HIPPO (Gu et al. 2020) gave a principled initialisation of $A$ that compresses the input history into the state via projection onto orthogonal polynomials (Legendre polynomials for uniformly weighted memory, Laguerre for exponentially decaying memory). The HIPPO matrix has a closed-form structure that makes long-range memory tractable.
  • S4, the Structured State Space sequence model (Gu, Goel & Ré 2022), is a deep-learning layer whose HIPPO-initialised $A$ is structured as DPLR (diagonal plus low rank), giving long-range memory at near-linear time complexity. S4 set state-of-the-art on the Long Range Arena benchmark, decisively beating Transformers on sequences of up to 16 k tokens.
  • S5 (Smith et al. 2023) simplified S4 by using parallel scans on a fully diagonal state matrix.
  • Mamba (Gu & Dao 2023) added input-dependent $\bar B, C, \Delta$: a selective SSM in which the system parameters are functions of the current input, allowing the model to selectively remember or forget (a simplified sketch of this parameterisation follows the list). Mamba matches Transformer language-modelling quality at linear cost in sequence length and constant memory at inference.
  • Mamba-2 (Dao & Gu 2024) connects SSMs to attention via structured state-space duality (SSD), showing that selective SSMs and a particular class of linear attention are mathematically equivalent.
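
To make the "selective" idea concrete, the following is a heavily simplified NumPy sketch of input-dependent $\Delta$, $B$ and $C$ with a diagonal $A$. The projection names (W_delta, W_B, W_C) are made up for illustration, and Mamba's gating branch, causal convolution and hardware-aware scan are all omitted.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

# Toy sizes (illustrative only): D input channels, state size N per channel, T steps.
D, N, T = 8, 16, 50
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))

# Fixed diagonal A (negative for stability); Delta, B, C are produced by
# hypothetical learned projections of the current input.
A = -np.exp(rng.standard_normal(N))            # (N,)  diagonal state matrix
W_delta = 0.1 * rng.standard_normal((D, D))
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))

h = np.zeros((D, N))                           # one length-N state per channel
ys = []
for t in range(T):
    delta = softplus(x[t] @ W_delta)           # (D,)  input-dependent step size
    B_t = x[t] @ W_B                           # (N,)  input-dependent input matrix
    C_t = x[t] @ W_C                           # (N,)  input-dependent output matrix
    A_bar = np.exp(delta[:, None] * A)         # (D, N) per-step discretised A
    B_bar = delta[:, None] * B_t               # (D, N) simplified discretised B
    h = A_bar * h + B_bar * x[t][:, None]      # selectively update each channel's state
    ys.append(h @ C_t)                         # (D,)  per-channel output
ys = np.stack(ys)                              # (T, D)
```

Because $\Delta$, $B$ and $C$ now change at every step, the kernel of the convolutional view is no longer fixed, which is why Mamba relies on a parallel scan instead (see the next subsection).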

Three computational views

The same SSM admits three equivalent computations, each useful in a different setting:

  • Recurrent view: $O(1)$ time per token (independent of $T$) and $O(N)$ memory. Used at inference, where tokens arrive one at a time.
  • Convolutional view: the input–output map is convolution by a kernel of length $T$, $y_t = \sum_{k=0}^{t} C \bar A^{k} \bar B\, x_{t-k}$, computable in $O(T \log T)$ via the FFT during training. This was the key insight of S4 (see the sketch after this list).
  • Parallel scan: associative-scan algorithms (Blelloch 1990) compute the full recurrence in $O(\log T)$ depth on parallel hardware. Used in Mamba's selective layers, where input-dependent parameters preclude the convolutional view.
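
A small NumPy sketch of the first two views for a fixed (non-selective) SSM, checking that the step-by-step recurrence and the FFT-based convolution with kernel $K_k = C \bar A^{k} \bar B$ give the same output. The parameters are arbitrary placeholders.

```python
import numpy as np

# Tiny LTI SSM with fixed, already-discretised parameters (illustration only).
N, T = 4, 256
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(T)

# 1. Recurrent view: one state update per token.
h = np.zeros((N, 1))
y_rec = np.empty(T)
for t in range(T):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# 2. Convolutional view: materialise the kernel K_k = C A_bar^k B_bar, k = 0..T-1,
#    then compute y = K * x with an FFT in O(T log T).
K = np.empty(T)
Ak_B = B_bar.copy()
for k in range(T):
    K[k] = (C @ Ak_B).item()
    Ak_B = A_bar @ Ak_B

n = 2 * T                                  # zero-pad to avoid circular wrap-around
y_conv = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(x, n), n)[:T]

assert np.allclose(y_rec, y_conv)          # the two views agree
```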

Modern relevance

SSMs are the most prominent challenger to the Transformer architecture as of 2025. Their advantages (linear scaling in context length, constant memory at inference, hardware-friendly recurrence) make them attractive for very long sequences: genomics, audio, raw video, code repositories. Hybrid Transformer/SSM architectures such as Jamba (AI21, 2024), Zamba (Zyphra, 2024), RecurrentGemma (DeepMind, 2024) and Samba (Microsoft, 2024) interleave attention with SSM layers to combine the strengths of both, often outperforming pure-attention models at long context.

Related terms: Mamba, Transformer, Recurrent Neural Network, Kalman Filter
