13.15 Linear attention: Linformer, Performer, RetNet, RWKV, Mamba
§13.14 took the quadratic cost as a fixed problem and asked how cleverly the GPU could be coaxed into paying it. FlashAttention paid the full $O(n^2)$ in arithmetic but avoided ever materialising the $n \times n$ score matrix in high-bandwidth memory; the result was an exact attention layer that ran several times faster than a naive implementation. That is the IO-aware exact branch.
This section takes the opposite tack. What if we accept a slightly different layer, one that is no longer mathematically identical to softmax attention, in exchange for genuine $O(n)$ scaling in the sequence length? This is the linear-attention branch. The motivation is straightforward: even with FlashAttention, doubling the context still quadruples the compute and memory traffic. For very long sequences (whole books, hour-long audio, megabase DNA, hours of robot telemetry), quadratic cost dominates everything, and the constants FlashAttention saves do not change the asymptote.
Linear-attention methods are not a single algorithm but a family. Each member starts from a different observation about what softmax attention is really doing, then designs a cheaper operator that captures enough of the same behaviour. Linformer projects the keys and values into a smaller fixed-size summary. Performer rewrites the softmax kernel as an inner product of randomised feature maps and exploits associativity to factor the computation. RetNet expresses attention as a recurrence with a parallel matrix form for training and a recurrent state form for inference. RWKV is a similar reformulation that sits closer to the RNN family. Mamba is not strictly an attention variant at all; it is a selective state-space model that performs the same role as an attention layer in a Transformer block but uses entirely different machinery.
None of these methods has fully replaced softmax attention in frontier models. But several are practical for long contexts, and the active research front-runner, Mamba and its hybrids, is genuinely competitive with attention at moderate scales and dominates it for long-range tasks where the quadratic cost would be prohibitive.
The general approach
Softmax attention computes $\operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k}) \mathbf{V}$. The expensive object is the score matrix $\mathbf{Q}\mathbf{K}^\top \in \mathbb{R}^{n \times n}$, which is constructed in full before the softmax can be applied row-wise. Two structural facts make the cost unavoidable in the exact form: the row-wise softmax couples every score in a row, and because that non-linearity sits between $\mathbf{Q}\mathbf{K}^\top$ and $\mathbf{V}$, the product cannot be reassociated as $\mathbf{Q}(\mathbf{K}^\top \mathbf{V})$.
The general trick of linear attention is to replace the softmax kernel with one that can be written as an inner product of feature maps, $K(\mathbf{q}, \mathbf{k}) \approx \langle \phi(\mathbf{q}), \phi(\mathbf{k}) \rangle$. Once that substitution is in place, the layer becomes
$$ \operatorname{LinAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \phi(\mathbf{Q}) \big( \phi(\mathbf{K})^\top \mathbf{V} \big), $$
with a normalising denominator that follows the same pattern. The inner factor $\phi(\mathbf{K})^\top \mathbf{V}$ has shape $r \times d_v$ where $r$ is the feature dimension, and computing it costs $O(n r d_v)$. Multiplying the result by $\phi(\mathbf{Q})$ adds another $O(n r d_v)$. Crucially, neither step ever forms an $n \times n$ matrix, so memory and compute are linear in $n$.
The same idea has a recurrent reading. If $\phi$ is fixed, $\phi(\mathbf{K})^\top \mathbf{V} = \sum_{i=1}^n \phi(\mathbf{k}_i) \mathbf{v}_i^\top$ is a running sum over the sequence, a state of size $r \times d_v$ that can be updated one token at a time. At inference the layer is a finite-state recurrence; at training the same computation runs as a parallel scan or a matrix multiplication. The duality between a parallel matrix form and a sequential state form is the deep reason linear attention is appealing: training stays GPU-friendly, but autoregressive decoding becomes $O(1)$ per token instead of $O(n)$.
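To make the duality concrete, here is a minimal numpy sketch (illustrative only: the feature map $\operatorname{elu}(x)+1$ and the cumulative-sum parallel form are simple stand-ins, not a specific published method) that computes the same causal linear-attention output twice, once in parallel over the whole sequence and once as a token-by-token recurrence with a fixed-size state:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 8                      # sequence length, key dim, value dim (r = d here)

Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))

def phi(x):
    # illustrative positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

Qf, Kf = phi(Q), phi(K)

# Parallel (training-time) form: causal cumulative sums, never an n x n score matrix.
# (A real implementation chunks this instead of materialising the (n, r, d_v) tensor.)
S_cum = np.cumsum(Kf[:, :, None] * V[:, None, :], axis=0)     # running phi(k_i) v_i^T
z_cum = np.cumsum(Kf, axis=0)                                  # running phi(k_i), the normaliser
out_parallel = (np.einsum('tr,trv->tv', Qf, S_cum)
                / np.einsum('tr,tr->t', Qf, z_cum)[:, None])

# Recurrent (inference-time) form: fixed-size state, O(1) work per generated token.
S, z = np.zeros((d, d_v)), np.zeros(d)
out_recurrent = np.zeros((n, d_v))
for t in range(n):
    S += np.outer(Kf[t], V[t])            # state update: S_t = S_{t-1} + phi(k_t) v_t^T
    z += Kf[t]
    out_recurrent[t] = (Qf[t] @ S) / (Qf[t] @ z)

assert np.allclose(out_parallel, out_recurrent)
```

The recurrent loop touches only an $r \times d_v$ state and an $r$-dimensional normaliser per step, which is exactly why decoding cost stops depending on how many tokens came before.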
The cost of the substitution is expressiveness. Softmax attention can sharply attend to a single token among thousands; a linear approximation cannot, in general, reproduce that delta-function behaviour with a small feature dimension. Empirically, linear-attention models struggle on tasks that need exact retrieval (copying a string, looking up a fact, resolving a long-range coreference) unless the feature dimension is large or the architecture mixes in some genuine attention layers. The methods below differ mainly in how they manage that trade-off.
Linformer
Linformer (Wang et al., 2020) starts from an empirical observation: the attention matrix produced by trained Transformers is approximately low-rank. Most of the variance of $\operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$ lives in a subspace of dimension far smaller than $n$. The authors exploit this by applying learned projections $\mathbf{E}, \mathbf{F} \in \mathbb{R}^{n \times k}$ along the sequence dimension of the keys and values, with $k \ll n$:
$$ \operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \operatorname{softmax}\!\left(\frac{\mathbf{Q} (\mathbf{E}^\top \mathbf{K})^\top}{\sqrt{d_k}}\right) (\mathbf{F}^\top \mathbf{V}). $$
The score matrix is now $n \times k$, and the layer cost is $O(nk)$, linear in $n$ for fixed $k$.
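A minimal sketch of the shape bookkeeping, with random matrices standing in for the learned projections $\mathbf{E}$ and $\mathbf{F}$ and sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d_k, d_v = 1024, 128, 64, 64        # sequence length, projected length, head dims

Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
E = rng.normal(size=(n, k)) / np.sqrt(n)  # learned in Linformer; random stand-ins here
F = rng.normal(size=(n, k)) / np.sqrt(n)

K_proj = E.T @ K                          # (k, d_k): keys summarised along the sequence axis
V_proj = F.T @ V                          # (k, d_v)
scores = Q @ K_proj.T / np.sqrt(d_k)      # (n, k) score matrix instead of (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V_proj                    # (n, d_v); total cost O(n k)
```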
Linformer's drawback is structural rather than statistical. Because $\mathbf{E}$ and $\mathbf{F}$ are matrices with a fixed first dimension $n$, the model is committed to a single sequence length at training time and cannot extrapolate to longer inputs. Variable-length inputs require padding; very long contexts require retraining with a new $n$. In practice, Linformer is used where input length is naturally bounded (document classification, paragraph-level retrieval, fixed-size protein windows) and where the savings in compute and memory matter more than the ability to handle arbitrary lengths. The low-rank insight has, however, outlived the specific architecture: the observation that trained attention matrices live near a low-rank subspace recurs in the analysis of compression, distillation, and approximate KV-cache schemes.
Performer
Performer (Choromanski et al., 2021) takes the kernel-approximation route. The unnormalised softmax kernel is
$$ K(\mathbf{q}, \mathbf{k}) = \exp(\mathbf{q}^\top \mathbf{k} / \sqrt{d_k}), $$
and the FAVOR+ algorithm at the heart of Performer expresses this as the expectation of an inner product over random features:
$$ K(\mathbf{q}, \mathbf{k}) = \mathbb{E}\big[\phi(\mathbf{q})^\top \phi(\mathbf{k})\big], $$
with $\phi$ a randomised positive-orthogonal feature map of dimension $r$. Substituting the empirical mean of $\phi$ for the expectation gives an unbiased estimator of the kernel, and associativity then yields
$$ \operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \frac{\phi(\mathbf{Q}) (\phi(\mathbf{K})^\top \mathbf{V})}{\phi(\mathbf{Q}) (\phi(\mathbf{K})^\top \mathbf{1})}. $$
The numerator and denominator are both linear in $n$, and the approximation error decays as $1/\sqrt{r}$.
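A minimal sketch of the estimator, using i.i.d. Gaussian projections rather than the orthogonalised ones FAVOR+ prescribes and omitting its numerical-stability shift:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 512, 64, 256                    # sequence length, head dim, number of random features
Q = rng.normal(size=(n, d)) / d**0.25     # fold the 1/sqrt(d_k) temperature into q and k
K = rng.normal(size=(n, d)) / d**0.25
V = rng.normal(size=(n, d))

W = rng.normal(size=(r, d))               # i.i.d. here; FAVOR+ orthogonalises these rows

def phi(X):
    # positive random features with E[phi(q) . phi(k)] = exp(q . k)
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(r)

Qf, Kf = phi(Q), phi(K)
approx = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]   # O(n r d), no n x n matrix

# exact softmax attention, for comparison
P = np.exp(Q @ K.T - (Q @ K.T).max(axis=-1, keepdims=True))
exact = (P / P.sum(axis=-1, keepdims=True)) @ V
print(np.abs(approx - exact).max())       # error shrinks roughly like 1/sqrt(r)
```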
The mathematics is elegant, and Performer is the cleanest example of the kernel-approximation idea, but in practice the layer has not displaced softmax attention in any frontier model. The estimator is unbiased, yet its variance is non-trivial at small $r$, and large $r$ erodes the speed advantage. More importantly, the random feature map produces softer score distributions than true softmax, which hurts on the same exact-retrieval tasks that trouble linear attention generally. Performer remains an important reference point, both as a clean theoretical statement of the linear-attention idea and as a baseline in long-sequence benchmarks.
RetNet
Retentive Networks (RetNet; Sun et al., 2023) reformulate the layer as a recurrence with a fixed-size state. The retention operator is mathematically equivalent to a particular linear-attention variant with a complex-exponential decay and admits two computationally distinct forms. The parallel form looks like attention: it computes a matrix product and then applies a decay mask. The recurrent form looks like an RNN: it carries a $d \times d$ state $\mathbf{S}_t$ and updates it as
$$ \mathbf{S}_t = \gamma \mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top, \qquad \mathbf{o}_t = \mathbf{q}_t \mathbf{S}_t, $$
where $\gamma \in (0, 1)$ is a decay factor.
The two forms compute the same function. Training uses the parallel form to keep the GPU busy; autoregressive inference uses the recurrent form, which costs $O(d^2)$ per token regardless of how many tokens have already been generated. RetNet's promise is the combination of attention-style training throughput with RNN-style inference cost. Reported results show competitive perplexity with same-size Transformers at moderate scales, although replication outside the original lab has been less consistent than for, say, Mamba.
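A decay-only sketch of that equivalence (ignoring RetNet's complex rotations, multi-scale heads, and normalisation) checks that the masked parallel product and the recurrence agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 8, 4, 0.9
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Parallel (training) form: score matrix masked by the decay D[i, j] = gamma^(i - j), i >= j.
idx = np.arange(n)
D = np.where(idx[:, None] >= idx[None, :], gamma ** (idx[:, None] - idx[None, :]), 0.0)
out_parallel = ((Q @ K.T) * D) @ V

# Recurrent (inference) form: fixed-size state, O(d^2) work per token.
S = np.zeros((d, d))
out_recurrent = np.zeros((n, d))
for t in range(n):
    S = gamma * S + np.outer(K[t], V[t])  # S_t = gamma * S_{t-1} + k_t v_t^T
    out_recurrent[t] = Q[t] @ S           # o_t = q_t S_t

assert np.allclose(out_parallel, out_recurrent)
```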
RWKV
RWKV (Peng et al., 2023) sits closer to the RNN family. Its core layer is a hand-designed mixture of a linear-attention-style channel and a token-shift channel, again admitting a parallelisable training form and an RNN-like inference form. The architecture has been pushed further than most linear-attention variants: open RWKV models exist up to roughly fourteen billion parameters, trained on standard language-modelling corpora.
At those scales, RWKV is competitive with similarly sized Transformers on standard benchmarks (HellaSwag, ARC, PIQA), at significantly lower inference cost per token because each layer carries a fixed-size state rather than a growing KV cache. The remaining gap is mostly on tasks that demand exact retrieval over long contexts, which is precisely the regime where pure linear attention is weakest. RWKV's main contribution may be social as much as technical: it demonstrated that an RNN-shaped architecture trained at scale on modern hardware could remain in the same league as a Transformer, which had become an unfashionable claim by 2023.
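For concreteness, a simplified sketch of the RWKV-4 time-mixing (WKV) recurrence for a vector of channels, omitting the token-shift interpolation, the receptance gate, and the max-shift trick the real kernel uses for numerical stability:

```python
import numpy as np

def wkv(k, v, w, u):
    """Simplified RWKV-4 time-mixing (WKV) recurrence; everything is per channel.

    k, v : (n, c) key and value sequences
    w    : (c,) positive per-channel decay, applied as exp(-w) each step
    u    : (c,) per-channel bonus applied to the current token only
    """
    n, c = k.shape
    a, b = np.zeros(c), np.zeros(c)       # running weighted sums: numerator and denominator
    out = np.zeros((n, c))
    for t in range(n):
        ek = np.exp(k[t])
        out[t] = (a + np.exp(u) * ek * v[t]) / (b + np.exp(u) * ek)
        a = np.exp(-w) * a + ek * v[t]    # decay the past, then add the current token
        b = np.exp(-w) * b + ek
    return out
```

As with RetNet, the state carried between steps has a fixed size per channel, independent of how much context has been consumed.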
Mamba (state-space models)
Mamba (Gu and Dao, 2024) is the most successful non-attention sequence model of the past several years and the one most likely to take a permanent place alongside attention in production systems. It builds on state-space models, linear time-invariant systems
$$ \mathbf{h}_t = \mathbf{A} \mathbf{h}_{t-1} + \mathbf{B} \mathbf{x}_t, \qquad \mathbf{y}_t = \mathbf{C} \mathbf{h}_t, $$
with structured transition matrices $\mathbf{A}$ that allow efficient computation of long convolutions through frequency-domain tricks. Earlier members of the family (S4, S5) showed that such systems could model very long sequences cheaply but were not competitive with Transformers on language.
Mamba's innovation is selectivity: the parameters $\mathbf{A}, \mathbf{B}, \mathbf{C}$ are made input-dependent. Given the current token, the model selectively chooses how strongly to propagate or forget components of the hidden state. This is conceptually similar to the gating in an LSTM, but realised inside a structured state-space framework that admits a hardware-aware parallel-scan implementation on GPUs. The result is a layer that runs in $O(n)$ time and $O(1)$ memory per token at inference, scales to context lengths of millions of tokens, and matches Transformer quality on language modelling at small to moderate scale.
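A minimal sequential sketch of the selective recurrence (shapes and parameterisation simplified and chosen here for illustration; the actual layer uses a particular initialisation of $\mathbf{A}$, a low-rank parameterisation and learned bias for the step size, and a fused hardware-aware scan rather than a Python loop):

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential sketch of a selective (input-dependent) state-space layer.

    x : (n, d) input sequence; A : (d, N) negative reals (diagonal transition per channel);
    W_delta : (d, d), W_B, W_C : (d, N) projections that make the dynamics input-dependent.
    """
    n, d = x.shape
    N = A.shape[1]
    h = np.zeros((d, N))                          # hidden state: one N-vector per channel
    y = np.zeros((n, d))
    for t in range(n):
        delta = np.logaddexp(0.0, x[t] @ W_delta) # softplus: positive, input-dependent step size
        B = x[t] @ W_B                            # input-dependent input projection   (N,)
        C = x[t] @ W_C                            # input-dependent output projection  (N,)
        A_bar = np.exp(delta[:, None] * A)        # discretised transition, entries in (0, 1)
        h = A_bar * h + (delta[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C                              # read the state back out, per channel
    return y

# tiny usage demo with random weights
rng = np.random.default_rng(0)
n, d, N = 32, 8, 16
x = rng.normal(size=(n, d))
A = -np.exp(rng.normal(size=(d, N)))              # negative, so the state decays rather than explodes
y = selective_ssm(x, A,
                  W_delta=0.1 * rng.normal(size=(d, d)),
                  W_B=0.1 * rng.normal(size=(d, N)),
                  W_C=0.1 * rng.normal(size=(d, N)))
```

The selectivity is visible in the loop: a token that drives `delta` towards zero leaves the state nearly untouched, while a large `delta` overwrites it, which is the gating behaviour the prose above describes.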
The architectural debate between Transformers and state-space models is now genuinely open for long-context applications. Mamba clearly wins on raw throughput and memory at long contexts. It loses on tasks that need precise associative recall (copying a long string, multi-hop lookup over a context), where attention's ability to point sharply remains important.
A practical pattern that has emerged is hybrid architectures: stacks that interleave Mamba layers (cheap, long-range, $O(n)$) with attention layers (expressive, $O(n^2)$ but only every few layers). Models such as Jamba and Samba demonstrate that hybrids can outperform either pure architecture on long-context benchmarks, recovering attention's recall while keeping most of Mamba's throughput. As of early 2026, the most credible long-context production systems use either pure attention with FlashAttention plus heavy KV-cache engineering or a Mamba-attention hybrid.
Where these live in 2026
Frontier dense models still use quadratic attention. The combination of FlashAttention, grouped-query attention, paged KV caches, and the entire mature GPU software stack absorbs the quadratic cost up to roughly a million tokens, which covers the contexts production users actually exercise. For those workloads, the engineering investment in attention is a moat, not just an algorithm.
Linear-attention methods have found niches. Linformer and Performer are mostly historical reference points but still appear in long-document classification pipelines and as research baselines. RetNet and RWKV are deployed at modest scale in latency-sensitive inference settings where the fixed-size state is a decisive advantage over a growing KV cache. Mamba and its hybrids are the active frontier, used at scale for genomic sequence modelling, long-form audio, time-series forecasting, and increasingly for language modelling research where context lengths in the millions are required. The practical wager is that for the next several years attention will remain the core primitive, with hybrids absorbing the load when the context grows beyond what FlashAttention can comfortably handle, and with a real prospect that the dominant architecture five years out is neither pure attention nor pure state-space but a carefully tuned mixture.
What you should take away
- Linear-attention methods replace exact softmax attention with approximations that scale as $O(n)$ in sequence length, trading expressiveness for throughput.
- The unifying mathematical idea is to write the attention kernel as an inner product of feature maps, $K(\mathbf{q}, \mathbf{k}) \approx \langle \phi(\mathbf{q}), \phi(\mathbf{k}) \rangle$, then exploit associativity to avoid forming the $n \times n$ score matrix.
- The five canonical methods take different routes: Linformer projects keys and values into a low-rank summary; Performer uses randomised feature maps; RetNet and RWKV recast the layer as a parallel-trainable recurrence; Mamba abandons attention entirely for selective state-space models.
- The persistent weakness of all linear methods is exact retrieval over long contexts; this is why hybrid Mamba-attention stacks have become the most credible long-context architecture in 2026.
- Frontier models still default to quadratic attention with FlashAttention, but the quadratic wall is no longer where the action is; long-context research has moved to state-space models and hybrids.