The RNN-Transducer (RNN-T) was introduced by Graves (2012) as a strict generalisation of CTC that removes its conditional-independence assumption while preserving streaming capability. It became the dominant architecture for on-device ASR after Google deployed it in Gboard and Pixel speech recognition (He et al., 2019).
Three networks. RNN-T factors the conditional distribution through three components (a toy sketch in code follows the list):
- Encoder (transcription network) $f^{\text{enc}}: x_{1:T} \mapsto h^{\text{enc}}_{1:T}$, typically an LSTM, Conformer, or causal Transformer that consumes acoustic frames.
- Prediction network $f^{\text{pred}}: y_{0:u-1} \mapsto h^{\text{pred}}_u$, an autoregressive language model over emitted non-blank labels (LSTM or stateless embedding lookup).
- Joint network $f^{\text{joint}}(h^{\text{enc}}_t, h^{\text{pred}}_u) = W_o \tanh(W_e h^{\text{enc}}_t + W_p h^{\text{pred}}_u)$, projecting to logits over the vocabulary plus blank $\varnothing$.
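As a concrete illustration, here is a minimal NumPy sketch of this factorisation with toy dimensions and random, untrained weights; all sizes and variable names below are illustrative, not drawn from any particular toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)
T, U, D_enc, D_pred, D_joint, V = 6, 3, 8, 8, 16, 10  # toy sizes; V includes blank

# Stand-ins for the encoder and prediction-network outputs.
h_enc = rng.standard_normal((T, D_enc))        # f_enc(x_{1:T})
h_pred = rng.standard_normal((U + 1, D_pred))  # f_pred(y_{0:u-1}) for u = 0..U

# Joint-network parameters (biases omitted, matching the formula above).
W_e = rng.standard_normal((D_joint, D_enc))
W_p = rng.standard_normal((D_joint, D_pred))
W_o = rng.standard_normal((V, D_joint))

# Broadcast the additive combination over the full T x (U+1) lattice.
hidden = np.tanh((h_enc @ W_e.T)[:, None, :] + (h_pred @ W_p.T)[None, :, :])
logits = hidden @ W_o.T                        # (T, U+1, V)

# Log-softmax over the vocabulary: log p(. | t, u) at every lattice node.
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
print(log_probs.shape)                         # (6, 4, 10)
```

Note how the encoder runs once per frame and the prediction network once per label; only the cheap joint network is evaluated at all $T \times (U+1)$ lattice nodes.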
Loss. Let $p(\cdot \mid t, u)$ denote the joint network's output distribution at lattice node $(t, u)$. A lattice is built on a $T \times (U{+}1)$ grid, where $U = |y|$: at node $(t, u)$ the model either emits the next non-blank token (move to $(t, u+1)$) or emits blank (move to $(t+1, u)$). The probability of label sequence $y$ marginalises over all monotonic alignment paths:
$$P(y \mid x) = \sum_{\pi \in \mathcal{A}(T, U)} \prod_{(t,u) \in \pi} p(\pi_{t,u} \mid t, u).$$
A two-dimensional forward-backward dynamic program computes this sum exactly: the recursion itself takes $\mathcal{O}(TU)$ steps, and the total cost including the joint-network evaluations is $\mathcal{O}(TU |\mathcal{V}|)$. The loss is $-\log P(y \mid x)$, and gradients flow into all three networks. A sketch of the forward pass follows.
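A minimal NumPy sketch of that forward recursion, assuming blank has index 0 and a `log_probs` lattice shaped like the one in the previous sketch; the function name is illustrative, and production systems use fused implementations of this loss rather than a Python loop.

```python
import numpy as np

def rnnt_forward_logprob(log_probs, y, blank=0):
    """Log P(y | x) via the RNN-T forward recursion (a sketch, not a fast kernel).

    log_probs: (T, U+1, V) lattice of log p(. | t, u) from the joint network.
    y: the U non-blank label ids of the target sequence.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    assert len(y) == U
    alpha = np.full((T, U + 1), -np.inf)  # alpha[t, u] = log-prob of reaching (t, u)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = []
            if t > 0:  # arrived by emitting blank at (t-1, u)
                terms.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrived by emitting label y[u-1] at (t, u-1)
                terms.append(alpha[t, u - 1] + log_probs[t, u - 1, y[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(terms)
    # Every complete path ends with a final blank from the last lattice node.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]

# Toy usage with a random (untrained) lattice:
rng = np.random.default_rng(0)
T, U, V = 6, 3, 10
logits = rng.standard_normal((T, U + 1, V))
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
print(-rnnt_forward_logprob(log_probs, [3, 1, 4]))  # the loss, -log P(y | x)
```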
Why it beats CTC. Because the prediction network conditions on previously emitted labels, RNN-T captures output dependencies that CTC's conditional-independence assumption discards: plural agreement, word boundaries, repeated digits. Empirically this typically lowers WER by roughly 10-20% relative on conversational speech.
Streaming. With a causal encoder (a unidirectional LSTM, or a Conformer with limited right context), RNN-T emits tokens as audio arrives, with latency bounded by the encoder's lookahead. This is why Apple, Google, Amazon, Microsoft, and Tencent all deploy RNN-T variants for keyboard dictation and voice assistants; Whisper-style full-attention encoder-decoder models must see the whole utterance before decoding and therefore cannot stream.
Variants.
- Stateless prediction network (Ghodsi et al., 2020): replaces the LSTM with an embedding lookup over the last few emitted labels, shrinking the model and cheapening each decode step with little accuracy loss.
- Hybrid Autoregressive Transducer, HAT (Variani et al., 2020): factors out an internal language-model score, enabling cleaner external LM fusion.
- Pruned RNN-T loss (Kuang et al., 2022): restricts the lattice to a narrow band around a cheaply estimated alignment, speeding training fivefold.
- Conformer-Transducer (Gulati et al., 2020): pairs a Conformer encoder with an LSTM prediction network, achieving state-of-the-art WER on LibriSpeech at publication.
Decoding. Beam search proceeds frame-by-frame: at each $t$, a hypothesis extends by emitting non-blank labels (without advancing $t$) until it emits blank; merging hypotheses that share the same label prefix keeps the beam tractable. The beam-size-1 (greedy) special case is sketched below.
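Full prefix-merging beam search is involved; the greedy (beam size 1) special case below shows just the frame-by-frame emission loop. The `pred_step` and `joint_log_probs` callables are illustrative stand-ins wired to random weights, and `max_symbols` is a common guard capping non-blank emissions per frame so decoding always terminates.

```python
import numpy as np

def greedy_decode(h_enc, pred_step, joint_log_probs, blank=0, max_symbols=10):
    """Greedy (beam size 1) RNN-T decoding over encoder outputs h_enc: (T, D)."""
    ys = []
    h_pred, state = pred_step(None, None)       # empty-history step
    for t in range(h_enc.shape[0]):
        for _ in range(max_symbols):            # cap emissions per frame
            k = int(np.argmax(joint_log_probs(h_enc[t], h_pred)))
            if k == blank:
                break                           # blank: advance to the next frame
            ys.append(k)                        # non-blank: emit, stay on frame t
            h_pred, state = pred_step(k, state)
    return ys

# Illustrative stand-ins (random, untrained weights).
rng = np.random.default_rng(0)
D, V = 8, 10
emb = rng.standard_normal((V, D))
W = rng.standard_normal((V, D))

def pred_step(y_prev, state):
    # Stateless stand-in: embed only the last emitted label.
    return (np.zeros(D) if y_prev is None else emb[y_prev]), None

def joint_log_probs(h_t, h_u):
    z = W @ np.tanh(h_t + h_u)
    m = z.max()
    return z - m - np.log(np.exp(z - m).sum())

print(greedy_decode(rng.standard_normal((6, D)), pred_step, joint_log_probs))
```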
Related terms: CTC Loss, Conformer, LSTM, Transformer, Whisper
Discussed in:
- Chapter 12: Sequence Models, Streaming Speech Recognition