The RNN-Transducer (RNN-T) was introduced by Graves (2012) as a strict generalisation of CTC that removes its conditional-independence assumption while preserving streaming capability. It became the dominant architecture for on-device ASR after Google deployed it in Gboard and Pixel speech recognition (He et al., 2019).
Three networks. RNN-T factors the conditional distribution through three components (a toy sketch in code follows the list):
- Encoder (transcription network) $f^{\text{enc}}: x_{1:T} \mapsto h^{\text{enc}}_{1:T}$, typically an LSTM, Conformer, or causal Transformer that consumes acoustic frames.
- Prediction network $f^{\text{pred}}: y_{0:u-1} \mapsto h^{\text{pred}}_u$, an autoregressive language model over emitted non-blank labels (LSTM or stateless embedding lookup).
- Joint network $f^{\text{joint}}(h^{\text{enc}}_t, h^{\text{pred}}_u) = W_o \tanh(W_e h^{\text{enc}}_t + W_p h^{\text{pred}}_u)$, projecting to logits over the vocabulary plus blank $\varnothing$.
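As a concrete illustration, here is a minimal NumPy sketch of this factorisation with toy dimensions and random, untrained weights; all sizes and variable names below are illustrative, not drawn from any particular toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)
T, U, D_enc, D_pred, D_joint, V = 6, 3, 8, 8, 16, 10  # toy sizes; V includes blank

# Stand-ins for the encoder and prediction-network outputs.
h_enc = rng.standard_normal((T, D_enc))        # f_enc(x_{1:T})
h_pred = rng.standard_normal((U + 1, D_pred))  # f_pred(y_{0:u-1}) for u = 0..U

# Joint-network parameters (biases omitted, matching the formula above).
W_e = rng.standard_normal((D_joint, D_enc))
W_p = rng.standard_normal((D_joint, D_pred))
W_o = rng.standard_normal((V, D_joint))

# Broadcast the additive combination over the full T x (U+1) lattice.
hidden = np.tanh((h_enc @ W_e.T)[:, None, :] + (h_pred @ W_p.T)[None, :, :])
logits = hidden @ W_o.T                        # (T, U+1, V)

# Log-softmax over the vocabulary: log p(. | t, u) at every lattice node.
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
print(log_probs.shape)                         # (6, 4, 10)
```

Note how the encoder runs once per frame and the prediction network once per label; only the cheap joint network is evaluated at all $T \times (U+1)$ lattice nodes.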
Loss. Let $p(\cdot \mid t, u)$ denote the joint network's output distribution at lattice node $(t, u)$. A lattice is built on a $T \times (U{+}1)$ grid, where $U = |y|$: at node $(t, u)$ the model either emits the next non-blank token (move to $(t, u+1)$) or emits blank (move to $(t+1, u)$). The probability of label sequence $y$ marginalises over all monotonic alignment paths:
$$P(y \mid x) = \sum_{\pi \in \mathcal{A}(T, U)} \prod_{(t,u) \in \pi} p(\pi_{t,u} \mid t, u).$$
A two-dimensional forward-backward dynamic program computes this sum exactly: the recursion itself takes $\mathcal{O}(TU)$ steps, and the total cost including the joint-network evaluations is $\mathcal{O}(TU |\mathcal{V}|)$. The loss is $-\log P(y \mid x)$, and gradients flow into all three networks. A sketch of the forward pass follows.
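A minimal NumPy sketch of that forward recursion, assuming blank has index 0 and a `log_probs` lattice shaped like the one in the previous sketch; the function name is illustrative, and production systems use fused implementations of this loss rather than a Python loop.

```python
import numpy as np

def rnnt_forward_logprob(log_probs, y, blank=0):
    """Log P(y | x) via the RNN-T forward recursion (a sketch, not a fast kernel).

    log_probs: (T, U+1, V) lattice of log p(. | t, u) from the joint network.
    y: the U non-blank label ids of the target sequence.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    assert len(y) == U
    alpha = np.full((T, U + 1), -np.inf)  # alpha[t, u] = log-prob of reaching (t, u)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = []
            if t > 0:  # arrived by emitting blank at (t-1, u)
                terms.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrived by emitting label y[u-1] at (t, u-1)
                terms.append(alpha[t, u - 1] + log_probs[t, u - 1, y[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(terms)
    # Every complete path ends with a final blank from the last lattice node.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]

# Toy usage with a random (untrained) lattice:
rng = np.random.default_rng(0)
T, U, V = 6, 3, 10
logits = rng.standard_normal((T, U + 1, V))
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
print(-rnnt_forward_logprob(log_probs, [3, 1, 4]))  # the loss, -log P(y | x)
```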
Why it beats CTC. Because the prediction network conditions on previously emitted labels, RNN-T captures output dependencies that CTC's conditional-independence assumption discards: plural agreement, word boundaries, repeated digits. Empirically this typically lowers WER by roughly 10-20% relative on conversational speech.
Streaming. With a causal encoder (a unidirectional LSTM, or a Conformer with limited right context), RNN-T emits tokens as audio arrives, with latency bounded by the encoder's lookahead. This is why Apple, Google, Amazon, Microsoft, and Tencent all deploy RNN-T variants for keyboard dictation and voice assistants; Whisper-style full-attention encoder-decoder models must see the whole utterance before decoding and therefore cannot stream.
Variants.
- Stateless prediction network (Ghodsi et al., 2020): replaces the LSTM with an embedding lookup over the last few emitted labels, shrinking the model and cheapening each decode step with little accuracy loss.
- Hybrid Autoregressive Transducer, HAT (Variani et al., 2020): factors out an internal language-model score, enabling cleaner external LM fusion.
- Pruned RNN-T loss (Kuang et al., 2022): restricts the lattice to a narrow band around a cheaply estimated alignment, speeding training fivefold.
- Conformer-Transducer (Gulati et al., 2020): pairs a Conformer encoder with an LSTM prediction network, achieving state-of-the-art WER on LibriSpeech at publication.
Decoding. Beam search proceeds frame-by-frame: at each $t$, a hypothesis extends by emitting non-blank labels (without advancing $t$) until it emits blank; merging hypotheses that share the same label prefix keeps the beam tractable. The beam-size-1 (greedy) special case is sketched below.
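Full prefix-merging beam search is involved; the greedy (beam size 1) special case below shows just the frame-by-frame emission loop. The `pred_step` and `joint_log_probs` callables are illustrative stand-ins wired to random weights, and `max_symbols` is a common guard capping non-blank emissions per frame so decoding always terminates.

```python
import numpy as np

def greedy_decode(h_enc, pred_step, joint_log_probs, blank=0, max_symbols=10):
    """Greedy (beam size 1) RNN-T decoding over encoder outputs h_enc: (T, D)."""
    ys = []
    h_pred, state = pred_step(None, None)       # empty-history step
    for t in range(h_enc.shape[0]):
        for _ in range(max_symbols):            # cap emissions per frame
            k = int(np.argmax(joint_log_probs(h_enc[t], h_pred)))
            if k == blank:
                break                           # blank: advance to the next frame
            ys.append(k)                        # non-blank: emit, stay on frame t
            h_pred, state = pred_step(k, state)
    return ys

# Illustrative stand-ins (random, untrained weights).
rng = np.random.default_rng(0)
D, V = 8, 10
emb = rng.standard_normal((V, D))
W = rng.standard_normal((V, D))

def pred_step(y_prev, state):
    # Stateless stand-in: embed only the last emitted label.
    return (np.zeros(D) if y_prev is None else emb[y_prev]), None

def joint_log_probs(h_t, h_u):
    z = W @ np.tanh(h_t + h_u)
    m = z.max()
    return z - m - np.log(np.exp(z - m).sum())

print(greedy_decode(rng.standard_normal((6, D)), pred_step, joint_log_probs))
```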
Related terms: CTC Loss, Conformer, LSTM, Transformer, Whisper
Discussed in:
- Chapter 12: Sequence Models, Streaming Speech Recognition