WaveNet (van den Oord et al., DeepMind, September 2016) is a deep generative model of raw audio waveforms that produced the first text-to-speech system to substantially narrow the perceptual gap with human recordings, cutting it by over 50% in mean-opinion-score tests. Its central insight: model the joint distribution over raw 16-bit PCM samples directly, using dilated causal convolutions to give each output sample an exponentially growing receptive field without recurrent dependencies.
Autoregressive factorisation. WaveNet models the joint distribution over a waveform $x = (x_1, \ldots, x_T)$ as
$$p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, h),$$
where $h$ is auxiliary conditioning (linguistic features for TTS, a speaker embedding, or a class label). The output distribution is categorical over 256 bins, obtained by scaling the 16-bit signal to $[-1, 1]$, applying a $\mu$-law companding transform, and quantising the result to 256 levels:
$$f(x_t) = \text{sign}(x_t) \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}, \quad \mu = 255.$$
The training loss is cross-entropy over the 256-way softmax, far easier to optimise than mixture density networks over real-valued samples.
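A minimal sketch of this companding-and-quantisation step, assuming NumPy and samples already scaled to $[-1, 1]$; the function names are illustrative:

```python
import numpy as np

MU = 255  # mu-law parameter for 256 output classes

def mu_law_encode(x, mu=MU):
    """Map float samples in [-1, 1] to integer class labels in [0, 255]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)          # quantise to 0..255

def mu_law_decode(labels, mu=MU):
    """Approximately invert the encoding back to float samples."""
    compressed = 2.0 * labels / mu - 1.0
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

wave = np.sin(np.linspace(0, 2 * np.pi, 16))  # toy waveform
labels = mu_law_encode(wave)                  # targets for the 256-way softmax
recon = mu_law_decode(labels)                 # close to `wave` at audible amplitudes
```

The logarithmic spacing allocates more bins to low-amplitude samples, which is why 256 levels are perceptually adequate for speech.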
Causal convolution. A 1-D convolution is causal if the output at time $t$ depends only on inputs at times $\le t$, implemented by left-padding the input and discarding outputs that would peek ahead. Stacking $L$ causal convolutions of kernel size $k$ gives a receptive field of $L(k-1) + 1$, which grows only linearly with depth, far too slowly to cover useful context at a 16 kHz sample rate.
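A sketch of the left-padding trick, assuming PyTorch; the class name is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Conv1d whose output at time t never depends on inputs after t."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad the past only, never the future
        return self.conv(x)                    # output has the same length as the input

y = CausalConv1d(channels=8)(torch.randn(1, 8, 100))   # y.shape == (1, 8, 100)
```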
Dilated convolution. WaveNet uses a dilation factor $d$ that doubles at each layer ($1, 2, 4, \ldots, 512$ within a stack, then resets and repeats). For dilation $d$ and kernel size $k$, the output is
$$y_t = \sum_{i=0}^{k-1} w_i \cdot x_{t - d \cdot i}.$$
Receptive field grows exponentially: a stack of 10 dilated layers covers 1024 samples (64 ms at 16 kHz), and three such stacks cover ~190 ms, sufficient for phonemic context.
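These numbers follow directly from the context each dilated layer adds; a quick check in plain Python, assuming kernel size 2 and the dilation schedule above:

```python
kernel_size = 2
one_stack = [2 ** i for i in range(10)]          # dilations 1, 2, ..., 512
dilations = one_stack * 3                        # three stacks, matching the text above

# each layer adds dilation * (kernel_size - 1) samples of context
rf_one_stack = 1 + sum(d * (kernel_size - 1) for d in one_stack)
rf_full = 1 + sum(d * (kernel_size - 1) for d in dilations)

print(rf_one_stack, rf_one_stack / 16_000 * 1000)   # 1024 samples, 64.0 ms
print(rf_full, rf_full / 16_000 * 1000)             # 3070 samples, ~191.9 ms
```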
Residual block. Each layer applies the gated activation unit from PixelCNN:
$$z = \tanh(W_{f,k} * x + V_{f,k}^\top h) \odot \sigma(W_{g,k} * x + V_{g,k}^\top h),$$
where $\odot$ is the elementwise product. Residual and skip connections ease optimisation: the residual path feeds the next dilated layer, while the skip outputs are summed across layers and passed through ReLU nonlinearities, two $1 \times 1$ convolutions, and a softmax to produce the categorical distribution over the next sample.
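A sketch of one such block, assuming PyTorch and global conditioning as in the gated-activation equation; channel sizes and names are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveNetBlock(nn.Module):
    def __init__(self, residual_ch, skip_ch, cond_ch, dilation, kernel_size=2):
        super().__init__()
        self.left_pad = dilation * (kernel_size - 1)                  # keep it causal
        self.filter = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.cond_f = nn.Linear(cond_ch, residual_ch)                 # V_f for global h
        self.cond_g = nn.Linear(cond_ch, residual_ch)                 # V_g for global h
        self.res_out = nn.Conv1d(residual_ch, residual_ch, 1)         # 1x1 onto residual path
        self.skip_out = nn.Conv1d(residual_ch, skip_ch, 1)            # 1x1 onto skip path

    def forward(self, x, h):                   # x: (B, residual_ch, T), h: (B, cond_ch)
        padded = F.pad(x, (self.left_pad, 0))
        f = self.filter(padded) + self.cond_f(h).unsqueeze(-1)        # bias broadcast over time
        g = self.gate(padded) + self.cond_g(h).unsqueeze(-1)
        z = torch.tanh(f) * torch.sigmoid(g)                          # gated activation
        return x + self.res_out(z), self.skip_out(z)                  # (residual, skip)

block = WaveNetBlock(residual_ch=64, skip_ch=128, cond_ch=16, dilation=4)
res, skip = block(torch.randn(2, 64, 100), torch.randn(2, 16))        # time length preserved
```

In the full model the skip tensors from every block are summed, passed through the post-processing stack described above, and trained with cross-entropy against the $\mu$-law class labels.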
Conditioning. Global conditioning (speaker ID) adds a single bias to every layer; local conditioning (linguistic features upsampled to audio rate) adds a time-varying bias.
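A toy illustration of the difference, with assumed feature sizes and frame rate; simple repetition is shown for the upsampling, while the paper also describes learned upsampling with transposed convolutions:

```python
import torch

# Global conditioning (assumed sizes): one speaker vector per utterance,
# broadcast as a time-invariant bias inside every layer.
speaker = torch.randn(1, 16)                     # (batch, embedding_dim)
global_bias = speaker.unsqueeze(-1)              # (1, 16, 1) -> broadcasts over all time steps

# Local conditioning (assumed sizes): frame-rate linguistic features upsampled
# to the 16 kHz audio rate, giving a time-varying bias.
frames = torch.randn(1, 80, 50)                  # (batch, feature_dim, num_frames)
samples_per_frame = 200                          # assumed 12.5 ms frames at 16 kHz
local_bias = frames.repeat_interleave(samples_per_frame, dim=-1)   # (1, 80, 10000)
```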
Inference cost. Naive inference requires $\mathcal{O}(T)$ sequential steps; generating one second of 16 kHz audio can take minutes on a GPU. Parallel WaveNet (van den Oord et al., 2018) distils the autoregressive teacher into an inverse-autoregressive-flow student that generates all samples in parallel. WaveRNN, WaveGlow, and MelGAN offered alternative speed-ups; today's neural codecs (EnCodec, SoundStream) have largely replaced WaveNet-style vocoders in token-based audio language models.
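A sketch of why the naive loop is slow, assuming a trained `model` that maps a window of quantised samples plus conditioning `h` to per-step logits (an interface invented here for illustration); real implementations cache intermediate activations, which is omitted:

```python
import torch

@torch.no_grad()
def naive_sample(model, h, n_samples, receptive_field=3070):
    """One full forward pass per emitted sample, strictly sequential."""
    history = torch.zeros(1, 1, receptive_field, dtype=torch.long)   # mu-law class labels
    out = []
    for _ in range(n_samples):                                       # 16,000 iterations per second of audio
        logits = model(history[:, :, -receptive_field:], h)          # assumed shape: (1, 256, T)
        probs = torch.softmax(logits[:, :, -1], dim=-1)              # distribution for the next sample
        nxt = torch.multinomial(probs, 1)                            # draw one class label
        out.append(nxt)
        history = torch.cat([history, nxt.unsqueeze(-1)], dim=-1)    # feed the sample back in
    return torch.cat(out, dim=-1)                                    # labels, to be mu-law decoded
```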
Legacy. WaveNet voices powered Google Assistant from 2017 onward. The dilated-causal-convolution motif influenced PixelCNN++, ByteNet, and Tacotron 2's neural vocoder, and reinforced the idea that purely convolutional models can capture long-range sequential structure, an idea that preceded the Transformer's eventual dominance.
Related terms: Convolution, Convolutional Neural Network, Cross-Entropy Loss, VALL-E, EnCodec
Discussed in:
- Chapter 12: Sequence Models, Audio Generation