Glossary

AudioLM

AudioLM (Borsos, Marinier, Vincent et al., Google, AudioLM: a Language Modeling Approach to Audio Generation, IEEE TASLP 2023) introduced hierarchical token modelling for audio, splitting generation into a slow semantic stream that captures linguistic and prosodic content and a fast acoustic stream that captures fine waveform detail. This factorisation solves the long-standing tension in audio LMs between coherence and fidelity.

Two token streams.

  1. Semantic tokens are obtained by running 16 kHz audio through a self-supervised speech model (w2v-BERT XL) and applying k-means clustering ($K = 1024$) to the 7th-layer activations at 25 Hz. These tokens correlate strongly with phonetic and prosodic content but lose speaker identity and recording acoustics (both streams are sketched in code after this list).

  2. Acoustic tokens come from SoundStream, a neural codec at 50 Hz with 12-level residual vector quantisation (each level 1024 entries). They preserve full waveform detail including speaker timbre and room acoustics.
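To make the shapes and frame rates concrete, here is a minimal, illustrative Python sketch of the two streams. It is not AudioLM's code: random arrays stand in for the w2v-BERT activations and the SoundStream encoder output, and the k-means vocabulary is shrunk from 1024 to fit the toy data.

```python
# Illustrative sketch of the two token streams (not the actual AudioLM pipeline).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
duration_s = 3.0

# --- Semantic tokens: k-means over SSL features at 25 Hz ---
n_sem_frames = int(25 * duration_s)                    # 25 Hz frame rate
ssl_features = rng.normal(size=(n_sem_frames, 1024))   # stand-in for w2v-BERT layer-7 activations
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(ssl_features)  # 64 here; K = 1024 in the paper
semantic_tokens = kmeans.predict(ssl_features)         # shape (75,), one cluster id per frame

# --- Acoustic tokens: residual VQ with 12 codebooks of 1024 entries at 50 Hz ---
n_ac_frames = int(50 * duration_s)                     # 50 Hz frame rate
codec_latents = rng.normal(size=(n_ac_frames, 128))    # stand-in for SoundStream encoder output
codebooks = rng.normal(size=(12, 1024, 128))           # 12 RVQ levels x 1024 entries each

acoustic_tokens = np.zeros((n_ac_frames, 12), dtype=np.int64)
residual = codec_latents.copy()
for level in range(12):
    # pick the nearest codeword at this level, then quantise whatever is left over
    dists = np.linalg.norm(residual[:, None, :] - codebooks[level][None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    acoustic_tokens[:, level] = idx
    residual -= codebooks[level][idx]

print(semantic_tokens.shape, acoustic_tokens.shape)    # (75,) (150, 12)
```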

Three-stage hierarchical model. Three separate decoder-only Transformers are trained:

  • Stage 1: Semantic modelling. Plain autoregressive next-token prediction over the semantic stream:

    $$p(s_{1:T_s}) = \prod_t p(s_t \mid s_{\lt t}).$$

    This stage learns what to say: phonemes, words, sentence-level structure.

  • Stage 2: Coarse acoustic modelling. Conditioned on the full semantic sequence, predict the first 4 of 12 acoustic codebooks autoregressively:

    $$p(a^{1:4}_{1:T_a} \mid s_{1:T_s}) = \prod_t \prod_{j=1}^{4} p(a^j_t \mid a^{1:4}_{\lt t}, a^{1:j-1}_t, s_{1:T_s}).$$

    This captures speaker identity and prosody.

  • Stage 3: Fine acoustic modelling. Conditioned on the coarse codebooks, predict the remaining 8 codebooks (5-12). This stage is local: fine codes mostly depend on coarse codes at the same and nearby timesteps, so the model can use a small context window (the per-stage flattening is sketched after this list).
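The double product in the coarse-stage factorisation maps naturally onto a single flat token sequence. The sketch below shows one way to realise that ordering (frame-major, codebook-minor); the flattening is my reading of the equations above, not code from the paper.

```python
# Illustrative sketch: flattening RVQ codebooks into per-stage target sequences.
import numpy as np

def flatten_codebooks(acoustic_tokens: np.ndarray, levels: slice) -> np.ndarray:
    """Flatten selected RVQ levels frame by frame: a^1_1, ..., a^Q_1, a^1_2, ...

    With this ordering, plain next-token prediction over the flat sequence
    conditions a^j_t on all earlier frames and on levels 1..j-1 of frame t,
    matching the coarse-stage factorisation.
    """
    return acoustic_tokens[:, levels].reshape(-1)

T_a = 150
acoustic_tokens = np.random.default_rng(0).integers(0, 1024, size=(T_a, 12))

coarse_seq = flatten_codebooks(acoustic_tokens, slice(0, 4))   # stage 2 targets, length 4 * T_a
fine_seq = flatten_codebooks(acoustic_tokens, slice(4, 12))    # stage 3 targets, length 8 * T_a
print(coarse_seq.shape, fine_seq.shape)                        # (600,) (1200,)
```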

Each Transformer is trained with the standard cross-entropy next-token loss; within the acoustic stages, the codebooks of each frame are flattened into a single sequence, and the coarse/fine split keeps sequence lengths tractable.
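A minimal PyTorch sketch of that objective, assuming a generic decoder-only model whose output logits are already available (random tensors stand in for them here):

```python
# Next-token cross-entropy on a flattened token sequence (illustrative only).
import torch
import torch.nn.functional as F

vocab_size = 1024
batch, seq_len = 2, 256
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # e.g. a flattened coarse sequence
logits = torch.randn(batch, seq_len, vocab_size)           # stand-in for Transformer output

# predict token t+1 from tokens up to t: shift logits and targets by one position
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```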

Inference. Generation starts with a prompt: a short audio clip is tokenised into semantic and acoustic tokens; the semantic Transformer continues the semantic stream, the coarse model fleshes it out with speaker-consistent acoustic codes, and the fine model adds high-frequency detail. Detokenising the resulting codes through the SoundStream decoder yields a 16 kHz waveform.
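The flow can be sketched end to end as below. The model and codec interfaces here are hypothetical stand-ins (random sampling in place of the three Transformers and the SoundStream decoder); only the shapes and frame rates follow the paper.

```python
# Sketch of the three-stage inference flow with stand-in components.
import numpy as np

rng = np.random.default_rng(0)

def continue_tokens(prompt: np.ndarray, n_new: int, vocab: int) -> np.ndarray:
    """Stand-in for autoregressive sampling from one of the three Transformers."""
    return np.concatenate([prompt, rng.integers(0, vocab, size=n_new)])

# 1. Tokenise a 3 s prompt (25 Hz semantic tokens, 50 Hz acoustic tokens).
prompt_semantic = rng.integers(0, 1024, size=75)        # from w2v-BERT + k-means
prompt_coarse = rng.integers(0, 1024, size=(150, 4))    # SoundStream codebooks 1-4

# 2. Semantic stage: continue the semantic stream (what to say).
semantic = continue_tokens(prompt_semantic, n_new=175, vocab=1024)   # now 10 s at 25 Hz

# 3. Coarse stage: sample codebooks 1-4 conditioned on the semantic stream
#    and the acoustic prompt (speaker identity, prosody). Placeholder codes here.
coarse = rng.integers(0, 1024, size=(500, 4))            # 10 s at 50 Hz

# 4. Fine stage: sample codebooks 5-12 conditioned on the coarse codes. Placeholder.
fine = rng.integers(0, 1024, size=(500, 8))

# 5. Concatenate all 12 codebooks and detokenise with the SoundStream decoder
#    (not shown) to obtain a 16 kHz waveform.
all_codes = np.concatenate([coarse, fine], axis=1)       # shape (500, 12)
print(all_codes.shape)
```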

Capabilities. AudioLM can continue piano improvisations for 30 seconds with stylistic and rhythmic coherence, and continue speech preserving speaker identity, accent, recording acoustics, and prosody, without any text supervision. This textless paradigm showed that language modelling on appropriate discrete units works for arbitrary audio.

Successors. MusicLM (Agostinelli et al., 2023) added MuLan text conditioning for text-to-music generation. SingSong generates instrumental accompaniment for a given vocal recording. SPEAR-TTS combined the AudioLM hierarchy with lightweight text supervision for TTS. VALL-E simplified the hierarchy to two stages (one autoregressive, one non-autoregressive) by operating on EnCodec codes directly. MusicGen compressed generation further into a single stage using a delayed codebook-interleaving pattern.

Related terms: EnCodec, Transformer, Vector Quantisation, VALL-E, MusicGen, wav2vec 2.0
