MusicGen (Copet, Kreuk, Gat et al., Meta AI, Simple and Controllable Music Generation, NeurIPS 2023) is a single-stage Transformer language model that generates 32 kHz music from text descriptions, melody contours, or both. Its central contribution is the delay pattern, a codebook interleaving scheme that lets one Transformer predict all four EnCodec codebook streams jointly, removing the need for AudioLM's three-stage hierarchy without sacrificing quality.
Tokenisation. Audio is encoded with a 32 kHz EnCodec at a 50 Hz frame rate using $K = 4$ residual codebooks of 2048 entries each, i.e. $4 \times 50 = 200$ tokens per second of music. Naively flattening the codebooks into a single autoregressive stream means 200 decoding steps per second, which blows up sequence length; predicting the four streams fully in parallel at each timestep would ignore the residual-quantisation structure, since later codebooks depend on earlier ones at the same step.
Delayed codebook pattern. MusicGen's solution is to delay codebook $k$ by $k-1$ steps. The reordered sequence at step $t$ contains $(c^1_t, c^2_{t-1}, c^3_{t-2}, c^4_{t-3})$, predicted in parallel; after $K-1$ warm-up steps the model emits $K$ tokens per step, recovering full coverage. The information dependency is preserved: when predicting $c^k_t$, the earlier-codebook tokens for timestep $t$ have already been generated and sit at recent input positions, so the model can attend to them. Compared to flattening, this cuts the number of autoregressive steps by a factor of $K$, which is what makes single-stage generation practical.
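As a concrete illustration, here is a minimal NumPy sketch of the delay pattern (helper names and the `PAD` placeholder are hypothetical; MusicGen uses a dedicated special token for positions that are not yet available):

```python
import numpy as np

PAD = -1  # hypothetical placeholder; MusicGen uses a dedicated special token here

def apply_delay(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k (0-indexed) right by k steps, so the column at delayed
    step t holds (c^1_t, c^2_{t-1}, c^3_{t-2}, c^4_{t-3}).

    codes: [K, T] matrix of EnCodec indices.  Returns [K, T + K - 1].
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay(delayed: np.ndarray, K: int) -> np.ndarray:
    """Re-align a delayed [K, T + K - 1] matrix back to [K, T] for EnCodec decoding."""
    T = delayed.shape[1] - K + 1
    return np.stack([delayed[k, k:k + T] for k in range(K)])

# Round trip on a toy K=4, T=6 matrix: each delayed step is one forward pass,
# so 6 frames cost 6 + 3 autoregressive steps instead of 24 when flattened.
codes = np.arange(24).reshape(4, 6)
assert (undo_delay(apply_delay(codes), K=4) == codes).all()
```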
Architecture. A decoder-only Transformer (300 M, 1.5 B, or 3.3 B parameters) with causal self-attention, sinusoidal positional encoding, and pre-LayerNorm. The output projection has a separate head per codebook, since the four codebooks have independent vocabularies. Training loss is the sum over codebooks:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{K} \log p_\theta(c^k_t \mid c_{\lt t}, \text{cond}).$$
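The per-codebook heads and the summed loss can be sketched as follows (a toy PyTorch fragment under assumed dimensions, not the released implementation; in practice the padded warm-up positions introduced by the delay pattern are masked out):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, V, D = 4, 2048, 1024  # codebooks, entries per codebook, model width (illustrative)

# One output head per codebook, since the four codebooks have separate vocabularies.
heads = nn.ModuleList([nn.Linear(D, V) for _ in range(K)])

def musicgen_loss(hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """hidden:  [B, T, D] decoder outputs at the delayed positions.
       targets: [B, K, T] ground-truth EnCodec indices aligned to the delay pattern.
    """
    loss = 0.0
    for k in range(K):
        logits = heads[k](hidden)                              # [B, T, V]
        loss = loss + F.cross_entropy(
            logits.reshape(-1, V), targets[:, k].reshape(-1)   # sum over codebooks
        )
    return loss
```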
Conditioning.
- Text-to-music: a frozen T5 or FLAN-T5 encoder embeds the text caption (e.g. "80s synthwave with a driving bassline"); cross-attention layers in the Transformer attend to the text embeddings.
- Melody-conditioned: a chromagram (12 pitch classes per frame) of a reference recording is computed, and keeping only the dominant pitch class per frame yields an unsupervised melody signal; chroma embeddings of this signal are concatenated to the Transformer input. This enables "play this tune in style X" (a rough approximation of the frontend is sketched after this list).
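The sketch below approximates the melody frontend with librosa; MusicGen's exact frame rate and quantisation differ, this only illustrates the chromagram-and-argmax idea, and `reference.wav` is a placeholder file name:

```python
import librosa
import numpy as np

# Load the reference at EnCodec's 32 kHz sample rate.
y, sr = librosa.load("reference.wav", sr=32000)

chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # [12, n_frames], energy per pitch class
dominant = chroma.argmax(axis=0)                   # [n_frames], loudest pitch class per frame
melody = np.eye(12)[dominant]                      # one-hot melody signal, [n_frames, 12]
```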
Classifier-free guidance. During training, the conditioning is dropped 10% of the time. At inference, the conditional and unconditional logits are combined at each step as
$$\hat{\ell} = \ell_{\text{uncond}} + w \left( \ell_{\text{cond}} - \ell_{\text{uncond}} \right),$$
with guidance scale $w = 3$, sharpening adherence to the prompt at the cost of mode coverage.
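In code, the guidance step is a linear extrapolation of the two logit sets at every decoding step (a schematic fragment, not audiocraft's actual sampler):

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               w: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the prediction away from the unconditional
    distribution towards the conditional one. w = 1 recovers plain conditional sampling."""
    return uncond_logits + w * (cond_logits - uncond_logits)

# In practice both logit sets come from one batched forward pass: the prompt is
# duplicated, with the text condition nulled out in the second copy.
```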
Training data. 20 k hours of licensed music: an internal collection of 10 k high-quality tracks plus the ShutterStock and Pond5 stock-music catalogues. Captions come from track metadata enriched with genre, mood, and instrument tags.
Capabilities and limits. MusicGen produces 30-second clips with coherent melody, harmony, and rhythm, and it generalises to instrument combinations unseen during training. It struggles with long-form structure (e.g. verse-chorus form), vocals (there is no lyric conditioning), and timbral fidelity at high frequencies. MusicGen-Stereo (2024) added stereo output, and AudioGen is the sibling text-to-sound-effect model. Open weights and inference code on Hugging Face made MusicGen the default research baseline for controllable music generation.
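A typical inference call follows the audiocraft README; the snippet below assumes that package is installed and that its published API is unchanged:

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")   # 300 M text-only checkpoint
model.set_generation_params(duration=8)                      # seconds of audio to generate

wav = model.generate(["80s synthwave with a driving bassline"])  # [batch, channels, samples]
audio_write("synthwave", wav[0].cpu(), model.sample_rate, strategy="loudness")
```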
Related terms: EnCodec, Transformer, Vector Quantisation, AudioLM, VALL-E
Discussed in:
- Chapter 12: Sequence Models, Music Generation