Audio foundation models are large pretrained models for which audio (speech, music, or environmental sound) is a first-class input or output modality. The space has expanded rapidly beyond OpenAI's Whisper (a 2022 ASR model) into multilingual speech-to-speech translation, audio-understanding LLMs, and unified text-and-audio generators.
Architectural common ground. Most audio foundation models share three components (a minimal code sketch follows the list):
Audio tokeniser. A learned codec that converts raw waveform (16 or 24 kHz) into a discrete or continuous token sequence at $\sim$25–75 frames per second. EnCodec (Meta 2022), SoundStream (Google 2021), and Mimi (Kyutai 2024) are common choices. A typical 1-second audio clip becomes 50–150 tokens.
Transformer. A decoder-only or encoder-decoder transformer operating on audio tokens, optionally interleaved with text tokens.
Detokeniser. The inverse of the codec, mapping generated tokens back to a waveform.
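The sketch below shows how the three components compose, using hypothetical stand-ins rather than any real codec or model: a toy codec quantises one second of waveform into ~50 token ids, a small decoder-only transformer predicts the next token, and the inverse codec maps tokens back to samples. All names (`ToyCodec`, `ToyAudioLM`) and sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

FRAME_RATE = 50      # tokens per second (within the typical 25-75 range)
VOCAB_SIZE = 1024    # codebook size of the stand-in codec
D_MODEL = 256

class ToyCodec(nn.Module):
    """Stand-in for EnCodec/SoundStream/Mimi: waveform <-> discrete token ids."""
    def __init__(self, sample_rate=16_000):
        super().__init__()
        self.hop = sample_rate // FRAME_RATE          # samples per token

    def encode(self, wav):                            # wav: (batch, samples)
        frames = wav.unfold(-1, self.hop, self.hop)   # (batch, T, hop)
        energy = frames.pow(2).mean(-1)               # crude per-frame feature
        return (energy / (energy.max() + 1e-8) * (VOCAB_SIZE - 1)).long()

    def decode(self, tokens):                         # tokens: (batch, T)
        # a real codec has a learned decoder; here we only restore the length
        return torch.zeros(tokens.shape[0], tokens.shape[1] * self.hop)

class ToyAudioLM(nn.Module):
    """Decoder-only transformer over audio token ids."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                           # next-token logits

codec, lm = ToyCodec(), ToyAudioLM()
wav = torch.randn(1, 16_000)                          # 1 s of fake audio
tokens = codec.encode(wav)                            # (1, 50) token ids
logits = lm(tokens)                                   # (1, 50, 1024)
next_token = logits[:, -1].argmax(-1, keepdim=True)   # one greedy continuation step
out_wav = codec.decode(torch.cat([tokens, next_token], dim=1))
print(tokens.shape, logits.shape, out_wav.shape)
```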
AudioPaLM (Rubenstein et al. 2023, Google). Combines PaLM-2 with AudioLM's discrete audio tokens in a single vocabulary. Supports speech-to-text, text-to-speech, and speech-to-speech translation: speak English into the model, hear the same content in Spanish in your own voice. Training data interleaves text-only PaLM data, ASR transcripts, TTS pairs, and parallel speech corpora.
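The "single vocabulary" idea can be sketched in a few lines (assumed sizes, not AudioPaLM's actual code): start from a pretrained text LM's embedding table and append freshly initialised rows for the audio token ids, so one sequence can mix both modalities.

```python
import torch
import torch.nn as nn

text_vocab, audio_vocab, d_model = 32_000, 2_048, 512     # assumed sizes

text_embed = nn.Embedding(text_vocab, d_model)            # stands in for the text LM's table
combined = nn.Embedding(text_vocab + audio_vocab, d_model)
with torch.no_grad():
    combined.weight[:text_vocab] = text_embed.weight       # reuse the text rows
    # the new audio rows keep their random init and are learned during finetuning

# a mixed sequence: two text ids followed by two audio ids (offset by text_vocab)
mixed = torch.tensor([[5, 17, text_vocab + 3, text_vocab + 99]])
print(combined(mixed).shape)                               # (1, 4, 512)
```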
SeamlessM4T (Meta AI 2023). "Massively Multilingual & Multimodal Machine Translation". A single model that performs ASR, text-to-text translation, speech-to-text translation, text-to-speech, and speech-to-speech translation across 100+ languages. Architecture: w2v-BERT 2.0 speech encoder, NLLB-derived text encoder, separate text decoder and unit-based speech decoder. SeamlessM4T v2 (Dec 2023) added expressive prosody preservation and on-device variants.
Qwen-Audio and Qwen2-Audio (Alibaba 2023, 2024). Audio-input LLMs in the Qwen family. They take audio (speech, music, environmental sound) as input and produce text, and are trained on a 30-task mixture (transcription, translation, captioning, sound classification, music understanding) to encourage instruction following. Qwen2-Audio supports two interaction modes: voice chat with the model, or prompted audio-analysis tasks.
AudioLM (Borsos et al. 2023, Google). Generates continuations of audio (speech or music) without text supervision, using a hierarchy of "semantic" and "acoustic" tokens to capture content and prosody respectively. Influential for showing that arbitrary audio can be modelled autoregressively as token sequences.
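A structural sketch of this coarse-to-fine decomposition (both stages are random stand-ins, not the real models, and all sizes are assumed): a semantic stage extends the content plan, then an acoustic stage fills in codec tokens conditioned on it.

```python
import torch

SEM_VOCAB, AC_VOCAB = 512, 1024        # assumed codebook sizes

def semantic_stage(prompt_sem, steps=20):
    # stand-in for the semantic LM: autoregressively extend the content plan
    continuation = torch.randint(0, SEM_VOCAB, (steps,))
    return torch.cat([prompt_sem, continuation])

def acoustic_stage(semantic_tokens, codebooks=4):
    # stand-in for the acoustic LM: one column of RVQ codec tokens per frame,
    # conditioned (in the real model) on the semantic tokens
    return torch.randint(0, AC_VOCAB, (len(semantic_tokens), codebooks))

prompt = torch.randint(0, SEM_VOCAB, (150,))   # semantic tokens of a short audio prompt
sem = semantic_stage(prompt)                   # content + coarse prosody
ac = acoustic_stage(sem)                       # fine acoustic detail
print(sem.shape, ac.shape)                     # a codec decoder would turn `ac` into audio
```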
GPT-4o audio. OpenAI's GPT-4o handles audio in and out natively, with $\sim$320 ms end-to-end latency, supporting voice chat, real-time translation, and prosodic expression (laughter, singing) that pipelined ASR→LLM→TTS systems lose.
Mathematical view. Audio foundation models are language models over a token vocabulary that includes audio tokens:
$$p(\mathbf{y}_{1:T} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{\lt t}, \mathbf{x}; \theta)$$
where $\mathbf{x}$ may contain text or audio tokens, and $\mathbf{y}$ may likewise consist of text or audio tokens. Because audio shares the vocabulary with text, it is not "translated" into text first; the model can preserve speaker identity, emotion, and timing.
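This factorisation can be made concrete in a few lines (random logits stand in for a model such as the toy one above; vocabulary sizes are assumptions): the sequence log-probability is the sum of per-step next-token log-probabilities, whether each id denotes a text or an audio token.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB, AUDIO_VOCAB = 32_000, 1_024
vocab_size = TEXT_VOCAB + AUDIO_VOCAB              # one shared vocabulary

x = torch.randint(0, TEXT_VOCAB, (4,))             # prompt: text token ids
y = torch.randint(TEXT_VOCAB, vocab_size, (6,))    # output: audio token ids
seq = torch.cat([x, y])

logits = torch.randn(len(seq), vocab_size)         # stand-in for model outputs
log_probs = F.log_softmax(logits, dim=-1)

# log p(y | x) = sum_t log p(y_t | y_<t, x): score each y_t with the
# distribution produced at the position just before it.
positions = torch.arange(len(x) - 1, len(seq) - 1)
log_p_y_given_x = log_probs[positions, seq[len(x):]].sum()
print(float(log_p_y_given_x))
```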
Significance. Audio foundation models close the loop on multimodal AI: a single transformer can listen, think, and speak. Combined with vision-language models, they enable assistants such as GPT-4o Voice Mode, Gemini Live, and Claude's voice mode that hold real-time multimodal conversations.
Related terms: Whisper, AudioLM, Vision-Language Model, GPT-4V and GPT-4o Vision, Gemini Multimodal, Transformer
Discussed in:
- Chapter 11: CNNs, Audio Foundation Models