AudioSet (Gemmeke, Ellis, Freedman et al., ICASSP 2017) is Google's large-scale audio-event dataset, the canonical pretraining and benchmark resource for non-speech audio classification. It is to general audio what ImageNet was to vision.
Composition
AudioSet consists of 2,084,320 ten-second clips drawn from YouTube videos, hand-labelled against a 632-class ontology (527 classes carry labels in the released data), organised hierarchically into seven top-level domains: Human sounds, Animal sounds, Music, Natural sounds, Sounds of things, Source-ambiguous sounds and Channel/environment/background. Clips typically carry multiple event labels, averaging two to three per clip.
Annotations were collected from trained human raters under an extensive quality-control protocol. The clips themselves are not redistributed: only YouTube video IDs and start/end timestamps are published, so downstream users must fetch the audio independently.
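Because only IDs and timestamps are released, working with AudioSet begins with the published segment CSVs, whose rows hold a video ID, start and end seconds, and a quoted comma-separated list of ontology machine IDs (e.g. `/m/09x0r` for Speech, `/m/04rlf` for Music). A minimal parser, with an illustrative line in that format (the video ID below is made up, not a real AudioSet entry):

```python
import csv
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    ytid: str          # YouTube video ID
    start: float       # clip start within the video, seconds
    end: float         # clip end, seconds
    labels: List[str]  # ontology machine IDs, e.g. "/m/09x0r" (Speech)

def parse_segments(lines):
    """Parse AudioSet segment-CSV lines, skipping '#' comment headers."""
    segments = []
    # skipinitialspace handles the ", " separators and the quoted label field
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue
        ytid, start, end, labels = row[0], row[1], row[2], row[3]
        segments.append(Segment(ytid, float(start), float(end),
                                [l.strip() for l in labels.split(",")]))
    return segments

# Illustrative input in the released format (hypothetical video ID):
sample = [
    "# Segments csv",
    "# YTID, start_seconds, end_seconds, positive_labels",
    'abc123XYZ_0, 30.000, 40.000, "/m/09x0r,/m/04rlf"',
]
segs = parse_segments(sample)
```

Fetching the actual audio for each segment (typically with yt-dlp plus ffmpeg trimming) is left to the user and is where link rot, discussed under Licensing below, bites.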
Models trained on AudioSet
Models pretrained on AudioSet include VGGish, the de facto baseline audio feature extractor; YAMNet, its mobile-friendly successor; the PANN family (Pre-trained Audio Neural Networks; Kong et al., 2020); AST (Audio Spectrogram Transformer); HTS-AT; and the audio encoder of CLAP (Contrastive Language-Audio Pretraining; Wu et al., 2022). It also appears in the pretraining of AudioLM, MusicLM, MusicGen and Google's AudioPaLM as a supervised classification side-task.
Licensing
Annotations are released under CC-BY-4.0. The underlying YouTube audio is not relicensed; downstream users acquire it under YouTube's terms of service. Link rot is significant: a 2023 audit found 25-30% of original YouTube IDs no longer resolve, eroding the corpus.
Limitations
AudioSet's class distribution is heavily long-tailed: Music and Speech dominate, while many fine-grained classes have fewer than 100 examples. Annotations are weak: clips are 10 seconds long, but a labelled event may occupy only a fraction of that window, and within-clip start/end timestamps are usually absent. Cultural skew is also significant: Western popular music is over-represented relative to non-Western traditions.
Modern relevance
AudioSet remains the standard pretraining and evaluation set for general audio classification. It is increasingly supplemented by VGGSound, FSD50K (Freesound-sourced, fully open audio) and MMAU for multimodal evaluation, but the AudioSet pretraining pipeline retains state-of-the-art status on most non-speech audio downstream tasks.
Related terms: LibriSpeech, LibriLight, CLIP
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora