AudioSet (Gemmeke, Ellis, Freedman et al., ICASSP 2017) is Google's large-scale audio-event dataset, the canonical pretraining and benchmark resource for non-speech audio classification. It is to general audio what ImageNet was to vision.
Composition
AudioSet consists of 2,084,320 ten-second clips drawn from YouTube videos, hand-labelled against a 632-class ontology (527 classes carry labels in the released data), organised hierarchically into seven top-level domains: Human sounds, Animal sounds, Music, Natural sounds, Sounds of things, Source-ambiguous sounds and Channel/environment/background. Clips typically carry multiple event labels, averaging two to three per clip.
Annotations were collected from trained human raters under an extensive quality-control protocol. The clips themselves are not redistributed: only YouTube video IDs and start/end timestamps are published, so downstream users must fetch the audio independently.
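Because only IDs and timestamps are released, working with AudioSet begins with the published segment CSVs, whose rows hold a video ID, start and end seconds, and a quoted comma-separated list of ontology machine IDs (e.g. `/m/09x0r` for Speech, `/m/04rlf` for Music). A minimal parser, with an illustrative line in that format (the video ID below is made up, not a real AudioSet entry):

```python
import csv
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    ytid: str          # YouTube video ID
    start: float       # clip start within the video, seconds
    end: float         # clip end, seconds
    labels: List[str]  # ontology machine IDs, e.g. "/m/09x0r" (Speech)

def parse_segments(lines):
    """Parse AudioSet segment-CSV lines, skipping '#' comment headers."""
    segments = []
    # skipinitialspace handles the ", " separators and the quoted label field
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue
        ytid, start, end, labels = row[0], row[1], row[2], row[3]
        segments.append(Segment(ytid, float(start), float(end),
                                [l.strip() for l in labels.split(",")]))
    return segments

# Illustrative input in the released format (hypothetical video ID):
sample = [
    "# Segments csv",
    "# YTID, start_seconds, end_seconds, positive_labels",
    'abc123XYZ_0, 30.000, 40.000, "/m/09x0r,/m/04rlf"',
]
segs = parse_segments(sample)
```

Fetching the actual audio for each segment (typically with yt-dlp plus ffmpeg trimming) is left to the user and is where link rot, discussed under Licensing below, bites.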
Models trained on AudioSet
Models pretrained on AudioSet include VGGish, the de facto baseline audio feature extractor; YAMNet, its mobile-friendly successor; the PANN family (Pre-trained Audio Neural Networks; Kong et al., 2020); AST (Audio Spectrogram Transformer); HTS-AT; and the audio encoder of CLAP (Contrastive Language-Audio Pretraining; Wu et al., 2022). It also appears in the pretraining of AudioLM, MusicLM, MusicGen and Google's AudioPaLM as a supervised classification side-task.
Licensing
Annotations are released under CC-BY-4.0. The underlying YouTube audio is not relicensed; downstream users acquire it under YouTube's terms of service. Link rot is significant: a 2023 audit found 25-30% of original YouTube IDs no longer resolve, eroding the corpus.
Limitations
AudioSet's class distribution is heavily long-tailed: Music and Speech dominate, while many fine-grained classes have fewer than 100 examples. Annotations are weak: clips are 10 seconds long, but a labelled event may occupy only a fraction of that window, and within-clip start/end timestamps are usually absent. Cultural skew is also significant: Western popular music is over-represented relative to non-Western traditions.
Modern relevance
AudioSet remains the standard pretraining and evaluation set for general audio classification. It is increasingly supplemented by VGGSound, FSD50K (Freesound-sourced, fully open audio) and MMAU for multimodal evaluation, but the AudioSet pretraining pipeline retains state-of-the-art status on most non-speech audio downstream tasks.
Related terms: LibriSpeech, LibriLight, CLIP
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora