Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, & Ilya Sutskever (2023). Robust Speech Recognition via Large-Scale Weak Supervision.
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2212.04356
Abstract. OpenAI's Whisper paper. It trains a single encoder-decoder Transformer on 680,000 hours of multilingual web audio that is only weakly labelled: the paired transcripts vary widely in quality. The model handles speech recognition, speech-to-English translation, and language identification across 99 languages in a single network, and it matches or exceeds specialised commercial systems on most benchmarks. Whisper is the canonical example of an encoder-decoder Transformer in which the encoder and decoder play clearly distinct roles, and the open-weights release made high-quality multilingual transcription a commodity capability.
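The encoder/decoder split can be sketched in a few lines. This is a toy NumPy illustration of the roles, not Whisper's actual implementation: the dimensions and frame counts are made up, layer norms, multi-head projections, and the causal mask are omitted, and real Whisper widths run from 384 (tiny) to 1280 (large).

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d = 8  # toy model width (hypothetical; real Whisper uses 384-1280)

# Encoder: attends bidirectionally over the whole audio input
# (in Whisper, 80-channel log-mel spectrogram frames).
audio_frames = rng.standard_normal((30, d))   # 30 toy "audio" frames
encoded = attention(audio_frames, audio_frames, audio_frames)

# Decoder: generates text tokens autoregressively; each step
# self-attends over previous tokens (causal mask omitted here)
# and cross-attends into the encoder's audio representation.
tokens = rng.standard_normal((5, d))          # 5 text-token embeddings so far
self_attended = attention(tokens, tokens, tokens)
out = attention(self_attended, encoded, encoded)  # cross-attention

print(out.shape)  # one hidden state per text token: (5, 8)
```

The point of the sketch is the asymmetry the annotation describes: the encoder sees the entire audio at once and never emits tokens, while the decoder is the only part that produces text, conditioning on the audio solely through cross-attention.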
Tags: speech transformers multimodal
Cited in: