Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, & Ilya Sutskever (2023). Robust Speech Recognition via Large-Scale Weak Supervision.
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2212.04356
Abstract. OpenAI's Whisper paper. It trains a single encoder-decoder Transformer on 680,000 hours of multilingual web audio that is only weakly labelled: the paired transcripts vary widely in quality. The model handles speech recognition, speech-to-English translation, and language identification across 99 languages in a single network, and it matches or exceeds specialised commercial systems on most benchmarks. Whisper is the canonical example of an encoder-decoder Transformer in which the encoder and decoder play clearly distinct roles, and the open-weights release made high-quality multilingual transcription a commodity capability.
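The encoder/decoder split can be sketched in a few lines. This is a toy NumPy illustration of the roles, not Whisper's actual implementation: the dimensions and frame counts are made up, layer norms, multi-head projections, and the causal mask are omitted, and real Whisper widths run from 384 (tiny) to 1280 (large).

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d = 8  # toy model width (hypothetical; real Whisper uses 384-1280)

# Encoder: attends bidirectionally over the whole audio input
# (in Whisper, 80-channel log-mel spectrogram frames).
audio_frames = rng.standard_normal((30, d))   # 30 toy "audio" frames
encoded = attention(audio_frames, audio_frames, audio_frames)

# Decoder: generates text tokens autoregressively; each step
# self-attends over previous tokens (causal mask omitted here)
# and cross-attends into the encoder's audio representation.
tokens = rng.standard_normal((5, d))          # 5 text-token embeddings so far
self_attended = attention(tokens, tokens, tokens)
out = attention(self_attended, encoded, encoded)  # cross-attention

print(out.shape)  # one hidden state per text token: (5, 8)
```

The point of the sketch is the asymmetry the annotation describes: the encoder sees the entire audio at once and never emits tokens, while the decoder is the only part that produces text, conditioning on the audio solely through cross-attention.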
Tags: speech transformers multimodal
Cited in: