Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, & Veselin Stoyanov (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
arXiv:1907.11692.
URL: https://arxiv.org/abs/1907.11692
Abstract. A controlled study of BERT's pretraining recipe. The authors find that BERT was substantially undertrained: RoBERTa is the same architecture trained on roughly 10× more data, with larger batches, longer sequences, dynamic masking, and without the next-sentence-prediction objective. The resulting model outperforms BERT on every benchmark and matches or exceeds the contemporaneous XLNet. The paper is the standard citation for the conclusion that NSP does not help, and for the broader observation that BERT's reported numbers reflected its training budget rather than an architectural ceiling.
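A minimal sketch of the static-versus-dynamic masking distinction mentioned above, assuming a toy token-id representation; the mask-token id, the masking rate constant, and the helper names are illustrative, not the paper's preprocessing code, and the 80/10/10 replacement rule and BERT's duplication of the data under several mask patterns are omitted for brevity.

```python
import random

MASK_ID = 103      # hypothetical [MASK] token id
MASK_PROB = 0.15   # masking rate used by BERT and RoBERTa

def mask_tokens(token_ids, rng):
    """Return a copy of token_ids with roughly 15% of positions replaced by [MASK]."""
    masked = list(token_ids)
    for i in range(len(masked)):
        if rng.random() < MASK_PROB:
            masked[i] = MASK_ID
    return masked

def static_masking(dataset, seed=0):
    """Static masking (original BERT preprocessing): the mask pattern is chosen
    once during data preparation, so every epoch sees the same corrupted sequences."""
    rng = random.Random(seed)
    cached = [mask_tokens(seq, rng) for seq in dataset]
    def epoch():
        return cached  # identical masks every epoch
    return epoch

def dynamic_masking(dataset, seed=0):
    """Dynamic masking (RoBERTa): a fresh mask pattern is sampled each time a
    sequence is fed to the model, so repeated epochs see different corruptions."""
    rng = random.Random(seed)
    def epoch():
        return [mask_tokens(seq, rng) for seq in dataset]  # new masks each call
    return epoch
```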
Tags: language-models pretraining bert
Cited in: