References

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, & Veselin Stoyanov (2019)

arXiv:1907.11692.

URL: https://arxiv.org/abs/1907.11692

Abstract. A controlled study of BERT's pretraining recipe. The authors find that BERT was substantially undertrained; RoBERTa is the same architecture trained on roughly 10× more data with larger batches, longer sequences, dynamic masking, and without the next-sentence-prediction (NSP) objective. The resulting model outperforms BERT across the standard benchmarks and matches or exceeds the contemporaneous XLNet. The paper is the standard citation for the conclusion that NSP does not help, and for the broader observation that BERT's reported numbers reflected training budget rather than an architectural ceiling.
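
The dynamic masking mentioned above can be sketched in a few lines: instead of fixing each sequence's masked positions once during preprocessing (BERT's static masking), the positions are re-sampled every time the sequence is served, so each epoch sees a different mask. This is a minimal illustrative sketch, not RoBERTa's actual implementation; the 80/10/10 token-replacement rule and whole-word handling are omitted, and the function name and seeding are my own.

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Re-sample masked positions on each call (RoBERTa-style dynamic
    masking). Returns the masked sequence and per-position labels:
    the original token where masked, None elsewhere."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # model must predict the original token
            masked[i] = mask_token   # 80/10/10 replacement rule omitted
    return masked, labels

tokens = ["the", "quick", "brown", "fox", "jumps"]
# Serving the same sequence twice yields two different maskings,
# whereas static masking would reuse one fixed mask every epoch.
epoch1, labels1 = dynamic_mask(tokens, mask_prob=0.5, seed=1)
epoch2, labels2 = dynamic_mask(tokens, mask_prob=0.5, seed=2)
```
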

Tags: language-models pretraining bert
