Rico Sennrich, Barry Haddow, & Alexandra Birch (2016)
Neural Machine Translation of Rare Words with Subword Units.
Annual Meeting of the Association for Computational Linguistics.
URL: https://arxiv.org/abs/1508.07909
Abstract. Adapts byte-pair encoding (BPE), originally a 1994 data-compression algorithm by Philip Gage, to neural machine translation as a subword-tokenisation scheme. The training procedure: start with a character-level vocabulary, repeatedly merge the most frequent adjacent symbol pair, and stop when the vocabulary reaches the target size. The result is a vocabulary in which common words appear as single tokens and rare words decompose into subwords, giving an open vocabulary that sidesteps the unknown-token problem and substantially improving translation of rare and morphologically complex words. BPE underlies most modern language-model tokenisers, including those used by GPT and Claude.
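A minimal sketch of the merge loop described above, in the spirit of the paper's reference implementation; the helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`) are illustrative, not from the paper, and the toy word frequencies follow the paper's running example:

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the symbol pair into a single new symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds ensure we only match whole symbols, not substrings of longer ones.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Represent each word as space-separated characters plus an end-of-word
    # marker, so merges cannot cross word boundaries.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy corpus: frequent character sequences (e.g. "est") become single symbols.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges)
```

Applying the learned merges in order to unseen text reproduces the segmentation, so rare words such as "lowest" split into known subwords ("low", "est</w>") rather than mapping to an unknown token.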
Tags: tokenisation language-models machine-translation
Cited in: