Rico Sennrich, Barry Haddow, & Alexandra Birch (2016)
Neural Machine Translation of Rare Words with Subword Units.
Annual Meeting of the Association for Computational Linguistics.
URL: https://arxiv.org/abs/1508.07909
Abstract. Adapts byte-pair encoding (BPE), originally a 1994 data-compression algorithm by Philip Gage, to neural machine translation as a subword-tokenisation scheme. The training procedure: start with a character-level vocabulary, repeatedly merge the most frequent adjacent symbol pair, and stop when the vocabulary reaches the target size. The result is a vocabulary in which common words appear as single tokens and rare words decompose into subwords, giving an open vocabulary that sidesteps the unknown-token problem and substantially improving translation of rare and morphologically complex words. BPE underlies most modern language-model tokenisers, including those used by GPT and Claude.
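A minimal sketch of the merge loop described above, in the spirit of the paper's reference implementation; the helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`) are illustrative, not from the paper, and the toy word frequencies follow the paper's running example:

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the symbol pair into a single new symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds ensure we only match whole symbols, not substrings of longer ones.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Represent each word as space-separated characters plus an end-of-word
    # marker, so merges cannot cross word boundaries.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy corpus: frequent character sequences (e.g. "est") become single symbols.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges)
```

Applying the learned merges in order to unseen text reproduces the segmentation, so rare words such as "lowest" split into known subwords ("low", "est</w>") rather than mapping to an unknown token.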
Tags: tokenisation language-models machine-translation
Cited in: