12.4 Word representations

The first step in any neural language or sequence model is to convert each discrete token into a vector. The history of word representations is the story of moving from sparse, atomic representations to dense, semantic ones.

12.4.1 One-hot encoding and the curse of orthogonality

Given a vocabulary $\mathcal{V}$ with $|\mathcal{V}| = V$, the one-hot encoding maps the $i$-th word to the vector $e_i \in \mathbb{R}^V$ with a $1$ in position $i$ and zeros elsewhere. Every pair of distinct words is orthogonal: $e_i \cdot e_j = \delta_{ij}$. This is mathematically convenient but linguistically catastrophic: the cosine similarity of "cat" and "dog" equals the cosine similarity of "cat" and "democracy", which is zero.

Multiplying a one-hot vector by a matrix $W \in \mathbb{R}^{d \times V}$ (forming $W e_i$) simply selects column $i$ of $W$. For this reason, every neural language model in practice replaces one-hot encoding by a lookup table: a learnable matrix $E \in \mathbb{R}^{V \times d}$ in which each row is the embedding of one vocabulary item. The embedding of word $i$ is $E_i \in \mathbb{R}^d$, and it is treated as a learnable parameter trained jointly with the rest of the network. Typical $d$ ranges from 100 to 1024; the embedding table is by far the largest parameter block in many language models.
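
As a quick illustration, the following sketch (toy sizes and names are assumptions, not from the text) checks that multiplying a one-hot vector by the embedding table is the same as indexing a row directly:

```python
import numpy as np

V, d = 10_000, 300                      # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))             # embedding table, one row per word

i = 42                                  # index of some word
one_hot = np.zeros(V)
one_hot[i] = 1.0

# Multiplying the one-hot vector by E selects row i of E ...
via_matmul = one_hot @ E
# ... so in practice we just index the table directly.
via_lookup = E[i]

assert np.allclose(via_matmul, via_lookup)
```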

12.4.2 The distributional hypothesis

Why should we expect dense vectors to capture meaning at all? The empirical answer rests on the distributional hypothesis, attributed to the linguists Zellig Harris and J.R. Firth: "you shall know a word by the company it keeps". Words that appear in similar contexts tend to have similar meanings. If we can build vectors so that contextual similarity is reflected as geometric similarity, then those vectors capture, to a first approximation, semantic similarity.

Two strategies have been pursued. Count-based methods build a co-occurrence matrix $X \in \mathbb{R}^{V \times V}$ where $X_{ij}$ is, for example, the number of times word $j$ appeared within a small window of word $i$, then reduce its dimension via a matrix factorisation (truncated SVD, latent semantic analysis). Predictive methods train a neural network to predict context from word, or word from context, and read off the embeddings from the trained weights. Word2vec and its descendants are predictive; GloVe is a hybrid; both classes converge to qualitatively similar representations.
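
A minimal sketch of the count-based route, using an assumed toy corpus and window size, might look like this:

```python
import numpy as np

# Toy corpus and window size are illustrative assumptions.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 2

# Windowed co-occurrence counts: X[i, j] = times word j appears near word i.
X = np.zeros((V, V))
for t, w in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            X[idx[w], idx[corpus[j]]] += 1

# Truncated SVD: keep the top-d singular directions as dense word vectors.
d = 3
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embeddings = U[:, :d] * S[:d]           # one d-dimensional vector per word
print({w: embeddings[idx[w]].round(2) for w in ["cat", "dog"]})
```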

12.4.3 Word2vec: skip-gram with negative sampling

Mikolov et al. (2013) introduced the word2vec family. Its two training objectives are:

  • Continuous bag-of-words (CBOW): predict the centre word from the average of its surrounding context words.
  • Skip-gram: predict each context word from the centre word.

Skip-gram is the more widely used variant. Fix a context window of half-size $c$ (typically 5 to 10). For every centre word $w_t$ in a corpus, the model defines a probability of any other word $w_{t+j}$ ($j \in \{-c, \ldots, -1, 1, \ldots, c\}$) being a context word given the centre:

$$P(w_{t+j} \mid w_t) = \frac{\exp\left(u_{w_{t+j}}^\top v_{w_t}\right)}{\sum_{w \in \mathcal{V}} \exp\left(u_w^\top v_{w_t}\right)}.$$

Each word has two embeddings: a "centre-word" vector $v_w$ and a "context-word" vector $u_w$. The training objective is to maximise the average log-probability of the context words across the corpus,

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t).$$
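
A minimal sketch of this full-softmax probability, with assumed toy sizes and randomly initialised tables, makes the cost per example explicit:

```python
import numpy as np

V, d = 5_000, 100                            # illustrative sizes
rng = np.random.default_rng(1)
v_emb = rng.normal(scale=0.1, size=(V, d))   # centre-word vectors v_w
u_emb = rng.normal(scale=0.1, size=(V, d))   # context-word vectors u_w

def log_softmax_prob(centre, context):
    """log P(context | centre) with the full-vocabulary softmax."""
    scores = u_emb @ v_emb[centre]           # one dot product per vocabulary word
    scores -= scores.max()                   # numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

print(log_softmax_prob(centre=17, context=423))
```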

The denominator of the softmax sums over the entire vocabulary, which is prohibitively expensive for $V$ in the hundreds of thousands. The famous fix is negative sampling, which replaces the expensive softmax with a binary classification: distinguish the true context word from $K$ random "noise" words drawn from a unigram distribution $P_n(w)$ (typically the empirical unigram distribution raised to the power $3/4$, which down-weights very frequent words).

For a single (centre, context) pair $(w_t, w_{t+j})$ with $K$ negatives $\{w_k\}_{k=1}^{K} \sim P_n$, the negative-sampling objective is

$$\mathcal{L}_{\mathrm{NS}}(w_t, w_{t+j}) = \log \sigma\left(u_{w_{t+j}}^\top v_{w_t}\right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n}\!\left[ \log \sigma\left( -u_{w_k}^\top v_{w_t}\right) \right],$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic function. The first term pushes the inner product of the centre vector and the true-context vector towards $+\infty$; the second pushes the inner products with the noise vectors towards $-\infty$. The number of dot products per training example is $K + 1$ rather than $V$, a saving of three to four orders of magnitude.
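
The following sketch computes the negative-sampling loss for one (centre, context) pair; the toy sizes, random initialisation, and stand-in unigram counts are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, K = 5_000, 100, 5
v_emb = rng.normal(scale=0.1, size=(V, d))   # centre vectors v_w
u_emb = rng.normal(scale=0.1, size=(V, d))   # context vectors u_w

counts = rng.integers(1, 1_000, size=V)      # stand-in unigram counts
P_n = counts ** 0.75                         # unigram^(3/4) noise distribution
P_n = P_n / P_n.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ns_loss(centre, context):
    """Negative of L_NS for a single (centre, context) pair."""
    negatives = rng.choice(V, size=K, p=P_n)
    pos = np.log(sigmoid(u_emb[context] @ v_emb[centre]))
    neg = np.log(sigmoid(-u_emb[negatives] @ v_emb[centre])).sum()
    return -(pos + neg)                      # K + 1 dot products, not V

print(ns_loss(centre=17, context=423))
```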

Derivation of the negative-sampling gradient. Differentiating $\mathcal{L}_{\mathrm{NS}}$ with respect to $v_{w_t}$:

$$\frac{\partial \mathcal{L}_{\mathrm{NS}}}{\partial v_{w_t}} = \left(1 - \sigma\left(u_{w_{t+j}}^\top v_{w_t}\right)\right) u_{w_{t+j}} - \sum_{k=1}^{K} \sigma\left(u_{w_k}^\top v_{w_t}\right) u_{w_k}.$$

Each gradient step moves $v_{w_t}$ in the direction of $u_{w_{t+j}}$ (with weight $1 - \sigma(\cdot)$, the "missing probability") and away from each noise word's $u_{w_k}$. After many such updates, words that frequently co-occur develop large positive inner products, while words that almost never co-occur develop negative inner products.
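
A sketch of one such update, reusing the arrays and helper from the sketch above and an assumed learning rate, applies exactly this gradient (plus the symmetric ascent updates for the context and noise vectors, which the text does not spell out):

```python
def sgd_step(centre, context, lr=0.025):
    negatives = rng.choice(V, size=K, p=P_n)
    v = v_emb[centre].copy()

    s_pos = sigmoid(u_emb[context] @ v)          # sigma(u_context . v)
    s_neg = sigmoid(u_emb[negatives] @ v)        # sigma(u_k . v), one per negative

    # Gradient w.r.t. the centre vector, as derived above.
    grad_v = (1.0 - s_pos) * u_emb[context] - (s_neg[:, None] * u_emb[negatives]).sum(axis=0)

    # Gradient ascent on the objective: pull the context vector towards v,
    # push the noise vectors away, then move the centre vector.
    u_emb[context] += lr * (1.0 - s_pos) * v
    u_emb[negatives] -= lr * s_neg[:, None] * v
    v_emb[centre] += lr * grad_v

sgd_step(centre=17, context=423)
```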

The geometry that emerges has a useful property. Pairs of vectors that stand in the same semantic relation tend to have approximately equal vector differences. The canonical example is

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}.$$

This linear analogy structure is not built into the model; it is an emergent property of optimising the skip-gram objective on natural-language corpora. Exactly why it emerges is still an open theoretical question; the most influential analyses (Levy and Goldberg 2014) show that skip-gram with negative sampling implicitly factorises a shifted pointwise mutual information matrix, which goes some way to explaining the geometry but does not fully account for the analogy phenomenon.
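
In practice the analogy is answered by a nearest-neighbour query under cosine similarity, excluding the three query words. A sketch, assuming a pretrained embedding matrix `E`, a word list `vocab`, and an index map `idx` (none of which are defined in the text):

```python
import numpy as np

def analogy(a, b, c, E, vocab, idx):
    """Return the word d such that a : b :: c : d (the "3CosAdd" query)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)      # unit-normalise rows
    query = En[idx[b]] - En[idx[a]] + En[idx[c]]
    sims = En @ (query / np.linalg.norm(query))            # cosine with every word
    for w in (a, b, c):                                     # exclude the inputs
        sims[idx[w]] = -np.inf
    return vocab[int(sims.argmax())]

# Usage (assuming E, vocab, idx come from a trained model):
# analogy("man", "king", "woman", E, vocab, idx)   # expected: "queen"
```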

12.4.4 GloVe

GloVe (Pennington et al. 2014) reframes the problem explicitly as matrix factorisation. Build the co-occurrence matrix $X$ where $X_{ij}$ counts how often word $j$ appears in the context of word $i$. The GloVe objective is

$$\mathcal{L}_{\mathrm{GloVe}} = \sum_{i, j} f(X_{ij}) \left( v_i^\top u_j + b_i + \tilde b_j - \log X_{ij} \right)^2,$$

where $b_i$ and $\tilde b_j$ are scalar biases, the sum runs over pairs with $X_{ij} > 0$, and $f(x) = \min\{(x/x_{\max})^\alpha, 1\}$ is a weighting function (with $x_{\max} = 100$ and $\alpha = 0.75$) that down-weights very rare co-occurrences and caps the influence of very frequent ones. The objective trains the dot product $v_i^\top u_j$ to predict the log co-occurrence; it can be derived from a more elaborate argument about ratios of co-occurrence probabilities, which we do not reproduce here. Empirically, GloVe vectors and skip-gram vectors achieve nearly identical scores on standard intrinsic evaluations (analogy and similarity benchmarks).
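
A sketch of this objective on a stand-in co-occurrence matrix (toy sizes and random initialisation are assumptions) shows the weighting function and the restriction to observed pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 200, 50
X = rng.poisson(2.0, size=(V, V)).astype(float)   # stand-in co-occurrence counts

v = rng.normal(scale=0.1, size=(V, d))            # word vectors v_i
u = rng.normal(scale=0.1, size=(V, d))            # context vectors u_j
b = np.zeros(V)                                   # biases b_i
b_tilde = np.zeros(V)                             # biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, capped at 1 for frequent ones."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss():
    i, j = np.nonzero(X)                          # only pairs with X_ij > 0
    pred = (v[i] * u[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return (f(X[i, j]) * err ** 2).sum()

print(glove_loss())
```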

12.4.5 FastText

FastText (Bojanowski et al. 2017) tackles two limitations of word2vec/GloVe. First, both treat each word type as an atom ("running", "runner", "runs" are separate vectors), so morphological information must be re-learnt for each form. Second, both are unable to embed any word that did not appear in training, which is fatal for languages with rich morphology (Finnish, Turkish, Arabic) where the long tail of inflected forms is enormous.

FastText represents each word as a bag of character n-grams of lengths 3 to 6, plus a special token for the whole word. The word "where" with $n = 3$ becomes the multiset `{<wh, whe, her, ere, re>}` plus the whole-word token `<where>`. Each n-gram has its own embedding, and the word's embedding is the sum of its n-gram embeddings. This means: morphologically related words share parameters; out-of-vocabulary words have well-defined embeddings (assemble them from their character n-grams); and the parameter count grows with the number of distinct n-grams (hashed into a fixed number of buckets in practice), which stays manageable even for very large $V$.
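
An illustrative sketch of the decomposition and the summed embedding follows; the hash-bucketed n-gram table and its size are assumptions standing in for the real implementation's fixed bucket table:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                               # add boundary markers
    grams = {w}                                   # special whole-word token
    for n in range(n_min, n_max + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

rng = np.random.default_rng(4)
n_buckets, d = 100_000, 50                        # illustrative sizes
bucket_emb = rng.normal(scale=0.1, size=(n_buckets, d))

def word_vector(word):
    """Sum of the embeddings of the word's n-grams (works for unseen words too)."""
    idxs = [hash(g) % n_buckets for g in char_ngrams(word)]
    return bucket_emb[idxs].sum(axis=0)

print(char_ngrams("where", 3, 3))                 # {'<wh', 'whe', 'her', 'ere', 're>', '<where>'}
print(word_vector("unseenword").shape)            # (50,)
```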

12.4.6 Limitations of static embeddings

Word2vec, GloVe, and FastText all produce static embeddings: a single vector per word type, regardless of the sentence in which the word appears. The word "bank" in "the bank of the river" gets the same vector as "the bank lent us money". This conflates senses and is a hard ceiling on downstream performance. The successor architectures (ELMo, BERT, GPT) produce contextualised embeddings, in which the vector for "bank" is computed dynamically from the entire sentence. We meet ELMo briefly in §12.10 and contextualised models proper in Chapter 13.

A second limitation, of more recent concern, is that embeddings inherit biases from the corpora they are trained on. Bolukbasi et al. (2016) showed that occupational nouns in word2vec embeddings trained on Google News carried measurable gender bias: "doctor" lies closer to the male direction and "nurse" closer to the female direction. Debiasing techniques (Bolukbasi et al. 2016; Manzini et al. 2019) project out the bias subspace, but auditing for and correcting these effects is an ongoing concern, especially in high-stakes applications.
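
A rough sketch of the projection idea, simplified to a single bias direction estimated from two definitional pairs (the published method uses a PCA over several pairs plus further equalisation steps, not reproduced here):

```python
import numpy as np

def debias(E, idx, pairs=(("he", "she"), ("man", "woman"))):
    """Project out an estimated bias direction from every embedding row."""
    # Bias direction: mean of normalised differences across definitional pairs.
    diffs = [E[idx[a]] - E[idx[b]] for a, b in pairs]
    g = np.mean([dv / np.linalg.norm(dv) for dv in diffs], axis=0)
    g = g / np.linalg.norm(g)
    # Remove the component along g from every vector.
    return E - np.outer(E @ g, g)

# Usage (assuming E, idx come from a trained model):
# E_debiased = debias(E, idx)
```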

12.4.7 Evaluation of word embeddings

How do we tell whether one embedding is better than another? Two families of evaluation are standard.

Intrinsic evaluation measures geometric properties of the embedding space directly. The two most common benchmarks are similarity (correlate embedding-cosine with human-judged word-similarity scores on datasets like WordSim-353 or SimLex-999) and analogy (test the linear-analogy property on lists of curated quadruples like Athens : Greece :: Paris : France or brother : sister :: nephew : niece). Word2vec, GloVe, and FastText all score in similar ranges (Spearman $\rho$ around $0.6$–$0.75$ on similarity benchmarks; analogy accuracy of 60%–75% on the Mikolov analogy set). Intrinsic metrics are quick and reproducible but only loosely correlated with downstream task performance.
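
A sketch of the similarity half of an intrinsic evaluation, with made-up word pairs and scores standing in for a benchmark file such as WordSim-353:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def similarity_eval(E, idx, benchmark):
    """benchmark: list of (word1, word2, human_score) triples."""
    model_scores, human_scores = [], []
    for w1, w2, score in benchmark:
        if w1 in idx and w2 in idx:              # skip out-of-vocabulary pairs
            model_scores.append(cosine(E[idx[w1]], E[idx[w2]]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Usage (assuming E, idx come from a trained model; the triples are placeholders):
# rho = similarity_eval(E, idx, [("tiger", "cat", 7.35), ("book", "paper", 7.46)])
```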

Extrinsic evaluation plugs the embeddings into a downstream system (a part-of-speech tagger, a parser, a sentiment classifier, a translation model) and measures the task metric. This is the metric that ultimately matters but is expensive and depends on the rest of the system. A finding that has held up over time is that, beyond a basic floor of quality, the choice of embedding matters less than the choice of downstream architecture, and contextualised embeddings (next chapter) dominate static ones across most extrinsic benchmarks.
