Glossary

GloVe

GloVe (Global Vectors for word representation) is a method for learning dense vector embeddings of words, introduced by Jeffrey Pennington, Richard Socher and Christopher Manning of Stanford NLP in the 2014 EMNLP paper GloVe: Global Vectors for Word Representation. Whereas word2vec is a prediction-based method (predict context words from a target word, or vice versa, via a sliding window), GloVe operates directly on the global word–word co-occurrence matrix of a corpus, factorising it to produce embeddings whose dot products (plus bias terms) approximate the logarithm of co-occurrence counts.
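
A toy sketch of the co-occurrence counting GloVe starts from (the corpus, window size, and variable names here are illustrative; the reference implementation defaults to a symmetric 10-word window and, as below, weights each co-occurrence by inverse distance):

```python
# Toy sketch: the global word-word co-occurrence counts GloVe starts from.
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2  # illustrative; the reference implementation defaults to 10

cooc = defaultdict(float)  # (word, context_word) -> weighted count X_ij
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # GloVe weights a pair d words apart by 1/d
                cooc[(word, tokens[j])] += 1.0 / abs(i - j)

print(cooc[("sat", "on")])  # adjacent in both sentences -> 2.0
```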

Objective

Let $X_{ij}$ denote the number of times word $j$ appears in the context of word $i$ in the training corpus, and let $X_i = \sum_k X_{ik}$ and $P_{ij} = X_{ij} / X_i$ be the co-occurrence probability. GloVe learns word vectors $w_i$ and context vectors $\tilde{w}_j$ (with biases $b_i, \tilde{b}_j$) by minimising the weighted least-squares objective

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where the weighting function $f(x) = \min(1, (x/x_{\max})^\alpha)$ (typically $x_{\max} = 100$, $\alpha = 0.75$) downweights rare co-occurrences, caps the influence of very frequent ones at 1, and vanishes at $X_{ij} = 0$, so pairs that never co-occur drop out of the sum (avoiding $\log 0$). The motivating derivation in the paper shows that ratios of co-occurrence probabilities $P_{ik}/P_{jk}$, not the probabilities themselves, encode meaningful semantic distinctions, and the loss is constructed so that differences between word vectors capture these ratios.
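
A minimal NumPy sketch of this objective, showing the weighting function and one stochastic-gradient update for a single non-zero $(i, j)$ pair. The vocabulary size, dimensionality, learning rate and plain SGD are illustrative simplifications; the reference implementation trains with AdaGrad:

```python
# Minimal sketch of the GloVe objective and one SGD step for one (i, j) pair.
import numpy as np

V, d, x_max, alpha, lr = 10_000, 50, 100.0, 0.75, 0.05  # illustrative values
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, d))  # word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_j
b, bc = np.zeros(V), np.zeros(V)         # biases b_i, b~_j

def f(x):
    # Downweights rare pairs, caps frequent ones at 1, and is 0 at x = 0.
    return min(1.0, (x / x_max) ** alpha)

def sgd_step(i, j, X_ij):
    # Residual of the model: w_i . w~_j + b_i + b~_j - log X_ij
    inner = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X_ij)
    g = 2.0 * f(X_ij) * inner            # shared gradient factor
    W[i], Wc[j] = W[i] - lr * g * Wc[j], Wc[j] - lr * g * W[i]
    b[i] -= lr * g
    bc[j] -= lr * g
    return f(X_ij) * inner**2            # this pair's contribution to J
```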

Properties and benchmarks

GloVe combines the advantages of count-based methods (efficient use of global corpus statistics) with those of prediction-based methods (smooth, generalising representations and the famous vector-arithmetic semantics, in which $\mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen})$). At publication GloVe reported state-of-the-art results on word-analogy, word-similarity (WordSim-353 and related benchmarks) and named-entity-recognition tasks, edging out word2vec on several and matching it on others. The pre-trained Stanford releases, 50- to 300-dimensional vectors trained on Wikipedia + Gigaword (6B tokens) and 300-dimensional vectors trained on Common Crawl (42B and 840B tokens), became some of the most widely downloaded resources in NLP.
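
A sketch of the analogy computation, assuming `vecs` is a dictionary mapping words to unit-normalised NumPy arrays already loaded from a pre-trained release (the helper itself is hypothetical, not part of any GloVe API):

```python
# Sketch of the analogy test over a dict of unit-normalised word vectors.
import numpy as np

def analogy(vecs, a, b, c, topn=1):
    # Words closest (by cosine) to vec(b) - vec(a) + vec(c), inputs excluded.
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    scores = {w: v @ target for w, v in vecs.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# analogy(vecs, "man", "king", "woman")  # expected: ["queen"]
```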

Status

GloVe and word2vec are largely interchangeable in practice; the choice between them usually comes down to which pre-trained vectors best match the target domain. Both have been substantially displaced by contextual embeddings (ELMo and BERT, both 2018, followed by the wider Transformer family), which produce a different vector for each occurrence of a word and so handle polysemy and syntactic role naturally. GloVe remains widely used as fast, cheap initial features in low-resource settings, as a sanity-check baseline, and as the natural choice when training or serving a contextual model is computationally infeasible.
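
The Stanford releases are plain text files, one word per line followed by its vector components, so loading them needs no special library. A minimal sketch (the file name below stands in for whichever release you downloaded):

```python
# Sketch: load a pre-trained GloVe release from its plain-text format.
# (A few tokens in the 840B file contain spaces and may need extra handling.)
import numpy as np

def load_glove(path):
    vecs = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vecs[word] = np.asarray(values, dtype=np.float32)
    return vecs

# vecs = load_glove("glove.6B.100d.txt")
```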

Related terms: Word2Vec, BERT
