Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, learns dense vector representations of words—word embeddings—from large text corpora. It offers two training objectives: Continuous Bag of Words (CBOW), which predicts a target word from its surrounding context, and Skip-gram, which predicts the surrounding context from a target word. Both train a shallow neural network whose hidden-layer weights become the embedding vectors, typically 100 to 300 dimensions.
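The difference between the two objectives is easiest to see in how they slice a sentence into training examples. The sketch below (a toy illustration, assuming a window size of 2 and no subsampling; `training_pairs` is a hypothetical helper, not part of any Word2Vec library) generates both kinds of examples from one tokenised sentence:

```python
def training_pairs(tokens, window=2):
    """Return (CBOW, Skip-gram) example lists for one tokenised sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # context = words within `window` positions of the target
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))                  # context words -> target
        skipgram.extend((target, c) for c in context)   # target -> each context word
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat", "on", "the", "mat"])
```

CBOW averages the context vectors into a single prediction per target, so it trains faster; Skip-gram produces one example per (target, context) pair, which tends to work better for rare words.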
The remarkable discovery was that the learned vectors exhibited regular algebraic structure. The most famous example is the analogy vec("king") − vec("man") + vec("woman") ≈ vec("queen"), showing that semantic relationships become geometric directions in the embedding space. Similar regularities held for verb tenses, capital cities, and part-of-speech relationships. This demonstrated that distributional semantics—the idea that words appearing in similar contexts have similar meanings—could be captured compactly by vectors.
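The analogy arithmetic can be demonstrated with hand-crafted toy vectors (an illustration only: here axis 0 loosely encodes gender and axis 1 royalty, whereas real Word2Vec vectors are learned and 100 to 300 dimensional):

```python
import numpy as np

# Hand-crafted 2-D toy embeddings, not learned vectors.
vocab = {
    "man":    np.array([ 1.0, 0.0]),
    "woman":  np.array([-1.0, 0.0]),
    "king":   np.array([ 1.0, 1.0]),
    "queen":  np.array([-1.0, 1.0]),
    "throne": np.array([ 0.0, 1.0]),
}

def nearest(vec, exclude):
    """Vocabulary word with the highest cosine similarity to `vec`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], vec))

result = nearest(vocab["king"] - vocab["man"] + vocab["woman"],
                 exclude={"king", "man", "woman"})
# result == "queen"
```

Excluding the query words is standard practice when evaluating analogies, since the nearest neighbour of the offset vector is otherwise often one of the inputs themselves.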
Word2Vec is typically trained with negative sampling, an efficient approximation that, at each step, scores a few sampled "negative" words instead of computing a softmax over the entire vocabulary. GloVe (Pennington et al., 2014) offered an alternative approach by factorising a global word-word co-occurrence matrix. Static word embeddings like Word2Vec and GloVe have largely been supplanted by contextual embeddings from models like ELMo, BERT, and GPT, which assign different vectors to the same word in different contexts. But Word2Vec remains historically significant and pedagogically valuable for understanding the geometry of meaning in vector space.
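A single negative-sampling update can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy dimensions (vocabulary of 10, 8-dimensional embeddings, 3 negatives), not the reference implementation: for each positive (target, context) pair the objective pushes the dot product of their vectors toward a high sigmoid score, and pushes the sampled negatives toward a low one.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 8                          # toy vocabulary size and embedding dim
W_in  = rng.normal(0, 0.1, (V, D))    # target ("input") embeddings
W_out = rng.normal(0, 0.1, (V, D))    # context ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, negatives, lr=0.1):
    """One SGD step on the negative-sampling objective:
    maximise log s(v_c . v_t) + sum over negatives of log s(-v_n . v_t)."""
    v_t = W_in[target].copy()
    grad_t = np.zeros(D)
    loss = 0.0
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        s = sigmoid(W_out[word] @ v_t)
        loss += -np.log(s if label else 1.0 - s)
        g = s - label                 # gradient of the loss w.r.t. the dot product
        grad_t += g * W_out[word]
        W_out[word] -= lr * g * v_t
    W_in[target] -= lr * grad_t
    return loss
```

Each call costs O(k·D) for k negatives, rather than O(V·D) for a full softmax, which is what made training on billion-word corpora practical.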
Discussed in:
- Chapter 12: Sequence Models — Word Embeddings
Also defined in: Textbook of AI