Glossary

Embedding Layer

Also known as: word embedding, token embedding

An embedding layer is a parameterised function $E: \{1, \ldots, V\} \to \mathbb{R}^d$ mapping discrete tokens (words, subwords, or other categorical IDs) to dense $d$-dimensional continuous vectors. It is implemented as a matrix $E \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size, and lookup is simply row selection:

$$E(t) = E_{t,:},$$

the $t$-th row of $E$. Equivalently, $E(t) = \text{onehot}(t)^\top E$, where $\text{onehot}(t) \in \mathbb{R}^V$ is the one-hot vector with a 1 in position $t$.
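
As a concrete illustration, here is a minimal NumPy sketch (toy values for $V$ and $d$, and 0-indexed token IDs rather than the 1-indexed convention above) showing that the lookup is plain row indexing and agrees with the one-hot formulation:

```python
# Minimal sketch (NumPy): embedding lookup as row indexing,
# and its equivalence to a one-hot matrix product.
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
E = np.random.randn(V, d)         # embedding matrix, one row per token ID

t = 7                             # a token ID in {0, ..., V-1}
lookup = E[t]                     # E(t): the t-th row of E

onehot = np.zeros(V)
onehot[t] = 1.0
via_matmul = onehot @ E           # onehot(t)^T E gives the same vector

assert np.allclose(lookup, via_matmul)
```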

Embedding layers form the input layer of every modern language model. The embedding matrix is learned end-to-end during training: gradients of the loss flow back through the embedding lookup and update only the rows used in the current batch. After training, semantically similar tokens have similar embeddings; the geometric structure made famous by word2vec carries over to the subword embeddings of modern LLMs.
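
A short sketch, assuming PyTorch's `nn.Embedding` is used for the lookup, showing that only the rows looked up in a batch receive non-zero gradient:

```python
# Sketch (PyTorch): gradients reach only the rows used in the current batch.
import torch

emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
batch = torch.tensor([2, 5, 5])        # token IDs appearing in this batch
loss = emb(batch).sum()                # any scalar loss involving the lookups
loss.backward()

used = emb.weight.grad.abs().sum(dim=1) != 0
print(used)                            # True only at indices 2 and 5
```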

Tied embeddings: in Transformers it is common to share weights between the input embedding $E$ and the output unembedding $E'$ (the matrix that maps the final hidden state to vocabulary logits). This halves the parameter count of the embedding layers and acts as a regulariser; many GPT-style models use this convention.
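
A minimal PyTorch sketch of weight tying; the module names and sizes here are illustrative, not taken from any particular model:

```python
# Sketch (PyTorch): tying the input embedding and the output unembedding.
import torch
import torch.nn as nn

V, d = 50_000, 768
embedding = nn.Embedding(V, d)          # input: token IDs -> vectors
lm_head = nn.Linear(d, V, bias=False)   # output: hidden state -> vocab logits

lm_head.weight = embedding.weight       # weight tying: one shared V x d matrix

tokens = torch.tensor([[1, 2, 3]])
hidden = embedding(tokens)              # stand-in for the Transformer body
logits = lm_head(hidden)                # shape (1, 3, V)
```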

Subword tokenisation (BPE, WordPiece, Unigram, SentencePiece) reduces vocabulary size from millions of words to tens of thousands of subword units, making embedding tables practical and giving the model some compositional handling of unseen words.
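
As a toy illustration of the idea (not any specific tokeniser), a greedy longest-match segmentation over a small hand-picked subword vocabulary splits a word that is not itself in the vocabulary into known pieces:

```python
# Toy sketch of greedy longest-match subword segmentation (WordPiece-style).
# The vocabulary below is illustrative; real tokenisers learn tens of
# thousands of subword units from data.
def segment(word: str, vocab: set[str]) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # take the longest vocabulary entry matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")      # no subword matches this character
            i += 1
    return pieces

vocab = {"un", "happi", "ness", "token", "ise"}
print(segment("unhappiness", vocab))    # ['un', 'happi', 'ness']
```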

Positional encoding is conceptually a separate embedding for token positions $1, \ldots, T$. Learned positional embeddings use the same lookup-table mechanism as token embeddings; many modern models instead use RoPE (rotary position embedding) or ALiBi, which inject position information directly into the attention computation rather than through a lookup table.
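
A short PyTorch sketch of the learned-positional-embedding variant, with illustrative sizes; RoPE and ALiBi would instead modify the attention computation and are not shown here:

```python
# Sketch (PyTorch): a learned positional embedding is a second lookup table,
# indexed by position and added to the token embedding.
import torch
import torch.nn as nn

V, d, max_len = 50_000, 768, 2048
tok_emb = nn.Embedding(V, d)
pos_emb = nn.Embedding(max_len, d)       # one row per position 0..max_len-1

tokens = torch.tensor([[17, 3, 9, 42]])  # (batch=1, T=4)
positions = torch.arange(tokens.size(1)).unsqueeze(0)
x = tok_emb(tokens) + pos_emb(positions) # input to the first Transformer block
```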

For very large vocabularies, the embedding matrix can dominate the parameter count. Factorised embeddings decompose $E = E_1 E_2$ with $E_1 \in \mathbb{R}^{V \times r}$ and $E_2 \in \mathbb{R}^{r \times d}$ for $r < d$ (the ALBERT trick). Embedding-layer compression remains an active research area for memory-constrained deployment.
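
A quick back-of-the-envelope sketch of the parameter saving from factorisation, with illustrative values for $V$, $d$, and $r$:

```python
# Sketch: parameter count of a full vs factorised (ALBERT-style) embedding.
V, d, r = 50_000, 4096, 128

full = V * d                  # single V x d embedding matrix
factorised = V * r + r * d    # E1 (V x r) followed by E2 (r x d)

print(f"full:       {full:,} parameters")        # 204,800,000
print(f"factorised: {factorised:,} parameters")  # 6,924,288
```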

Related terms: Word2Vec, Transformer, RoPE, Byte-Pair Encoding

