User tower and item tower learn embeddings; relevance is the dot product.
From the chapter: Chapter 17, Applications
Glossary: recommendation system, two tower
Transcript
A recommendation problem. Billions of users, millions of items. Predict whether a user will engage with an item.
The two-tower architecture. Two neural networks, one per side.
The user tower. Inputs: the user's history, profile features, demographics. Output: a user embedding vector, often 128 to 512 dimensions.
The item tower. Inputs: the item's content features, category, recent engagement. Output: an item embedding vector, the same dimension as the user embedding.
Score user-item compatibility by the dot product of their embeddings. High dot product means relevant.
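The two towers and the dot-product score can be sketched in a few lines. This is a minimal illustration, not a real model: each tower is a single random linear layer standing in for a trained neural network, and the feature sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128  # within the 128-to-512 range mentioned above

# Hypothetical tiny "towers": one linear layer each, standing in
# for the real neural networks that consume user/item features.
W_user = rng.normal(size=(EMB_DIM, 32))  # user features -> embedding
W_item = rng.normal(size=(EMB_DIM, 16))  # item features -> embedding

def user_tower(user_features: np.ndarray) -> np.ndarray:
    return W_user @ user_features

def item_tower(item_features: np.ndarray) -> np.ndarray:
    return W_item @ item_features

user_emb = user_tower(rng.normal(size=32))
item_emb = item_tower(rng.normal(size=16))

# Relevance is the dot product: higher means more compatible.
score = float(user_emb @ item_emb)
```

The only structural requirement is that both towers emit vectors of the same dimension, so the dot product is defined.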
Training. Sample a positive user-item pair, say a video the user watched. Sample many negatives, items the user did not engage with. Train so the positive's score exceeds the negatives'. Cross-entropy loss over the sampled candidates is the workhorse.
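The loss for one training example can be written out directly: score the positive and the sampled negatives, then take cross-entropy with the positive as the correct class. A numpy sketch with random embeddings standing in for tower outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64

# One positive item and K sampled negatives for a single user.
user_emb = rng.normal(size=EMB_DIM)
pos_item = rng.normal(size=EMB_DIM)
neg_items = rng.normal(size=(8, EMB_DIM))  # K = 8 sampled negatives

# Dot-product scores: positive first, then the negatives.
scores = np.concatenate(([user_emb @ pos_item], neg_items @ user_emb))

# Cross-entropy over the 1 + K candidates, positive at index 0:
# loss = -log softmax(scores)[0]
logits = scores - scores.max()  # subtract max for numerical stability
log_softmax = logits - np.log(np.exp(logits).sum())
loss = -log_softmax[0]
```

Minimizing this loss pushes the positive's score above the negatives', which is exactly the ranking behavior the retrieval step needs.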
Once trained, item embeddings are precomputed for the entire catalogue, indexed by an approximate nearest-neighbour structure like FAISS or HNSW.
At serving time. Compute the user embedding from current context. Query the index for the top-k items by dot product or cosine similarity. Retrieve in milliseconds, even from a catalogue of billions.
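The serving-time lookup reduces to a top-k search over the precomputed item matrix. A toy exact-search sketch, with random embeddings in place of trained ones; at real catalogue scale this brute-force scan is what FAISS or HNSW replaces with an approximate index:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, CATALOGUE = 64, 10_000

# Precomputed item embeddings for the whole catalogue (built offline).
item_embs = rng.normal(size=(CATALOGUE, EMB_DIM)).astype(np.float32)

def top_k(user_emb: np.ndarray, k: int = 10) -> np.ndarray:
    # Exact dot-product search; production systems swap this for an
    # approximate nearest-neighbour index at billion-item scale.
    scores = item_embs @ user_emb
    idx = np.argpartition(-scores, k)[:k]      # unordered top k
    return idx[np.argsort(-scores[idx])]       # item ids, best first

query = rng.normal(size=EMB_DIM).astype(np.float32)
candidates = top_k(query, k=10)
```

`argpartition` keeps the scan linear in catalogue size; only the final k candidates get fully sorted.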
A re-ranker, often a separate cross-encoder model, then sorts the top hundred candidates with finer features.
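The re-rank stage scores each retrieved candidate jointly with the user, rather than via a dot product of independent embeddings. A hypothetical stand-in scorer makes the shape of the pipeline concrete; the `cross_encoder_score` function here is a placeholder, not a real model:

```python
import numpy as np

rng = np.random.default_rng(2)

def cross_encoder_score(user_feats: np.ndarray,
                        item_feats: np.ndarray) -> float:
    # Placeholder for a cross-encoder: in practice a model that sees
    # user and item features together and uses finer signals.
    return float(np.tanh(user_feats @ item_feats))

user_feats = rng.normal(size=16)
candidate_feats = rng.normal(size=(100, 16))  # ~top hundred retrieved

# Score every candidate pairwise, then keep the best ten.
scores = [cross_encoder_score(user_feats, f) for f in candidate_feats]
reranked = np.argsort(scores)[::-1][:10]
```

This pairwise scoring is why the re-ranker only sees hundreds of candidates: it cannot be precomputed per item, so it is too expensive to run over the full catalogue.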
This two-tower retrieve-then-rerank pattern powers YouTube, TikTok, Spotify, Pinterest, and Amazon. Retrieve from billions; rerank hundreds; show the top ten.