User tower and item tower learn embeddings; relevance is the dot product.
From the chapter: Chapter 17, Applications
Glossary: recommendation system, two tower
Transcript
A recommendation problem. Billions of users, millions of items. Predict whether a user will engage with an item.
The two-tower architecture. Two neural networks, one per side.
The user tower. Inputs: the user's history, profile features, demographics. Output: a user embedding vector, often 128 to 512 dimensions.
The item tower. Inputs: the item's content features, category, recent engagement. Output: an item embedding vector, the same dimension as the user embedding.
Score user-item compatibility by the dot product of their embeddings. High dot product means relevant.
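The two towers and the dot-product score can be sketched in a few lines. This is a minimal illustration, not a real model: each tower is a single random linear layer standing in for a trained neural network, and the feature sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128  # within the 128-to-512 range mentioned above

# Hypothetical tiny "towers": one linear layer each, standing in
# for the real neural networks that consume user/item features.
W_user = rng.normal(size=(EMB_DIM, 32))  # user features -> embedding
W_item = rng.normal(size=(EMB_DIM, 16))  # item features -> embedding

def user_tower(user_features: np.ndarray) -> np.ndarray:
    return W_user @ user_features

def item_tower(item_features: np.ndarray) -> np.ndarray:
    return W_item @ item_features

user_emb = user_tower(rng.normal(size=32))
item_emb = item_tower(rng.normal(size=16))

# Relevance is the dot product: higher means more compatible.
score = float(user_emb @ item_emb)
```

The only structural requirement is that both towers emit vectors of the same dimension, so the dot product is defined.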
Training. Sample a positive user-item pair, say a video the user watched. Sample many negatives, items the user did not engage with. Train so the positive's score exceeds the negatives'. Cross-entropy loss over the sampled candidates is the workhorse.
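The loss for one training example can be written out directly: score the positive and the sampled negatives, then take cross-entropy with the positive as the correct class. A numpy sketch with random embeddings standing in for tower outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64

# One positive item and K sampled negatives for a single user.
user_emb = rng.normal(size=EMB_DIM)
pos_item = rng.normal(size=EMB_DIM)
neg_items = rng.normal(size=(8, EMB_DIM))  # K = 8 sampled negatives

# Dot-product scores: positive first, then the negatives.
scores = np.concatenate(([user_emb @ pos_item], neg_items @ user_emb))

# Cross-entropy over the 1 + K candidates, positive at index 0:
# loss = -log softmax(scores)[0]
logits = scores - scores.max()  # subtract max for numerical stability
log_softmax = logits - np.log(np.exp(logits).sum())
loss = -log_softmax[0]
```

Minimizing this loss pushes the positive's score above the negatives', which is exactly the ranking behavior the retrieval step needs.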
Once trained, item embeddings are precomputed for the entire catalogue, indexed by an approximate nearest-neighbour structure like FAISS or HNSW.
At serving time. Compute the user embedding from current context. Query the index for the top-k items by dot product or cosine similarity. Retrieve in milliseconds, even from a catalogue of billions.
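The serving-time lookup reduces to a top-k search over the precomputed item matrix. A toy exact-search sketch, with random embeddings in place of trained ones; at real catalogue scale this brute-force scan is what FAISS or HNSW replaces with an approximate index:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, CATALOGUE = 64, 10_000

# Precomputed item embeddings for the whole catalogue (built offline).
item_embs = rng.normal(size=(CATALOGUE, EMB_DIM)).astype(np.float32)

def top_k(user_emb: np.ndarray, k: int = 10) -> np.ndarray:
    # Exact dot-product search; production systems swap this for an
    # approximate nearest-neighbour index at billion-item scale.
    scores = item_embs @ user_emb
    idx = np.argpartition(-scores, k)[:k]      # unordered top k
    return idx[np.argsort(-scores[idx])]       # item ids, best first

query = rng.normal(size=EMB_DIM).astype(np.float32)
candidates = top_k(query, k=10)
```

`argpartition` keeps the scan linear in catalogue size; only the final k candidates get fully sorted.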
A re-ranker, often a separate cross-encoder model, then sorts the top hundred candidates with finer features.
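The re-rank stage scores each retrieved candidate jointly with the user, rather than via a dot product of independent embeddings. A hypothetical stand-in scorer makes the shape of the pipeline concrete; the `cross_encoder_score` function here is a placeholder, not a real model:

```python
import numpy as np

rng = np.random.default_rng(2)

def cross_encoder_score(user_feats: np.ndarray,
                        item_feats: np.ndarray) -> float:
    # Placeholder for a cross-encoder: in practice a model that sees
    # user and item features together and uses finer signals.
    return float(np.tanh(user_feats @ item_feats))

user_feats = rng.normal(size=16)
candidate_feats = rng.normal(size=(100, 16))  # ~top hundred retrieved

# Score every candidate pairwise, then keep the best ten.
scores = [cross_encoder_score(user_feats, f) for f in candidate_feats]
reranked = np.argsort(scores)[::-1][:10]
```

This pairwise scoring is why the re-ranker only sees hundreds of candidates: it cannot be precomputed per item, so it is too expensive to run over the full catalogue.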
This two-tower retrieve-then-rerank pattern powers YouTube, TikTok, Spotify, Pinterest, and Amazon. Retrieve from billions; rerank hundreds; show the top ten.