The two-tower recommender is the dominant industrial architecture for large-scale candidate retrieval. It consists of two neural networks, the user tower $f_U(x_u; \theta_U)$ and the item tower $f_I(x_i; \theta_I)$, each producing an embedding in $\mathbb{R}^d$. The score for a (user, item) pair is the inner product:
$$s(u, i) = f_U(x_u)^\top f_I(x_i)$$
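A minimal sketch of the factorised scorer in PyTorch; the MLP towers, layer widths, and feature dimensions here are illustrative placeholders rather than any particular production architecture:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """A stand-in encoder: any network mapping raw features to R^d works."""
    def __init__(self, in_dim: int, d: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, d)
        )

    def forward(self, x):
        return self.net(x)

user_tower = Tower(in_dim=32)   # f_U(x_u; theta_U)
item_tower = Tower(in_dim=48)   # f_I(x_i; theta_I)

x_u = torch.randn(4, 32)        # a batch of user feature vectors
x_i = torch.randn(4, 48)        # the corresponding items

# s(u, i): inner product of the two embeddings in R^d.
scores = (user_tower(x_u) * item_tower(x_i)).sum(dim=-1)  # shape (4,)
```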
The towers are trained jointly but, crucially, never interact until the final dot product. This factorisation is what makes the architecture practical: at serving time the item tower is run offline over the catalogue, the resulting embeddings are loaded into a vector index (FAISS, ScaNN, HNSW), and at request time only the user tower is run, followed by an approximate nearest-neighbour lookup. Recommending the top-$K$ items from a billion-item catalogue takes milliseconds.
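A sketch of that offline/online split using FAISS; `IndexFlatIP` performs exact inner-product search and is used here for brevity, where a production deployment would use an approximate index (IVF, HNSW) or ScaNN at billion scale:

```python
import faiss
import numpy as np

d = 64
# Offline: run the item tower over the catalogue and index the embeddings.
item_embs = np.random.rand(100_000, d).astype("float32")  # stand-in for f_I outputs
index = faiss.IndexFlatIP(d)  # exact inner-product index; swap for IVF/HNSW at scale
index.add(item_embs)

# Online, per request: run only the user tower, then a top-K lookup.
user_emb = np.random.rand(1, d).astype("float32")  # stand-in for f_U(x_u)
scores, item_ids = index.search(user_emb, 500)     # top-500 candidate items
```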
The dominant training objective is sampled softmax. Given a positive pair $(u, i^+)$ from the log, the loss is:
$$\mathcal{L} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum_{i^- \in \mathcal{N}_u} \exp(s(u, i^-))}$$
where $\mathcal{N}_u$ is a set of sampled negatives. In-batch negatives, where the other items in the same training batch serve as negatives for each user, are common because they cost nothing extra to compute, but popular items appear as in-batch negatives far more often than rare ones, biasing the model against them; production systems mix in uniformly sampled negatives or apply the logQ correction, which subtracts each item's log sampling probability from its logit before the softmax. Contrastive losses (InfoNCE, triplet) are alternatives.
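Written out for a batch of users with $N$ explicitly sampled negatives each, the sampled-softmax loss above is just a cross-entropy whose target is the positive; the tensor shapes are the only assumptions here:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, pos_emb, neg_emb):
    """user_emb: (B, d); pos_emb: (B, d); neg_emb: (B, N, d)."""
    pos_score = (user_emb * pos_emb).sum(-1, keepdim=True)     # (B, 1)
    neg_score = torch.einsum("bd,bnd->bn", user_emb, neg_emb)  # (B, N)
    logits = torch.cat([pos_score, neg_score], dim=-1)         # (B, 1 + N)
    # The positive sits at index 0, so cross-entropy against an
    # all-zeros target is exactly -log softmax at the positive.
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)
```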
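And a sketch of the in-batch variant with the logQ correction in the style of Yi et al. 2019; `item_log_q`, each item's log sampling probability, is assumed to be estimated elsewhere (e.g. from streaming frequency counts):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_q):
    """user_emb, item_emb: (B, d) for B aligned (user, positive item) pairs;
    item_log_q: (B,) log sampling probability of each item in the batch."""
    logits = user_emb @ item_emb.T             # (B, B): user b scores every item
    # Subtract log Q(i) from item i's logits so popular items, which show up
    # as in-batch negatives far more often, are not unfairly penalised.
    logits = logits - item_log_q.unsqueeze(0)
    target = torch.arange(logits.size(0))      # positive for user b is item b
    return F.cross_entropy(logits, target)
```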
Each tower can be arbitrarily complex. The user tower typically ingests a sequence of recent interactions (videos watched, songs played, products clicked), with each interaction embedded and the sequence then summarised by an MLP, transformer, or RNN, plus user features (country, language, device, age range). The item tower ingests content features (title, genre, creator, audio embedding, thumbnail embedding) and an item-ID embedding. Because content features are present, the item tower generalises to items with no interaction history, which mitigates the cold-start problem.
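A minimal user-tower sketch with a mean-pooled interaction history and a single categorical feature; the vocabulary sizes, feature set, and pooling choice are illustrative stand-ins for the richer encoders described above:

```python
import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, n_items=100_000, n_countries=200, d=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)         # recent-interaction IDs
        self.country_emb = nn.Embedding(n_countries, 8)  # a user feature
        self.mlp = nn.Sequential(
            nn.Linear(d + 8, 128), nn.ReLU(), nn.Linear(128, d)
        )

    def forward(self, history_ids, country_id):
        # history_ids: (B, T) recent item IDs. Mean pooling is the simplest
        # aggregation; a transformer or RNN over the sequence is the usual upgrade.
        pooled = self.item_emb(history_ids).mean(dim=1)               # (B, d)
        x = torch.cat([pooled, self.country_emb(country_id)], dim=-1)
        return self.mlp(x)                                            # (B, d)
```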
Two-tower retrievers underpin the recommendation systems of YouTube (Covington, Adams, Sargin 2016 was the first published variant, with refinements in Yi et al. 2019), TikTok, Spotify, Pinterest, and almost every large e-commerce site. They are typically the first stage in a multi-stage cascade: retrieve a few thousand candidates with the two-tower model, then re-rank with a heavier cross-encoder or feature-rich gradient-boosted tree that can model user-item interactions without the dot-product factorisation constraint. The two-tower architecture is the practical realisation of the matrix-factorisation idea: a learned user--item dot product, but with deep encoders, content features, and inference-time scalability.
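The cascade itself can be sketched as a composition of the pieces above; `rerank_fn` here is a hypothetical second-stage scorer (a cross-encoder or GBDT wrapper) standing in for whatever heavy model a given system uses:

```python
import numpy as np

def recommend(user_emb, index, rerank_fn, k_retrieve=2000, k_final=20):
    """Stage 1: cheap ANN retrieval over the full catalogue.
    Stage 2: an expensive scorer that sees each (user, item) pair jointly,
    free of the dot-product factorisation constraint."""
    _, candidates = index.search(user_emb[None, :], k_retrieve)
    scores = [rerank_fn(user_emb, item_id) for item_id in candidates[0]]
    order = np.argsort(scores)[::-1]
    return candidates[0][order[:k_final]]
```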
Related terms: Matrix Factorisation, Neural Collaborative Filtering, Sequential Recommendation, Transformer
Discussed in:
- Chapter 11: CNNs, Recommender Systems