Glossary

Matrix Factorisation

Matrix factorisation is the foundational technique of modern collaborative filtering. Given a sparse matrix $R \in \mathbb{R}^{|U| \times |I|}$ of observed ratings (or implicit interactions) between users and items, the goal is to learn user embeddings $P \in \mathbb{R}^{|U| \times k}$ and item embeddings $Q \in \mathbb{R}^{|I| \times k}$ such that:

$$\hat r_{ui} = q_i^\top p_u$$

approximates the true rating $r_{ui}$ for any user $u$ and item $i$. The dimensionality $k$ (typically 16–256) is far smaller than $|U|$ or $|I|$, so the factorisation discovers a low-rank structure in the rating matrix: latent dimensions that might correspond to genres, themes, or stylistic axes, but are learned rather than hand-coded.
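In numpy the prediction step is a single dot product. A minimal sketch (the sizes and the random initial embeddings are purely illustrative; in practice $P$ and $Q$ come from training):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 7, 3  # toy sizes, purely illustrative

P = rng.normal(scale=0.1, size=(n_users, k))  # user embeddings, |U| x k
Q = rng.normal(scale=0.1, size=(n_items, k))  # item embeddings, |I| x k

u, i = 2, 4
r_hat = Q[i] @ P[u]   # q_i^T p_u: predicted rating for one (user, item) pair
R_hat = P @ Q.T       # the full low-rank reconstruction, |U| x |I|
```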

The idea entered the mainstream during the Netflix Prize (2006–2009). Simon Funk's 2006 blog post "Netflix Update: Try This at Home" introduced Funk SVD, an SGD-based factorisation that beat far more elaborate methods, and Yehuda Koren's papers (2008, 2009) extended it with biases, temporal dynamics, and implicit feedback. The full Koren model is:

$$\hat r_{ui} = \mu + b_u + b_i + q_i^\top p_u$$

where $\mu$ is the global average rating, $b_u$ is a per-user bias (some users rate higher than others) and $b_i$ is a per-item bias (some films are widely loved). The training objective is a regularised squared error on observed ratings $\mathcal{K} = \{(u, i) : r_{ui} \text{ observed}\}$:

$$\mathcal{L} = \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - \hat r_{ui}\right)^2 + \lambda \left(\|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2\right)$$
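A sketch of this objective as it would be computed in numpy; the function names and the `(u, i, r)` triple format are illustrative, not taken from any particular library:

```python
import numpy as np

def koren_predict(mu, b_u, b_i, P, Q, u, i):
    """mu + b_u + b_i + q_i^T p_u for one (user, item) pair."""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def regularised_loss(ratings, mu, b_u, b_i, P, Q, lam):
    """Squared error plus L2 penalty over the observed set K.

    `ratings` is an iterable of (u, i, r) triples -- the observed ratings.
    """
    loss = 0.0
    for u, i, r in ratings:
        e = r - koren_predict(mu, b_u, b_i, P, Q, u, i)
        loss += e**2 + lam * (P[u] @ P[u] + Q[i] @ Q[i]
                              + b_u[u]**2 + b_i[i]**2)
    return loss
```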

Two optimisers dominate. Stochastic gradient descent picks one observed rating at a time and applies the gradient updates:

$$p_u \leftarrow p_u + \eta\!\left(e_{ui} q_i - \lambda p_u\right), \qquad q_i \leftarrow q_i + \eta\!\left(e_{ui} p_u - \lambda q_i\right)$$

where $e_{ui} = r_{ui} - \hat r_{ui}$. Alternating least squares (ALS) instead fixes $Q$ and solves a closed-form ridge regression for each $p_u$, then fixes $P$ and does the same for each $q_i$. ALS parallelises trivially across users (then across items) and is the default in Apache Spark MLlib.
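Both optimisers fit in a few lines of numpy. The sketch below is illustrative rather than production code: `sgd_epoch` implements the updates above (plus the analogous bias updates from Koren's model), and `als_user_step` shows the closed-form ridge solve for the user factors with $Q$ held fixed; a real ALS implementation would use sparse matrices and parallelise the per-user solves:

```python
import numpy as np

def sgd_epoch(ratings, P, Q, b_u, b_i, mu, eta=0.005, lam=0.02):
    """One SGD pass over observed (u, i, r) triples, updating in place."""
    for u, i, r in ratings:
        e = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])
        b_u[u] += eta * (e - lam * b_u[u])
        b_i[i] += eta * (e - lam * b_i[i])
        p_u_old = P[u].copy()                 # use the old p_u in q_i's update
        P[u] += eta * (e * Q[i] - lam * P[u])
        Q[i] += eta * (e * p_u_old - lam * Q[i])

def als_user_step(R, Q, lam):
    """Closed-form ridge regression for every user row, holding Q fixed.

    R is a dense |U| x |I| matrix with np.nan marking unobserved entries.
    """
    k = Q.shape[1]
    P = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):
        observed = ~np.isnan(R[u])
        Qu, ru = Q[observed], R[u, observed]
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(k), Qu.T @ ru)
    return P
```

The item step is symmetric: fix $P$, transpose $R$, and solve the same ridge problem for each $q_i$.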

For implicit feedback (clicks, plays, watches), explicit ratings are unavailable: the matrix instead records binary interactions, and the model is trained either with confidence-weighted least squares (Hu, Koren & Volinsky 2008) or with Bayesian personalised ranking (Rendle et al. 2009), which maximises the probability that an observed item is ranked above an unobserved one. These implicit-feedback variants underlie most production recommenders.
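A hedged sketch of a single BPR update in numpy: the triple `(u, i, j)` pairs a user with one observed item `i` and one sampled unobserved item `j`, and the step nudges the embeddings so that `i` scores above `j`. The function name, hyperparameters, and sampling strategy are illustrative, not from the original paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_step(P, Q, u, i, j, eta=0.01, lam=0.01):
    """One stochastic BPR-Opt ascent step on a sampled (u, i, j) triple."""
    x_uij = P[u] @ Q[i] - P[u] @ Q[j]   # preference margin for i over j
    s = sigmoid(-x_uij)                  # = 1 - sigma(x_uij)
    p_u_old = P[u].copy()
    P[u] += eta * (s * (Q[i] - Q[j]) - lam * P[u])
    Q[i] += eta * (s * p_u_old - lam * Q[i])
    Q[j] += eta * (-s * p_u_old - lam * Q[j])
```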

Matrix factorisation has been overtaken in raw accuracy by neural collaborative filtering, two-tower retrievers, and sequential models, but it remains the conceptual core: every modern recommender still predicts via a learned dot product (or a close relative) between user and item embeddings. It is the reason "embedding" became the dominant vocabulary of industrial machine learning.

Related terms: Neural Collaborative Filtering, Two-Tower Recommender, Sequential Recommendation
