Glossary

Sequential Recommendation

Sequential recommendation treats each user's history of interactions $(i_1, i_2, \ldots, i_t)$ as an ordered sequence and predicts the next item $i_{t+1}$, in direct analogy with language modelling. Order matters: a user who has just watched three episodes of a series wants the fourth, not the first; a user who just bought a camera is in the market for lenses, not another camera.

The two canonical Transformer-based models are SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019). Both embed item IDs and positions and run the sequence through a stack of self-attention layers, but they differ in training objective.

SASRec uses a causal (left-to-right) Transformer. Each item attends only to previous items, and the model is trained autoregressively to predict the next item. The score for candidate item $j$ at position $t+1$ is the dot product of the final hidden state $h_t$ with the item embedding $e_j$:

$$P(i_{t+1} = j \mid i_{1:t}) = \frac{\exp(h_t^\top e_j)}{\sum_{k} \exp(h_t^\top e_k)}$$

In practice the full softmax over the item catalogue is too expensive to compute at every step, so it is approximated with a sampled softmax or a BPR-style pairwise loss against sampled negatives.
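
A minimal sketch of this causal setup, assuming PyTorch; the class name `CausalSeqRec` and all hyperparameters (`num_items`, `d`, `max_len`, the toy batch) are illustrative choices, not the paper's configuration:

```python
# Minimal SASRec-style sketch (PyTorch assumed). Names and sizes are
# illustrative, not the original paper's configuration.
import torch
import torch.nn as nn

class CausalSeqRec(nn.Module):
    def __init__(self, num_items: int, d: int = 64, max_len: int = 50,
                 n_heads: int = 2, n_layers: int = 2):
        super().__init__()
        # Item 0 is reserved for padding.
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, t) of item IDs, 0 = padding.
        t = seq.size(1)
        pos = torch.arange(t, device=seq.device)
        x = self.item_emb(seq) + self.pos_emb(pos)
        # Causal mask: position i may attend only to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(seq.device)
        return self.encoder(x, mask=mask)        # (batch, t, d)

    def score(self, h_t: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        # Dot product of hidden state with candidate item embeddings.
        return (h_t.unsqueeze(1) * self.item_emb(items)).sum(-1)

# BPR-style pairwise loss against one sampled negative per sequence:
model = CausalSeqRec(num_items=1000)
seq = torch.randint(1, 1001, (8, 50))             # toy batch of histories
h_last = model(seq)[:, -1, :]                     # state after the last item
pos_items = torch.randint(1, 1001, (8, 1))        # observed next items (toy)
neg_items = torch.randint(1, 1001, (8, 1))        # sampled negatives
pos_s, neg_s = model.score(h_last, pos_items), model.score(h_last, neg_items)
loss = -torch.log(torch.sigmoid(pos_s - neg_s)).mean()
```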

BERT4Rec uses a bidirectional Transformer trained with masked language modelling: random items in the sequence are replaced with a mask token, and the model predicts the originals from both left and right context. At inference, a mask token is appended to the sequence and the model fills it. Bidirectional context captures patterns that pure left-to-right models miss, such as a user oscillating between two genres, at the cost of this mask-and-fill inference protocol in place of SASRec's direct next-step scoring.
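
A sketch of the masking protocol alone, with the encoder itself omitted; `MASK_ID`, `mask_prob`, and both helper functions are assumptions for illustration, not the paper's exact values:

```python
# Sketch of the BERT4Rec masking protocol (encoder omitted). MASK_ID and
# mask_prob are illustrative assumptions.
import torch

MASK_ID = 0          # reserved mask-token ID (assumption)
mask_prob = 0.15     # fraction of positions masked during training (assumption)

def mask_for_training(seq: torch.Tensor):
    # Replace random items with the mask token; the model must reconstruct
    # them from both left and right context.
    is_masked = torch.rand_like(seq, dtype=torch.float) < mask_prob
    inputs = seq.masked_fill(is_masked, MASK_ID)
    labels = seq.masked_fill(~is_masked, -100)   # ignore unmasked positions
    return inputs, labels

def prepare_for_inference(seq: torch.Tensor):
    # Append a mask token; the model's prediction at this final position
    # is the next-item distribution.
    mask_col = torch.full((seq.size(0), 1), MASK_ID, dtype=seq.dtype)
    return torch.cat([seq, mask_col], dim=1)
```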

Both models use self-attention to weight past interactions, with a learned attention pattern that captures short-range bursts (binge-watching) and long-range preferences (favourite director) without the gradient pathologies of RNNs. Earlier sequential recommenders used GRUs, notably GRU4Rec (Hidasi et al., 2016), and Markov-chain models predate the deep-learning era. The Transformer variants typically gain 5--15% in next-item hit rate over RNN baselines on standard datasets (MovieLens, Amazon, Steam).
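
As a toy illustration of that weighting, a single scaled-dot-product attention step over one user's history, where all tensors are random placeholders standing in for learned hidden states:

```python
# Illustration only: how one attention step weights past interactions.
# The query comes from the most recent position; the resulting weights can
# emphasise both the immediately preceding items and distant favourites.
import torch
import torch.nn.functional as F

d = 64
history = torch.randn(10, d)             # toy hidden states for 10 past items
query = history[-1]                      # state at the current position
weights = F.softmax(history @ query / d ** 0.5, dim=0)   # (10,) sums to 1
context = weights @ history              # attention-weighted summary of history
```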

The connection to the rest of the recommender stack is that the user-side encoder of a two-tower retriever is increasingly a sequential model: the user is represented not by a static embedding but by the output of a Transformer over recent interactions, as sketched below. This is the pattern adopted by YouTube, TikTok, and Pinterest, and it is the reason "the algorithm" feels so reactive to recent behaviour. An emerging direction pushes further, towards generative retrieval: a language-model-shaped recommender directly generates item IDs as tokens, conditioned on history.
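
A sketch of that two-tower pattern, assuming PyTorch; the tower names, dimensions, and last-position pooling are illustrative choices, with positional embeddings and the causal mask omitted for brevity:

```python
# Sketch of a two-tower retriever whose user tower is a sequential encoder.
# All names and dimensions are assumptions; positional embeddings and the
# causal mask are omitted to keep the example short.
import torch
import torch.nn as nn

d, num_items = 64, 1000

item_tower = nn.Embedding(num_items + 1, d, padding_idx=0)
layer = nn.TransformerEncoderLayer(d, nhead=2, batch_first=True)
user_tower = nn.TransformerEncoder(layer, num_layers=2)

history = torch.randint(1, num_items + 1, (8, 20))    # recent interactions
user_vec = user_tower(item_tower(history))[:, -1, :]  # dynamic user embedding
scores = user_vec @ item_tower.weight.T               # score every item
top_items = scores.topk(10, dim=-1).indices           # retrieve top-k
```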

Related terms: Matrix Factorisation, Two-Tower Recommender, Neural Collaborative Filtering, Transformer, Attention Mechanism
