CS25 · Stanford University · 2023
Transformers United
with Steven Feng, Div Garg, Emily Bunnapradist
Your progress in this browser
Lectures · 0 / 12 watched
Quiz · 0 / 8 correct
Progress is stored in this browser only — there is no account, no login, and no database. Clearing your browser data will reset it.
About the course
CS25: Transformers United is a Stanford seminar that invites researchers and practitioners from across industry and academia to give one-hour talks on how the transformer architecture has reshaped their corner of AI. The seminar started in 2021 — three years after BERT and a year after GPT-3 made it clear that one architecture would dominate the next decade of progress — and the speaker list has tracked the field's most-watched results ever since: Andrej Karpathy on the original architecture, OpenAI researchers on InstructGPT and RLHF, Anthropic on Claude, DeepMind on protein structure, the Sora team on diffusion-transformer hybrids, and so on.
The course is unusual in that it does not try to teach the maths from first principles. It assumes you already know what attention is, what an embedding is, and what a loss function does, and it spends its time instead on how the architecture gets applied in practice, what does and does not transfer between problem domains, and what the open questions are. Treat it as the seminar that sits on top of a more formal course like CS229 or CS336 — the place where you hear what the researchers themselves think is interesting.
We've curated the playlist below and added study notes for each lecture, a final quiz, and a progress tracker that records what you have watched and answered, in this browser only. Watch in any order; the lectures are independent.
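As a quick self-check on those prerequisites, here is a minimal single-head self-attention layer in NumPy. This is our illustrative sketch, not material from any lecture; the dimensions and the random weights are arbitrary.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention. X: (N, d_model) token embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # project each token to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (N, N): similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: each row sums to 1
    return weights @ V                             # each output is a weighted mix of values

rng = np.random.default_rng(0)
N, d_model, d_k = 5, 8, 4                          # 5 tokens, toy dimensions
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                   # (5, 4): one attended vector per token
```

If the row-wise softmax, the $\sqrt{d_k}$ scaling, and the final weighted sum of values all look familiar, you're ready for the talks.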
Watch the lectures
Syllabus
Tick lectures as you finish them. Your ticks live in this browser only.
- Andrej Karpathy: The original 2017 transformer architecture — tokens, embeddings, self-attention, the residual stream, and why a single architecture works for text, images, code, and protein sequences.
- Mark Chen (OpenAI): From GPT-2 to InstructGPT — autoregressive language modelling, the role of scale, and what fine-tuning and RLHF actually change about the model's behaviour.
- Aditya Grover: Reinforcement learning as sequence modelling — reframing offline RL as a conditional sequence-prediction problem and what is gained by doing so.
- Lucas Beyer (Google Brain): ViT, image patches as tokens, the scaling properties of vision transformers compared with CNNs, and where the two families converge. (A minimal patch-tokenisation sketch follows this syllabus.)
- John Jumper (DeepMind): How the Evoformer block uses attention over the multiple sequence alignment to predict protein structure, and what AlphaFold's training objective looks like.
- Alec Radford (OpenAI): Whisper — an encoder-decoder transformer trained on 680,000 hours of multilingual web audio, and why a single model handles many languages and tasks.
- Jared Kaplan (Anthropic): Scaling laws for language models, the in-context learning phenomenon, what 'emergent ability' means, and why the term is contested.
- Barret Zoph: Sparsely-activated mixture-of-experts transformers — Switch, GShard, expert routing, and the engineering of training at trillion-parameter scale.
- Tri Dao: FlashAttention, the I/O-aware view of attention kernels, why context length is the bottleneck for many applications, and where the field is heading next.
- Chelsea Finn (Stanford): RT-1 and RT-2 — vision-language-action models that issue motor commands, how they generalise across tasks, and the limits of imitation learning.
- Jean-Baptiste Alayrac (DeepMind): Flamingo — interleaving a frozen language model with a vision encoder via perceiver resamplers, and few-shot prompting with images.
- Chris Olah (Anthropic): Induction heads, the residual stream, circuits, and the case for treating transformers as objects to be reverse-engineered rather than benchmarked.
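As flagged in the Lucas Beyer entry above, here is a minimal sketch of how a vision transformer turns an image into a token sequence. This is our illustration under simple assumptions (square image, non-overlapping patches); the 224×224 resolution and 16-pixel patches match the original ViT paper, while the 64-dimensional embedding is arbitrary.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    grid = image.reshape(H // patch_size, patch_size,
                         W // patch_size, patch_size, C)
    grid = grid.transpose(0, 2, 1, 3, 4)           # bring the two grid axes together
    return grid.reshape(-1, patch_size * patch_size * C)

image = np.zeros((224, 224, 3))            # standard ViT input resolution
tokens = patchify(image)                   # (196, 768): a 14x14 grid of flattened patches
W_embed = np.zeros((tokens.shape[1], 64))  # a learned linear projection in a real ViT
print((tokens @ W_embed).shape)            # (196, 64): a token sequence ready for attention
```

From here the patches are treated exactly like word tokens: add position embeddings, prepend a class token, and feed the sequence to a standard transformer encoder.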
Self-assessment
A short multiple-choice quiz. Click an option to commit; the correct answer and an explanation appear. Your answers are remembered in this browser.
- Question 1. In a transformer, the self-attention weight $A_{ij}$ that token $i$ assigns to token $j$ is essentially proportional to:
- Question 2. What is the residual stream in a transformer?
- Question 3. Why do a transformer's compute and memory scale as $O(N^2)$ in the context length $N$?
- Question 4. Andrej Karpathy describes the transformer as a 'general-purpose differentiable computer'. The main reason this is a useful framing is that:
- Question 5. InstructGPT and the RLHF pipeline mainly change which property of the pretrained language model?
- Question 6. Vision transformers process images by:
- Question 7. AlphaFold's Evoformer block applies attention over:
- Question 8. Mechanistic interpretability work has identified 'induction heads' in transformers. These are heads that:
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).