CS25 · Stanford University · 2023
Transformers United
with Steven Feng, Div Garg, Emily Bunnapradist
Your progress in this browser
Lectures · 0 / 12 watched
Quiz · 0 / 8 correct
Progress is stored in this browser only — there is no account, no login, and no database. Clearing your browser data will reset it.
About the course
CS25: Transformers United is a Stanford seminar that invites researchers and practitioners from across industry and academia to give one-hour talks on how the transformer architecture has reshaped their corner of AI. The seminar started in 2021 — three years after BERT and a year after GPT-3 made it clear that one architecture would dominate the next decade of progress — and the speaker list has tracked the field's most-watched results ever since: Andrej Karpathy on the original architecture, OpenAI researchers on InstructGPT and RLHF, Anthropic on Claude, DeepMind on protein structure, the Sora team on diffusion-transformer hybrids, and so on.
The course is unusual in that it does not try to teach the maths from first principles. It assumes you already know what attention is, what an embedding is, and what a loss function does, and it spends its time instead on how the architecture gets applied in practice, what does and does not transfer between problem domains, and what the open questions are. Treat it as the seminar that sits on top of a more formal course like CS229 or CS336 — the place where you hear what the researchers themselves think is interesting.
We've curated the playlist below and added study notes for each lecture, a final quiz, and a progress tracker that records what you have watched and answered, in this browser only. Watch in any order; the lectures are independent.
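As a quick self-check on those prerequisites, here is a minimal single-head self-attention layer in NumPy. This is our illustrative sketch, not material from any lecture; the dimensions and the random weights are arbitrary.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention. X: (N, d_model) token embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # project each token to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (N, N): similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: each row sums to 1
    return weights @ V                             # each output is a weighted mix of values

rng = np.random.default_rng(0)
N, d_model, d_k = 5, 8, 4                          # 5 tokens, toy dimensions
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                   # (5, 4): one attended vector per token
```

If the row-wise softmax, the $\sqrt{d_k}$ scaling, and the final weighted sum of values all look familiar, you're ready for the talks.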
Watch the lectures
Syllabus
Tick lectures as you finish them. Your ticks live in this browser only.
- Andrej Karpathy: The original 2017 transformer architecture — tokens, embeddings, self-attention, the residual stream, and why a single architecture works for text, images, code, and protein sequences.
- Mark Chen (OpenAI): From GPT-2 to InstructGPT — autoregressive language modelling, the role of scale, and what fine-tuning and RLHF actually change about the model's behaviour.
- Aditya Grover: Reinforcement learning as sequence modelling — reframing offline RL as a conditional sequence-prediction problem and what is gained by doing so.
- Lucas Beyer (Google Brain): ViT, image patches as tokens, the scaling properties of vision transformers compared with CNNs, and where the two families converge. (A minimal patch-tokenisation sketch follows this syllabus.)
- John Jumper (DeepMind): How the Evoformer block uses attention over the multiple sequence alignment to predict protein structure, and what AlphaFold's training objective looks like.
- Alec Radford (OpenAI): Whisper — an encoder-decoder transformer trained on 680,000 hours of multilingual web audio, and why a single model handles many languages and tasks.
- Jared Kaplan (Anthropic): Scaling laws for language models, the in-context learning phenomenon, what 'emergent ability' means, and why the term is contested.
- Barret Zoph: Sparsely-activated mixture-of-experts transformers — Switch, GShard, expert routing, and the engineering of training at trillion-parameter scale.
- Tri Dao: FlashAttention, the I/O-aware view of attention kernels, why context length is the bottleneck for many applications, and where the field is heading next.
- Chelsea Finn (Stanford): RT-1 and RT-2 — vision-language-action models that issue motor commands, how they generalise across tasks, and the limits of imitation learning.
- Jean-Baptiste Alayrac (DeepMind): Flamingo — interleaving a frozen language model with a vision encoder via perceiver resamplers, and few-shot prompting with images.
- Chris Olah (Anthropic): Induction heads, the residual stream, circuits, and the case for treating transformers as objects to be reverse-engineered rather than benchmarked.
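As flagged in the Lucas Beyer entry above, here is a minimal sketch of how a vision transformer turns an image into a token sequence. This is our illustration under simple assumptions (square image, non-overlapping patches); the 224×224 resolution and 16-pixel patches match the original ViT paper, while the 64-dimensional embedding is arbitrary.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    grid = image.reshape(H // patch_size, patch_size,
                         W // patch_size, patch_size, C)
    grid = grid.transpose(0, 2, 1, 3, 4)           # bring the two grid axes together
    return grid.reshape(-1, patch_size * patch_size * C)

image = np.zeros((224, 224, 3))            # standard ViT input resolution
tokens = patchify(image)                   # (196, 768): a 14x14 grid of flattened patches
W_embed = np.zeros((tokens.shape[1], 64))  # a learned linear projection in a real ViT
print((tokens @ W_embed).shape)            # (196, 64): a token sequence ready for attention
```

From here the patches are treated exactly like word tokens: add position embeddings, prepend a class token, and feed the sequence to a standard transformer encoder.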
Self-assessment
A short multiple-choice quiz. Click an option to commit; the correct answer and an explanation appear. Your answers are remembered in this browser.
- Question 1. In a transformer, the self-attention weight $A_{ij}$ that token $i$ assigns to token $j$ is essentially proportional to:
- Question 2. What is the residual stream in a transformer?
- Question 3. Why do a transformer's compute and memory scale as $O(N^2)$ in the context length $N$?
- Question 4. Andrej Karpathy describes the transformer as a 'general-purpose differentiable computer'. The main reason this is a useful framing is that:
- Question 5. InstructGPT and the RLHF pipeline mainly change which property of the pretrained language model?
- Question 6. Vision transformers process images by:
- Question 7. AlphaFold's Evoformer block applies attention over:
- Question 8. Mechanistic interpretability work has identified 'induction heads' in transformers. These are heads that:
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).