GPT (Generative Pre-trained Transformer), introduced by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever at OpenAI in 2018, is a decoder-only autoregressive Transformer pre-trained on language modelling and fine-tuned for downstream tasks. The architecture predicts each token given all previous tokens, with causal masking ensuring no information flows from future to past.
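Causal masking can be illustrated with a small pure-Python sketch. It is not taken from any GPT implementation: the function name `causal_attention_weights` and the toy score matrix are assumptions made for illustration, and real models apply the same idea to batched tensors across many attention heads.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the raw attention score of query position i for key
    position j. Positions j > i lie in the future and receive weight 0.
    (Hypothetical helper, for illustration only.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions 0..i.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Normalised weights for the past, zeros for the masked future.
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Toy example: identical scores everywhere, 4-token sequence.
weights = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions and assigns exactly zero to later ones, so the prediction at position i cannot depend on tokens after i.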
Successive generations have scaled up dramatically:
- GPT-1 (2018): 117M parameters, trained on a corpus of about 7,000 books.
- GPT-2 (2019): 1.5B parameters, trained on 40GB of WebText. OpenAI initially withheld the full release, citing misuse concerns.
- GPT-3 (2020): 175B parameters, 300B training tokens. The first language model with strong few-shot in-context-learning performance, often able to perform new tasks from a few examples in the prompt without any gradient update.
- GPT-3.5 / ChatGPT (2022): fine-tuned with RLHF to follow conversational instructions; the consumer product that brought LLMs to mass attention.
- GPT-4 (2023): multimodal, with substantial reasoning improvements over GPT-3.5; technical details largely undisclosed.
- GPT-4 Turbo (late 2023) / GPT-4o (2024): efficiency improvements; GPT-4o is natively multimodal.
- o1 / o3 (2024–): reasoning models trained with reinforcement learning to use extended chain-of-thought.
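The parameter counts quoted for the first three generations can be roughly sanity-checked with a common rule of thumb: a decoder block holds about 12 × d_model² weights (about 4 × d_model² in the attention projections and 8 × d_model² in an MLP with a 4× hidden width). The layer counts and widths below are assumptions drawn from the published model descriptions, not from the text above, and the estimate deliberately ignores embeddings, biases and layer norms.

```python
def approx_decoder_params(n_layers, d_model):
    """Rule-of-thumb weight count for a stack of Transformer decoder blocks.

    Counts roughly 4*d^2 for the attention projections plus 8*d^2 for the
    MLP per layer; embeddings, biases and layer norms are ignored.
    """
    return 12 * n_layers * d_model ** 2

# (n_layers, d_model) as reported for the largest model of each generation;
# treated here as assumptions.
configs = {
    "GPT-1": (12, 768),
    "GPT-2": (48, 1600),
    "GPT-3": (96, 12288),
}

estimates = {name: approx_decoder_params(*cfg) for name, cfg in configs.items()}
for name, n in estimates.items():
    print(f"{name}: about {n / 1e9:.2f}B block parameters")
```

The estimate lands near 174B for GPT-3 and 1.47B for GPT-2, close to the quoted figures. For GPT-1 it gives about 85M; the gap to the quoted 117M is mostly the token embedding table, which this rule leaves out.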
The decoder-only autoregressive Transformer paradigm proved highly general: it can be applied to any sequential data that can be tokenized, from natural language to code to protein sequences to audio tokens. The GPT line is the dominant architectural lineage of the modern LLM era, with Claude, Gemini, LLaMA, Mistral, DeepSeek and others following the same basic blueprint.
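The generality comes from the fact that the generation loop only ever sees a sequence of discrete tokens. The sketch below makes that point with a deliberately tiny stand-in: a bigram count table replaces the Transformer, and the names `fit_bigram` and `generate` are hypothetical. The same greedy next-token loop runs unchanged on characters and on amino-acid labels.

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count which token follows which (a toy stand-in for a trained model)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive decoding: repeatedly append the likeliest next token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no continuation was ever observed for this token
        out.append(followers.most_common(1)[0][0])
    return out

# The loop does not care what the tokens represent.
text = generate(fit_bigram(["abcabc"]), "a", 4)
protein = generate(fit_bigram([["MET", "ALA", "GLY"]]), ["MET"], 5)
print(text)
print(protein)
```

A real GPT conditions on the whole prefix rather than only the last token, but the outer loop (predict a distribution, pick a token, append, repeat) has the same shape.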
Related terms: Alec Radford, Transformer, BERT, Language Model
Discussed in:
- Chapter 13: Attention & Transformers