- Describe the architecture and training recipe of large language models and the empirical scaling laws
- Explain reinforcement learning from human feedback (RLHF) and its role in model alignment
- Outline how multimodal models such as CLIP, Flamingo, and GPT-4V combine vision and language
- Implement retrieval-augmented generation (RAG) to ground LLMs in external knowledge
- Describe autonomous AI agents and the techniques used to make inference efficient (quantisation, distillation, LoRA)
In 2020, few people outside AI research had used a language model. By 2024, hundreds of millions used them daily — to write emails, fix code, plan trips, and answer questions. The change was not one big breakthrough. It was a series of gains in scale, training, and deployment that turned lab work into products.
This chapter covers the technologies behind modern AI systems. You will learn how large language models work and why they get better with scale. You will see how alignment training turns a raw text predictor into a helpful assistant. You will explore multimodal models that combine vision and language, retrieval systems that ground models in real documents, AI agents that take actions in the world, and the efficiency techniques that make all of this practical.
15.1 Large Language Models
A large language model (LLM) is a Transformer [Vaswani, 2017] trained on a massive text corpus to predict the next token. What makes LLMs different from earlier models is not a new architecture. It is scale — more parameters, more data, more compute.
The Scale Progression
- GPT-3 [Brown, 2020] (OpenAI): 175 billion parameters, trained on ~300 billion tokens.
- PaLM [Chowdhery, 2022] (Google): 540 billion parameters.
- LLaMA [Touvron, 2023] (Meta): showed that smaller models trained on more data can match larger ones, setting a new efficiency frontier.
Scaling laws [Kaplan, 2020] predict model performance as a power-law function of compute, data, and parameters. More of each reliably produces better results.
Chinchilla and Compute-Optimal Training
Hoffmann et al. [Hoffmann, 2022] refined the scaling laws. They found that the number of training tokens should scale roughly in proportion to the number of parameters. By this analysis, GPT-3 was significantly under-trained — it should have seen about 3.4 trillion tokens, not 300 billion.
This insight shifted the field. Instead of just building bigger models, researchers began training moderate-sized models on much more data. LLaMA trained a 65-billion-parameter model on 1.4 trillion tokens and matched GPT-3's performance at a fraction of the inference cost.
The scaling laws also show diminishing returns. Doubling performance requires roughly a ten-fold increase in compute. Raw scaling alone will eventually become uneconomical.
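The compute-optimal argument can be made concrete with a small calculation. The sketch below plugs the approximate parametric loss fit published by Hoffmann et al., L(N, D) = E + A/N^α + B/D^β, into toy numbers (the constants are approximate fitted values, and exact fits vary across analyses); it reproduces the qualitative finding that a 70-billion-parameter model trained on 1.4 trillion tokens reaches lower predicted loss than a 175-billion-parameter model trained on 300 billion tokens:

```python
import math

# Approximate parametric fit from Hoffmann et al. (2022):
#   L(N, D) = E + A / N^alpha + B / D^beta
# where N = parameters, D = training tokens.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

gpt3_like = predicted_loss(175e9, 300e9)        # big model, relatively little data
chinchilla_like = predicted_loss(70e9, 1.4e12)  # smaller model, far more data
```

Despite having less than half the parameters, the Chinchilla-style configuration gets closer to the irreducible loss floor E, which is exactly the shift in training strategy described above.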
The Three-Stage Training Pipeline
Modern LLMs are built in three stages:
- Pre-training: train on a large diverse corpus (web text, books, code, papers) using next-token prediction. This produces a base model with broad knowledge but no inclination to follow instructions.
- Supervised fine-tuning (SFT): train on curated instruction–response pairs. The model learns to follow instructions and give well-formatted answers.
- Alignment: refine behaviour using RLHF [Ouyang, 2022] or DPO [Rafailov, 2023] to match human preferences for helpfulness, harmlessness, and honesty.
This pre-train → fine-tune → align pipeline is now the standard recipe.
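All three stages ultimately optimise variants of the same token-level objective. A minimal sketch of the pre-training loss — cross-entropy on the next token — using toy logits rather than a real model:

```python
import math

def next_token_loss(logits, target):
    # numerically stable softmax cross-entropy for one position:
    # loss = log(sum(exp(logits))) - logits[target]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# a toy "sequence" of two positions over a 3-token vocabulary
seq_logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
targets = [0, 1]  # the tokens that actually came next
avg_loss = sum(next_token_loss(l, t)
               for l, t in zip(seq_logits, targets)) / len(targets)
```

SFT uses the same loss restricted to the response tokens of instruction–response pairs; only the alignment stage introduces a different signal (preferences).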
Emergent Capabilities
LLMs exhibit capabilities that were not explicitly trained for and that appear as models grow [Wei, 2022]:
- In-context learning: performing new tasks from examples in the prompt, first documented in GPT-3 [Brown, 2020].
- Chain-of-thought reasoning [Wei, 2022]: prompting the model to "think step by step" dramatically improves performance on maths and logic tasks.
- Zero-shot and few-shot learning: adapting to tasks described in natural language, without any parameter updates.
These capabilities let LLMs serve as general-purpose reasoning engines steered through prompting alone.
Are Emergent Abilities Real?
Schaeffer, Miranda, and Koyejo [Schaeffer, 2023] challenged the idea that capabilities appear as sudden phase transitions. They showed that many "emergent" jumps are artefacts of the evaluation metric. Exact-match accuracy (zero credit for any wrong digit) creates an apparent step change when the underlying improvement is smooth. With continuous metrics like token-level edit distance, the jump dissolves into gradual improvement.
This does not deny that large models can do things small ones cannot. But it cautions against mystical interpretations of scaling. "Emergence" is often a threshold defined by the metric, not a phase transition in the model.
Limitations
An LLM's knowledge comes entirely from its training data. It has no external memory, no way to verify facts, and no grounding in the physical world. Consequences:
- Hallucination: generating confident but incorrect statements.
- Bias: reflecting the biases and perspectives of the training corpus.
- Staleness: knowledge is frozen at training time.
These limitations motivate RAG for grounding, alignment training for safety, and tool use for verification — all covered later in this chapter.
Societal Impact
LLM-powered tools have boosted output in software, education, science, and creative work. But the risks are real: disinformation, academic fraud, job displacement, and concentration of power in the few organisations that can train frontier models. Navigating these trade-offs is one of the defining challenges of AI governance.
15.2 RLHF & Alignment Training
A model trained only on next-token prediction is not helpful by default. It is a statistical model of text — capable of generating anything found in its training data, including toxic and misleading content. Alignment training adjusts its behaviour to match human values and intentions.
The RLHF Pipeline
RLHF [Christiano, 2017; Ouyang, 2022] learns a reward function from human judgments and optimises the model against it. Three stages:
- Collect preferences: the fine-tuned model generates pairs of responses to the same prompt. Human annotators rank them by helpfulness, accuracy, and safety.
- Train a reward model: a Transformer trained to predict human preferences. Given a prompt and response, it outputs a scalar score. Trained with cross-entropy on pairwise preferences using the Bradley–Terry model.
- Optimise the LLM: use proximal policy optimisation (PPO) [Schulman, 2017] to maximise the reward model's score. A KL-divergence penalty keeps the model close to the SFT distribution; without it, optimisation drifts into reward hacking — exploiting quirks of the reward model rather than genuinely improving.
Direct Preference Optimisation (DPO)
DPO [Rafailov, 2023] eliminates the need for an explicit reward model and the instability of RL training. It trains the language model directly on preference data using a classification-like loss. Preferred responses are made more likely; dispreferred responses less likely.
DPO is simpler, more stable, and cheaper than PPO-based RLHF. It has been widely adopted. Variants like IPO and KTO address its remaining limitations.
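The DPO objective itself fits in a few lines. This sketch takes the log-probabilities of the chosen and rejected responses under the policy and under a frozen reference (SFT) model (toy scalars, hypothetical `beta`):

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit reward of each response: beta * (log p_policy - log p_reference);
    # the loss is -log sigmoid of the reward margin between chosen and rejected
    margin = beta * ((lp_chosen - ref_chosen) - (lp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))
```

At initialisation the policy equals the reference, the margin is zero, and the loss is log 2; increasing the chosen response's likelihood (or decreasing the rejected one's) lowers the loss, which is the "classification-like" behaviour described above.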
Constitutional AI (CAI)
Anthropic's CAI [Bai, 2022] reduces reliance on human labelling. The model generates a response, then revises it according to a randomly selected principle from a "constitution" (e.g., be helpful, avoid harm, acknowledge uncertainty). It then chooses between the original and revised version. These self-generated preferences train a reward model for RLHF.
This approach cuts the annotation burden while providing a transparent, auditable set of governing principles.
Open Challenges
- Reward hacking: the model produces verbose, confident-sounding but empty responses that score well.
- Specification gaming: the reward model misses important aspects of intent.
- Diverse values: different people and cultures have conflicting preferences. Any single reward model reflects a particular group of annotators.
- Scalable oversight: how do you supervise AI systems that may exceed human capabilities in certain domains?
Why It Matters
The difference between a base model and an aligned model is dramatic. The base model may produce rambling or harmful text. The aligned model gives clear, helpful, appropriately caveated answers. This transformation is what makes modern assistants usable by the public.
15.3 Multimodal Models
Humans perceive the world through multiple senses. Multimodal AI combines two or more input types — most commonly vision and language — into a single system. The result: models that can describe images, answer questions about photos, generate images from text, and reason about what they see.
CLIP: Aligning Vision and Language
CLIP [Radford, 2021] trains an image encoder and a text encoder jointly on 400 million image–text pairs. The training objective is contrastive: maximise similarity between matching pairs, minimise it for non-matching pairs.
The result is a shared embedding space where images and text live together. This enables:
- Zero-shot image classification: compare an image embedding against text descriptions of each class.
- Image retrieval: search by text query.
- Text-to-image conditioning: CLIP's embeddings guide diffusion models like Stable Diffusion.
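Zero-shot classification in the shared embedding space reduces to a nearest-neighbour lookup. The sketch below uses hand-made stand-in vectors in place of CLIP's actual encoders (the prompts and embeddings are illustrative, not real CLIP outputs):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# stand-in embeddings; a real system would run CLIP's image and text encoders
image_emb = normalize(np.array([0.9, 0.1, 0.0]))
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = normalize(np.array([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0],
                                [0.0, 0.0, 1.0]]))

# cosine similarity between the image and every class description
similarities = text_embs @ image_emb
predicted = class_prompts[int(np.argmax(similarities))]
```

Because vectors are unit-normalised, the dot product is the cosine similarity, and no classifier head is trained: adding a new class means adding a new text prompt.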
Large Multimodal Models (LMMs)
Models like GPT-4V, Gemini, and LLaVA accept both text and images as input. The typical architecture:
- A pre-trained vision encoder (often CLIP ViT) converts an image into visual tokens.
- A projection layer maps visual tokens into the language model's embedding space.
- A large language model processes the interleaved visual and textual tokens.
Training happens in two stages: first align visual and text representations (train the projection layer on image-caption data), then instruction-tune the full model on multimodal tasks.
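In shape terms, the architecture above is just a projection followed by concatenation. A minimal sketch with made-up dimensions (the sizes and random matrices are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_text, n_patches = 32, 48, 4

# stage 1 trains only W_proj on image-caption data; the encoders stay frozen
visual_tokens = rng.normal(size=(n_patches, d_vision))  # from a vision encoder
W_proj = rng.normal(size=(d_vision, d_text)) * 0.02     # trainable projection layer

projected = visual_tokens @ W_proj                 # now in the LLM embedding space
text_tokens = rng.normal(size=(6, d_text))         # embedded prompt tokens
sequence = np.concatenate([projected, text_tokens])  # interleaved LLM input
```

The language model then processes `sequence` exactly as it would a text-only prompt, which is why a pre-trained LLM can be reused with relatively little multimodal training.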
Capabilities
Modern LMMs go far beyond captioning. They can:
- Parse complex scenes and count objects
- Read text in images (OCR)
- Interpret charts and diagrams
- Reason about spatial relationships
- Analyse screenshots and hand-drawn diagrams
Multimodal Generation
The other direction: generating images and video from text.
- Text-to-image: DALL·E 3, Stable Diffusion XL, Imagen 3 use diffusion models conditioned on text embeddings.
- Text-to-video: generating coherent video from text is much harder because of temporal consistency.
- Audio-language: models like Whisper bridge speech and text for transcription and translation.
The trend is toward "any-to-any" models that handle all modalities.
Challenges
On the technical side, different modalities have fundamentally different structures (continuous pixels vs discrete tokens). Multimodal hallucination — describing objects not in an image — is a particular problem.
On the ethical side: deepfakes, misinformation, copyright concerns when generating in a specific artist's style, and privacy when training on or generating images of real people.
15.4 Retrieval-Augmented Generation
An LLM's knowledge is frozen at training time. Ask about recent events or proprietary documents and it must either admit ignorance or hallucinate. RAG [Lewis, 2020] fixes this by giving the model access to external documents at inference time.
How RAG Works
A typical pipeline has three components:
- Document store: a corpus of text chunks, each encoded as a dense vector by a pre-trained encoder (e.g., a sentence Transformer).
- Retriever: takes the user's query, encodes it, and retrieves the top-k most similar chunks via approximate nearest-neighbour search over a vector database.
- Generator: a language model that receives the query plus the retrieved chunks and produces a grounded response.
The knowledge base can be updated by adding or removing documents — no retraining required.
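The three components fit together in a few dozen lines. This sketch swaps the sentence-Transformer encoder for a toy bag-of-words embedding and uses exact rather than approximate nearest-neighbour search, so it is a minimal illustration of the data flow, not a production retriever:

```python
import numpy as np

docs = ["the moon orbits the earth",
        "python is a programming language",
        "transformers use attention"]

vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

def embed(text):
    # toy bag-of-words "encoder" standing in for a real sentence Transformer
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        w = w.strip("?.,!")
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

index = np.stack([embed(d) for d in docs])   # the "vector database"

def retrieve(query, k=1):
    sims = index @ embed(query)              # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-sims)[:k]]

query = "what do transformers use?"
prompt = f"Context: {retrieve(query)[0]}\nQuestion: {query}"
```

The generator never sees the whole corpus, only the prompt with the retrieved chunk prepended, and updating the knowledge base is just re-embedding and re-indexing documents.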
Retrieval Quality Is Everything
If the retriever misses relevant documents, the generator falls back to parametric knowledge or hallucinates. If it retrieves irrelevant documents, the generator may be misled.
Dense retrieval (encoding queries and documents as vectors) has largely replaced sparse methods (TF-IDF, BM25), though hybrid approaches often work best. Re-ranking with a cross-encoder model improves precision at modest cost.
Advanced RAG
Several refinements have emerged:
- Recursive retrieval: multiple rounds of retrieval, using the model's intermediate outputs to formulate new queries.
- Self-RAG (Asai et al., 2023): the model decides when retrieval is needed, evaluates retrieved documents for relevance, and assesses whether its response is supported by evidence.
- Corrective RAG: checks retrieved documents and triggers a web search if they are not relevant enough.
Chunking Matters
How you split documents into chunks has a big impact. Too small and they lack context. Too large and relevant information gets diluted. Common strategies: overlapping chunks, semantic chunking at paragraph boundaries, and hierarchical chunking (retrieve a passage plus its parent section). The choice of embedding model also matters — models trained specifically for retrieval (E5, GTE, BGE) outperform general-purpose embeddings.
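The simplest of these strategies, a fixed-size sliding window with overlap, looks like this (character-based for brevity; production systems often chunk by tokens or at semantic boundaries, as noted above):

```python
def chunk(text, size=200, overlap=50):
    # fixed-size sliding window: consecutive chunks share `overlap` characters
    # so that sentences straddling a boundary appear whole in at least one chunk
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(500))
pieces = chunk(text, size=200, overlap=50)
```

Tuning `size` and `overlap` is exactly the trade-off described above: small chunks lose context, large chunks dilute the relevant span.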
Where RAG Is Used
RAG is standard in enterprise AI: answering questions about internal documents, product support, and current information. Consumer products use it too — search engines with generative summaries, research assistants grounded in academic literature. RAG is more efficient, more updatable, and more verifiable than baking ever more knowledge into ever larger models.
15.5 AI Agents
An AI agent perceives its environment, reasons about goals, plans actions, executes them (often by calling external tools), and adapts based on feedback. The concept dates back to the rational agents of Russell and Norvig [Russell, 2020], but the modern version uses a large language model as its reasoning core, augmented with tool access. This is the shift from models that generate text to systems that take actions.
Agent Architecture
A typical LLM-based agent has four components:
- Language model: interprets the request, reasons about what to do, and generates plans.
- Tool-use interface: defines available tools (APIs, databases, code interpreters) and the protocol for calling them.
- Memory system: stores conversation history, previous actions and results, and retrieved context.
- Observation loop: feeds each action's result back to the model, which decides whether the goal is met or more steps are needed.
The ReAct framework [Yao, 2022] formalises this as an interleaved sequence of Reasoning, Acting, and Observing.
Tool Use
Tool use transforms a text generator into an agent. You describe available tools (names, parameters, expected behaviour), and the model generates structured calls (typically JSON) that the host system executes. Common tools:
- Web search for current information
- Code execution for calculations and data analysis
- File operations for reading and writing documents
- API calls for interacting with external services
The model selects tools, formulates parameters, interprets results, and chains calls together.
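The host side of this loop can be sketched as a tiny dispatcher. The tool registry and JSON schema below are hypothetical illustrations of the pattern, not any particular framework's API:

```python
import json

# hypothetical tool registry; a real agent would also expose search, files, APIs
TOOLS = {
    "calculator": lambda args: str(eval(args["expr"], {"__builtins__": {}})),
}

def execute_tool_call(call_json):
    call = json.loads(call_json)          # structured call emitted by the model
    result = TOOLS[call["name"]](call["arguments"])
    return result                         # observation fed back into the context

# in a real loop the model generates this string; here it is hard-coded
observation = execute_tool_call(
    '{"name": "calculator", "arguments": {"expr": "17 * 3"}}')
```

The observation string is appended to the conversation, and the model decides whether the goal is met or another call is needed — the observation loop described above.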
Planning
Simple tasks need one tool call. Complex tasks — research a topic, write a report, format it as slides — need the agent to decompose goals, order steps, handle dependencies, and recover from errors.
Approaches include:
- Chain-of-thought: works for simple plans.
- Tree-of-thought: explores multiple branches and picks the best.
- Plan-and-execute: generate a full plan, then carry it out step by step, revising as new information arrives.
- Reflection: the agent evaluates its own performance and adjusts strategy.
Multi-Agent Systems
Multiple specialised agents can collaborate on a task. Different agents take different roles (researcher, writer, critic, coder). A supervisor routes sub-tasks and integrates results. Multi-agent debate — where agents argue positions and a judge selects the best answer — improves accuracy by surfacing errors through adversarial scrutiny.
Safety Concerns
Agents can make errors that propagate through action chains. If an agent has access to email, financial systems, or databases, mistakes can cause real harm. Essential safeguards: robust error handling, confirmation prompts for high-stakes actions, and sandboxed execution.
Broader questions remain: who is responsible when an agent causes harm? Can users understand why the agent took an action? Can users always override or halt it? As agents gain autonomy, these questions become urgent.
15.6 Efficient AI
Training a frontier LLM costs millions of dollars. Serving it at scale requires dedicated infrastructure. Efficient AI cuts these costs, making strong models more usable, more green, and able to run on more hardware — from data centres to phones.
Quantisation
Reduce the numerical precision of weights and activations to cut memory and speed up computation:
- FP32 → FP16/BF16: negligible quality loss. Most models train in mixed precision from the start.
- FP16 → INT8: halves memory with minimal quality degradation.
- 4-bit (GPTQ, AWQ, GGML): a further 2–4× memory reduction with some quality trade-off.
Transformer weights cluster near zero with a few outliers. Mixed-precision and group-wise quantisation schemes exploit this structure.
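Symmetric per-tensor INT8 quantisation, the simplest of these schemes, is a few lines (a minimal sketch; real systems add per-group scales and outlier handling as noted above):

```python
import numpy as np

def quantize_int8(w):
    # symmetric quantisation: map [-max|w|, max|w|] linearly onto [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = (rng.normal(size=256) * 0.02).astype(np.float32)  # small, zero-centred weights
q, scale = quantize_int8(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())
```

The rounding error is bounded by half the scale, which is why a single large outlier hurts: it inflates `scale` and coarsens every other weight — the motivation for the group-wise schemes above.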
Knowledge Distillation
Transfer capabilities from a large "teacher" to a smaller "student." The student trains on the teacher's full probability distribution (soft predictions), not just hard labels. Hinton et al. [Hinton, 2015] introduced a temperature parameter to soften the teacher's outputs, making inter-class similarities visible to the student.
Distillation can produce students 2–10× smaller that retain 90–95% of the teacher's performance. Modern approaches also distil chain-of-thought reasoning, letting smaller models emulate the reasoning of larger ones.
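The temperature-softened objective can be written directly. This sketch computes the KL divergence between teacher and student distributions at temperature T, with the standard T² scaling from Hinton et al. (toy logits, no real models):

```python
import math

def softmax_t(logits, T):
    # temperature-softened softmax; higher T spreads probability mass,
    # exposing which wrong answers the teacher considers "almost right"
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradients keep a consistent magnitude as T varies
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student matches the teacher exactly and positive otherwise; in practice it is combined with the ordinary cross-entropy on the true labels.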
LoRA (Low-Rank Adaptation)
Fine-tuning all parameters of a large model is expensive. LoRA [Hu, 2021] freezes the base model and injects small trainable matrices alongside each weight matrix. The update is parameterised as W = W0 + BA, where B and A have low rank r (typically 4–16).
This reduces trainable parameters by roughly 1,000×. You can fine-tune a model with tens of billions of parameters on a single consumer GPU. At inference, the adapters can be swapped for fast task switching or merged into the base weights for zero overhead.
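The forward pass and the parameter saving are both visible in a few lines of linear algebra (a minimal numpy sketch with illustrative sizes; the alpha/r scaling follows the LoRA paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                          # hidden size, low rank

W0 = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-initialised

def lora_forward(x, alpha=16):
    # effective weight W = W0 + (alpha / r) * B @ A; only A and B are trained
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d))
trainable, full = 2 * d * r, d * d    # LoRA params vs full fine-tuning
```

Because B starts at zero, the adapted model initially matches the base model exactly, and merging the update back into W0 at inference time costs nothing extra.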
QLoRA [Dettmers, 2023] combines LoRA with 4-bit quantisation. The Hugging Face PEFT library provides standard implementations. Most open-source fine-tuned models ship as LoRA adapters on top of a base checkpoint. LoRA has turned fine-tuning frontier models from an enterprise project into a weekend project.
Pruning
Remove unnecessary parts of a trained model:
- Unstructured pruning: set individual weights to zero based on magnitude or gradient. Produces sparse matrices that need specialised hardware.
- Structured pruning: remove entire neurons, attention heads, or layers. Runs faster on standard hardware.
The lottery ticket hypothesis [Frankle, 2018] showed that large networks contain small subnetworks that, trained in isolation, can match the full model's performance. This gives a basis for why pruning works.
Architecture-Level Efficiency
- Mixture-of-experts [Fedus, 2021]: activate only a fraction of parameters per token. Trillions of parameters at the cost of a much smaller dense model.
- Flash Attention [Dao, 2022]: reduces self-attention memory from quadratic to linear in sequence length through hardware-aware computation.
- Speculative decoding: parallelises parts of sequential generation to reduce latency.
- Grouped-query and multi-query attention: reduce the key–value cache, often the bottleneck when serving many concurrent users.
Why Efficiency Matters
Efficiency is not just about cost. It determines who can access powerful AI. If state-of-the-art models only run on expensive GPU clusters, the benefits concentrate in well-resourced organisations. On-device AI on phones and laptops requires aggressive compression. The planet demands that growing compute costs be offset by better methods. And in real-time applications — translation, coding assistants, autonomous vehicles — efficiency directly determines the quality of the user experience.