Also known as: RAG
Retrieval-Augmented Generation (RAG) addresses a fundamental limitation of large language models: their knowledge is static, encoded entirely in parameters at training time, and cannot be updated without retraining. When asked about recent events or domain-specific information not in the training corpus, an LLM will often hallucinate a plausible but fabricated answer. RAG mitigates this by fetching relevant documents from an external knowledge base and including them in the model's context.
A typical RAG pipeline has three components. A document store holds text chunks (paragraphs, pages, passages), each represented as a dense vector embedding from a pretrained encoder. A retriever encodes the user's query into the same embedding space and fetches the top-$k$ most similar chunks via approximate nearest-neighbour search over a vector database (Pinecone, Weaviate, Qdrant, FAISS). A generator (the LLM) receives the query along with the retrieved chunks and produces a response grounded in the evidence.
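The three components above can be sketched end-to-end. This is a minimal illustration only: the `embed` function here is a toy term-frequency stand-in for a real pretrained encoder, and the exhaustive cosine search stands in for the approximate nearest-neighbour index a vector database would provide.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a dense encoder: a term-frequency vector.
    # A real pipeline would call a pretrained embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Document store + retriever: rank chunks by similarity to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, evidence: list[str]) -> str:
    # Generator input: the query plus the retrieved evidence.
    context = "\n".join(f"- {c}" for c in evidence)
    return f"Answer using only the evidence below.\n{context}\nQuestion: {query}"

chunks = [
    "RAG fetches relevant documents and adds them to the model's context.",
    "Vector databases support approximate nearest-neighbour search.",
    "The Eiffel Tower is in Paris.",
]
query = "How does RAG ground the model's answer?"
evidence = retrieve(query, chunks)
prompt = build_prompt(query, evidence)
```

The resulting prompt would then be passed to the LLM, which generates a response grounded in the retrieved chunks rather than in its parameters alone.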
RAG enables LLMs to answer questions about proprietary documents, internal knowledge bases, and information newer than the training cutoff—without retraining. Output quality depends critically on retrieval: if the relevant evidence is never retrieved, the generator is left to hallucinate or repeat misinformation. Hybrid retrieval combining dense embeddings with sparse methods (BM25), re-ranking with cross-encoders, adaptive chunking strategies, and self-assessment techniques like Self-RAG all improve performance. RAG has become a standard component of enterprise AI and increasingly powers consumer products: search engines with generative summaries, customer support chatbots, research assistants, and more.
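One common way to combine dense and sparse (BM25) rankings is reciprocal rank fusion (RRF), which scores each document by the sum of reciprocals of its ranks across the input lists. A hedged sketch follows; the two input rankings are hypothetical placeholders for the outputs of an embedding search and a BM25 search.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each ranking is a list of document
    # IDs, best first; k = 60 is a conventional smoothing constant.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d2", "d1", "d3"]   # from embedding similarity (assumed)
sparse_ranking = ["d2", "d1", "d4"]  # from BM25 (assumed)
fused = rrf([dense_ranking, sparse_ranking])
```

Documents ranked highly by both retrievers rise to the top; the fused list can then be re-ranked by a cross-encoder before being passed to the generator.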
Related terms: Large Language Model, Embedding, Hallucination
Discussed in:
- Chapter 15: Modern AI — Retrieval-Augmented Generation
Also defined in: Textbook of AI