Also known as: RAG
Retrieval-Augmented Generation (RAG) addresses a fundamental limitation of large language models: their knowledge is static, encoded entirely in parameters at training time, and cannot be updated without retraining. When asked about recent events or domain-specific information not in the training corpus, an LLM will often hallucinate a plausible but fabricated answer. RAG mitigates this by fetching relevant documents from an external knowledge base and including them in the model's context.
A typical RAG pipeline has three components. A document store holds text chunks (paragraphs, pages, passages), each represented as a dense vector embedding from a pretrained encoder. A retriever encodes the user's query into the same embedding space and fetches the top-$k$ most similar chunks via approximate nearest-neighbour search over a vector database (Pinecone, Weaviate, Qdrant, FAISS). A generator (the LLM) receives the query along with the retrieved chunks and produces a response grounded in the evidence.
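The three components above can be sketched end-to-end. This is a minimal illustration only: the `embed` function here is a toy term-frequency stand-in for a real pretrained encoder, and the exhaustive cosine search stands in for the approximate nearest-neighbour index a vector database would provide.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a dense encoder: a term-frequency vector.
    # A real pipeline would call a pretrained embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Document store + retriever: rank chunks by similarity to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, evidence: list[str]) -> str:
    # Generator input: the query plus the retrieved evidence.
    context = "\n".join(f"- {c}" for c in evidence)
    return f"Answer using only the evidence below.\n{context}\nQuestion: {query}"

chunks = [
    "RAG fetches relevant documents and adds them to the model's context.",
    "Vector databases support approximate nearest-neighbour search.",
    "The Eiffel Tower is in Paris.",
]
query = "How does RAG ground the model's answer?"
evidence = retrieve(query, chunks)
prompt = build_prompt(query, evidence)
```

The resulting prompt would then be passed to the LLM, which generates a response grounded in the retrieved chunks rather than in its parameters alone.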
RAG enables LLMs to answer questions about proprietary documents, internal knowledge bases, and information newer than the training cutoff—without retraining. Output quality depends critically on retrieval: if the relevant evidence is never retrieved, the generator is left to hallucinate or repeat misinformation. Hybrid retrieval combining dense embeddings with sparse methods (BM25), re-ranking with cross-encoders, adaptive chunking strategies, and self-assessment techniques like Self-RAG all improve performance. RAG has become a standard component of enterprise AI and increasingly powers consumer products: search engines with generative summaries, customer support chatbots, research assistants, and more.
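One common way to combine dense and sparse (BM25) rankings is reciprocal rank fusion (RRF), which scores each document by the sum of reciprocals of its ranks across the input lists. A hedged sketch follows; the two input rankings are hypothetical placeholders for the outputs of an embedding search and a BM25 search.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each ranking is a list of document
    # IDs, best first; k = 60 is a conventional smoothing constant.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d2", "d1", "d3"]   # from embedding similarity (assumed)
sparse_ranking = ["d2", "d1", "d4"]  # from BM25 (assumed)
fused = rrf([dense_ranking, sparse_ranking])
```

Documents ranked highly by both retrievers rise to the top; the fused list can then be re-ranked by a cross-encoder before being passed to the generator.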
Related terms: Large Language Model, Embedding, Hallucination
Discussed in:
- Chapter 15: Modern AI — Retrieval-Augmented Generation
Also defined in: Textbook of AI