15.14 Retrieval-augmented generation
A language model trained on a fixed corpus knows what was in that corpus and very little else. When you ask it about a paper published last week, a clinical guideline updated this morning, the contents of your own organisation's wiki, or a regulation that postdates the training cut-off, the model has three options. It can refuse, it can guess, or it can hallucinate something plausible-looking. None of these are useful. Worse, the model has no reliable way of knowing which of the three it is doing on any given query, because the only signal it has is its own internal probability distribution, and that distribution assigns comfortably high probability to confident-sounding nonsense.
Retrieval-augmented generation, or RAG, addresses this by attaching an external knowledge store to the model and letting it pull relevant documents at query time. The basic idea is older than transformers: classical question-answering systems combined a retriever with a reader for years. But the term as we now use it dates to Lewis et al.'s 2020 paper, and its practical importance has grown roughly in step with the deployment of LLMs into production. By 2026 nearly every serious LLM application that needs current information, organisation-specific knowledge, or domain expertise that was not in the training corpus uses some flavour of RAG. The patterns differ; the principle does not.
This section unpacks the pipeline, the components, and the failure modes. It assumes you have already met in-context learning in §15.10 and prepares the ground for the agent and tool-use material in §15.13. RAG can be thought of as the simplest form of tool use: one tool, the retriever, called once per query. Many of the questions it raises about grounding, evaluation and faithfulness recur, in more complex form, when an agent has dozens of tools to choose from.
The RAG pipeline
The standard RAG pipeline has three stages: index, retrieve, generate.
Index is performed once per document, ideally offline. Each document $d_i$ in the corpus is split into chunks, typically a few hundred tokens, and each chunk $c$ of $d_i$ is fed through an embedding model to produce a vector $\mathbf{e}_c \in \mathbb{R}^n$. The choice of chunk size matters: too small and the chunk loses local context; too large and a single chunk has to encode several distinct ideas, blurring the embedding. A common compromise is 512–1024 tokens with a hundred-token overlap so that a sentence straddling a boundary is fully present in at least one chunk. The vectors are written to a vector database along with metadata: the source document, the page, the section heading, perhaps a timestamp.
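To make the indexing stage concrete, here is a minimal Python sketch. The whitespace splitter is a crude stand-in for a proper tokeniser, the model name is just one open-weights example, and the in-memory list stands in for a real vector database.

```python
from sentence_transformers import SentenceTransformer

def chunk(text, size=512, overlap=100):
    """Split text into overlapping chunks of roughly `size` whitespace tokens."""
    tokens = text.split()                      # crude tokenisation, for illustration only
    step = size - overlap
    return [" ".join(tokens[i:i + size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Any embedder with the same encode() interface works; this is one open-weights example.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def index_document(doc_id, text, store):
    """Embed each chunk and append it, with metadata, to the in-memory store."""
    for pos, chunk_text in enumerate(chunk(text)):
        vec = model.encode(chunk_text, normalize_embeddings=True)
        store.append({"doc": doc_id, "chunk": pos, "text": chunk_text, "embedding": vec})
```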
Retrieve is performed once per query. The query $x$ is embedded with the same model used for the documents, producing $\mathbf{e}_x$. The vector database is then asked for the top-$k$ chunks whose embeddings are closest to $\mathbf{e}_x$ under cosine similarity or dot product. Typical $k$ is 5 to 20. The retrieval is approximate: exact nearest-neighbour search over millions of vectors is too slow for interactive use, and the index structures used to make it fast are discussed below.
Generate is the LLM call itself. The retrieved chunks are concatenated, prefixed with the query, and wrapped in a prompt that instructs the model to answer based on the supplied passages and to cite them where possible. A representative template is: "Answer the question using only the information in the passages below. If the passages do not contain the answer, say so." The cited-passage style serves two purposes: it gives the user something to verify, and it discourages the model from drifting into ungrounded generation.
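Continuing the sketch above, retrieval and generation might look like the following. `call_llm` is a placeholder for whatever chat-completion client you use, and the prompt follows the template quoted in the text.

```python
import numpy as np

def retrieve(query, store, k=5):
    """Embed the query and return the k chunks with the highest cosine similarity."""
    q = model.encode(query, normalize_embeddings=True)
    # With normalised vectors, the dot product is the cosine similarity.
    ranked = sorted(store, key=lambda r: float(np.dot(q, r["embedding"])), reverse=True)
    return ranked[:k]

def answer(query, store, call_llm, k=5):
    """Build the grounded prompt and call the (placeholder) generator."""
    passages = retrieve(query, store, k)
    context = "\n\n".join(f"[{i + 1}] ({p['doc']}) {p['text']}"
                          for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the information in the passages below. "
        "If the passages do not contain the answer, say so. Cite passage numbers.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt), passages
```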
Three architectural choices live above this skeleton. First, whether the retriever and the generator are jointly trained (Lewis et al. 2020 did this; almost no one does in production) or treated as independent components (the dominant pattern in 2026). Second, whether retrieval happens once at the start or repeatedly during generation: iterative or "active" retrieval has the model emit a query, read the result, then emit another query. Third, what gets indexed: raw text, summaries, hypothetical answers, or a knowledge graph derived from the corpus.
Embedding models
The embedding model is the single most important component of a RAG system. If it places a query and its true answer far apart in vector space, no amount of cleverness later in the pipeline will recover the relevant passage. The dominant families in early 2026 are OpenAI's text-embedding-3 (small and large variants, at 1536 and 3072 dimensions), Voyage AI's voyage-3 series, Cohere's embed-v3, and on the open-weights side BGE (BAAI General Embedding), E5 from Microsoft, and Jina's embedders. Typical dimensions sit between 768 and 3072, with the trend gently upward as storage costs fall.
These models are trained with contrastive objectives. Given a query and a positive passage that answers it, the model is pushed to make their embeddings close; given the same query and a batch of unrelated passages, it is pushed to make those embeddings far apart. The InfoNCE loss is the standard formulation, and the quality of the negatives matters enormously: random negatives produce a model that distinguishes "biology" from "underwater welding" but cannot tell two papers on the same topic apart. Most modern embedding training pipelines mine hard negatives, often using an earlier version of the same model.
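In symbols, writing $s(\cdot,\cdot)$ for the similarity (typically cosine), $\tau$ for a temperature, $\mathbf{e}^+$ for the positive passage's embedding and $\mathbf{e}^-_j$ for the negatives', one standard form of the InfoNCE loss for a query embedding $\mathbf{e}_x$ is

$$\mathcal{L} = -\log \frac{\exp\big(s(\mathbf{e}_x, \mathbf{e}^+)/\tau\big)}{\exp\big(s(\mathbf{e}_x, \mathbf{e}^+)/\tau\big) + \sum_j \exp\big(s(\mathbf{e}_x, \mathbf{e}^-_j)/\tau\big)}.$$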
Domain-specialised embedders sometimes outperform general-purpose ones by large margins. For code, specialised embeddings learn that a function definition and a call site are related even when their textual overlap is small. For biomedical text, a model trained on PubMed knows that "myocardial infarction" and "MI" mean the same thing, and that "STEMI" is a more specific instance of both. The cost is that a domain-specialised embedder requires a fresh index whenever the underlying model is updated, which complicates rolling deployments.
Two practical considerations shape the choice. The first is whether the embedder is multilingual; if your corpus contains French, Mandarin or Arabic alongside English, you need a model that places semantically equivalent passages from different languages near each other in vector space. The second is whether the embedder supports asymmetric encoding: a different prompt or a different projection for the query side and the document side. Asymmetric encoding helps because queries and documents look very different in practice: queries are short, often interrogative, often missing context that the document provides. A symmetric encoder treats them identically, which leaves performance on the table.
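Concretely, many open-weights embedders implement asymmetric encoding as nothing more than a text prefix on each side. The sketch below uses the "query: " and "passage: " convention of the E5 family; the exact strings are model-specific, so take them from the model card of whatever embedder you actually deploy.

```python
from sentence_transformers import SentenceTransformer

e5 = SentenceTransformer("intfloat/e5-base-v2")   # example E5-style model

def embed_query(text):
    # Queries and passages go through the same model but with different prefixes.
    return e5.encode("query: " + text, normalize_embeddings=True)

def embed_passage(text):
    return e5.encode("passage: " + text, normalize_embeddings=True)
```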
Vector databases
Once you have embeddings, you need somewhere to store them and an index that supports fast nearest-neighbour search. The major options divide into in-process libraries and standalone services. FAISS (Facebook AI Similarity Search) is the in-process workhorse: a C++ library with Python bindings, used as the retrieval layer for many academic systems and a non-trivial fraction of production ones. It is a library, not a database, so it has no concept of updates, transactions or distributed storage; you build an index, save it to disk, and load it.
For larger or production deployments, Pinecone, Milvus, Qdrant, Weaviate and pgvector (a Postgres extension) provide a database-like interface (insert, update, delete, filter by metadata) over the same underlying ANN algorithms. Pinecone is fully managed; the others are open source and self-hostable.
The two index families that dominate are HNSW (Hierarchical Navigable Small World) and IVF-PQ (Inverted File with Product Quantisation). HNSW builds a multi-layer graph where each node points to a few neighbours; search descends the layers, getting closer at each step. It is fast and accurate at moderate scale (millions of vectors) but uses a lot of memory because each vector is stored in full. IVF-PQ partitions the space into Voronoi cells, then compresses vectors within each cell using product quantisation; this trades a small accuracy loss for very large memory savings, and is the choice when you have hundreds of millions or billions of vectors.
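The FAISS sketch below shows the two families side by side; the dimension, cell count and quantisation parameters are illustrative rather than tuned recommendations.

```python
import faiss
import numpy as np

d = 768                                                 # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")       # placeholder document vectors

# HNSW: graph-based, full vectors kept in memory, no training step.
hnsw = faiss.IndexHNSWFlat(d, 32)                       # 32 = neighbours per node
hnsw.add(xb)

# IVF-PQ: coarse partition into cells, product-quantised vectors inside each cell.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)     # 1024 cells, 64 sub-vectors, 8 bits each
ivfpq.train(xb)                                         # learns the cells and codebooks
ivfpq.add(xb)
ivfpq.nprobe = 16                                       # cells to visit at query time

xq = np.random.rand(1, d).astype("float32")             # a placeholder query vector
distances, ids = hnsw.search(xq, 5)                     # top-5 approximate neighbours
```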
Reranking
Embedding-based retrieval is a bi-encoder approach: query and document are embedded independently, and only their dot product is used. This is fast, because every document embedding can be precomputed, but it discards information. The model never sees query and document together, so it cannot consider how a particular phrase in the query interacts with a particular phrase in the document.
A cross-encoder reranker fixes this. Concatenate query and document, feed them through a transformer, and output a single relevance score. Cross-encoders are slow, because the model must run once per query–document pair, so you cannot use them as a primary retriever over a million documents. But you can use them as a second stage: retrieve the top-50 with the embedder, rerank with the cross-encoder, keep the top-5. The improvement in answer quality is typically substantial and often more cost-effective than swapping in a larger generator.
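The second stage is only a few lines. In the sketch below, the model name is one widely used open reranker trained on MS MARCO, chosen as an example rather than a recommendation; the candidates are assumed to be chunk records like those produced by the indexing sketch earlier.

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder with a (query, passage) -> score interface slots in the same way.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, keep=5):
    """Score each candidate chunk jointly with the query and keep the best few."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```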
Hybrid search
Dense embeddings are good at semantics (they recognise paraphrases and topical relatedness) and bad at exact strings. Ask a dense retriever for documents containing the error code ENOSPC and it may serve up passages about disk space generally while missing the chunk that mentions the exact code. Sparse retrieval methods such as BM25, the workhorse of pre-neural information retrieval, do the opposite: they reward exact term overlap, weighted by inverse document frequency, and largely ignore semantics.
Hybrid search runs both in parallel and combines the rankings. The simplest approach, reciprocal rank fusion, scores each document as the sum of $1/(k + r_i)$ over rankers, where $r_i$ is its rank in each list and $k$ is a constant, conventionally 60. More sophisticated weightings learn the combination from labelled data. In benchmarks across 2024 and 2025, hybrid retrieval consistently outperformed either component alone, especially in domains rich in proper nouns, identifiers and acronyms. Legal, medical and engineering corpora gain the most.
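Reciprocal rank fusion itself is a few lines of code; the sketch below assumes each retriever returns a best-first list of document ids.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense ranking with a BM25 ranking of the same corpus.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```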
A related family of methods, sometimes called learned sparse retrieval, fits between the two. SPLADE and its successors use a transformer to predict a sparse vector, non-zero only on a few thousand vocabulary terms, and then run a classical inverted-index search. The vectors are sparse like BM25 (so the inverted index is fast) but learned like dense embeddings (so they capture synonyms and morphological variants). For corpora where you need both speed and semantic awareness, learned sparse retrieval is now a credible third alternative to plain dense or plain BM25, and several production stacks combine all three into a single reranked pipeline.
Evaluation
You cannot improve a RAG system without measuring it, and RAG evaluation has two components.
Retrieval quality is the easier half. Given a labelled dataset of queries with their correct supporting passages, you compute recall@k, the fraction of queries for which at least one correct passage is in the top-$k$. You can also compute mean reciprocal rank, normalised discounted cumulative gain or precision@k. The hard part is getting the labels: human annotation is expensive, and synthetic question generation by an LLM tends to produce questions whose vocabulary matches the source passage too closely, flattering the retriever.
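Given such a labelled set, the metrics are short functions. The sketch below assumes each example pairs a best-first list of retrieved passage ids with the set of ids judged relevant.

```python
def recall_at_k(examples, k):
    """examples: list of (retrieved_ids_best_first, set_of_relevant_ids) pairs."""
    hits = sum(1 for retrieved, relevant in examples
               if any(doc_id in relevant for doc_id in retrieved[:k]))
    return hits / len(examples)

def mean_reciprocal_rank(examples):
    """Average of 1/rank of the first relevant passage (contributes 0 if none retrieved)."""
    total = 0.0
    for retrieved, relevant in examples:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(examples)
```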
Generation quality is the harder half. You want to know whether the generated answer is correct, whether it is faithful to the retrieved passages (no hallucination beyond what the passages support), and whether it is complete. Faithfulness is now typically measured with an LLM-as-judge: a separate model is prompted to compare the answer to the retrieved passages and flag unsupported claims. Frameworks such as RAGAS automate this. The judge is itself imperfect, but it correlates well with human ratings and scales to the volumes needed for continuous evaluation.
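A minimal version of such a judge is sketched below. The prompt wording is illustrative, and `call_llm` is a stand-in for the judge model's API; this is the kind of check that frameworks like RAGAS wrap with considerably more care.

```python
# Illustrative judge prompt; call_llm is a placeholder for the judge model's API.
JUDGE_PROMPT = """You are checking a RAG answer for faithfulness.

Passages:
{passages}

Answer:
{answer}

List every claim in the answer that is NOT supported by the passages.
If every claim is supported, reply with exactly: SUPPORTED"""

def faithfulness_flags(answer, passages, call_llm):
    """Return a list of unsupported claims; an empty list means the answer is faithful."""
    prompt = JUDGE_PROMPT.format(passages="\n\n".join(passages), answer=answer)
    verdict = call_llm(prompt).strip()
    return [] if verdict == "SUPPORTED" else verdict.splitlines()
```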
A practical loop runs both metrics on a held-out test set after every change (new embedder, new chunk size, new reranker) and refuses to ship anything that regresses on faithfulness even if recall improves. The reason for that asymmetry is operational: a system that fails to find a relevant passage produces an unsatisfying answer, but a system that fabricates a citation produces a wrong answer that looks right, which is far more damaging.
Two further evaluation traps deserve mention. The first is contamination: if your evaluation queries leaked into the training data of the embedder or the generator, your numbers are optimistic. The second is distribution drift between evaluation and production. Synthetic queries generated from your own documents tend to be cleaner, better-spelt and more on-topic than real user queries, which include typos, half-formed thoughts and references to documents that do not exist. A small panel of real production traffic, sampled and labelled by hand, is worth more than a large synthetic test set.
What you should take away
- RAG attaches an external knowledge store to the model and retrieves relevant chunks at query time, addressing the staleness and domain-coverage problems of fixed training corpora.
- The standard pipeline is index, retrieve, generate: index documents as chunk vectors, embed the query, fetch the top-$k$ nearest chunks, and prompt the LLM with the chunks plus the query.
- The embedding model is the lever that matters most; domain-specialised embedders frequently beat larger general-purpose ones at a fraction of the cost.
- Hybrid search combines dense embeddings with sparse BM25 to capture both semantic similarity and exact-string matching, and a cross-encoder reranker on top of the initial retrieval is usually the cheapest way to improve answer quality.
- Evaluate retrieval and generation separately, and treat faithfulness (answers grounded in the retrieved passages) as a non-negotiable quality bar even when recall numbers look good.