Re-ranking is the standard production trick for high-quality RAG: retrieve broadly, then filter precisely. A fast first-stage retriever returns ~50–100 candidates by vector or BM25 search; a slower cross-encoder then rescores each candidate against the query and keeps the top 3–5.
Why two stages
A bi-encoder (the standard embedding model) encodes query and document independently:
$$\text{score}_{\text{bi}}(q, d) = \cos(\mathbf{e}(q), \mathbf{e}(d))$$
This is fast (you precompute $\mathbf{e}(d)$ for every document) but loses the ability to attend to query–document interactions.
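This asymmetry is what makes the first stage cheap: document embeddings are computed once offline, and only the query is embedded at request time. A minimal numpy sketch, where `embed` stands in for any embedding model call (a placeholder, not a real API):

import numpy as np

# `embed` is a placeholder for any embedding model call.
doc_embs = np.stack([embed(d) for d in docs])                # computed once, offline; shape (N, dim)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)  # normalize so dot product = cosine
q = embed(query)                                             # only this runs at request time
q /= np.linalg.norm(q)
scores = doc_embs @ q                                        # cosine similarity to every document
top_100 = np.argsort(scores)[::-1][:100]                     # candidate set for stage 2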
A cross-encoder concatenates query and document and runs them jointly through a transformer, attending across both:
$$\text{score}_{\text{cross}}(q, d) = f_\theta(\text{[CLS]}\; q\; \text{[SEP]}\; d)$$
Cross-encoders score 5–15 points higher on retrieval benchmarks but cost ~100× more per pair, so they are feasible only on a top-100 shortlist, not across millions of documents.
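A minimal sketch with the open sentence-transformers library; the checkpoint below is one widely used public MS MARCO cross-encoder, not the only choice, and the query and documents are hypothetical stand-ins for stage-1 output:

from sentence_transformers import CrossEncoder

# Hypothetical query and candidate texts; in practice these come from stage 1.
query = "how do cross-encoders differ from bi-encoders?"
docs = ["A bi-encoder encodes query and document separately.",
        "Cross-encoders attend jointly over the concatenated pair."]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, d) for d in docs])   # one joint forward pass per pair
ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)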
The two-stage pipeline
User query
↓
1. First-stage retrieval (bi-encoder vectors and/or BM25, top-100 candidates) ← fast, cheap
↓
2. Cross-encoder re-ranking of those 100 ← slow, accurate
↓
Top-5 chunks fed to the LLM
Major rerankers (2025)
| Model | Type | Notes |
|---|---|---|
| Cohere Rerank 3.5 | Hosted API | Strong default, multilingual |
| Voyage Rerank-2 | Hosted | Domain-tuned variants |
| BGE Reranker v2-m3 | Open weights | SOTA open, multilingual |
| Jina Reranker v2 | Open / hosted | 8k context |
| Mixedbread mxbai-rerank | Open weights | Strong English |
| MonoT5 | Classic | Generates "true"/"false" tokens (sketch after the table) |
| ColBERT v2 / ColPali | Late interaction | Token-level MaxSim scoring |
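MonoT5 works differently from the score-head cross-encoders above: it prompts a seq2seq model and reads relevance off the probability of generating "true". A hedged sketch with Hugging Face transformers and the public castorini/monot5-base-msmarco checkpoint:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

def monot5_score(query: str, doc: str) -> float:
    # MonoT5 prompt format: relevance is read from the first generated token
    inp = tok(f"Query: {query} Document: {doc} Relevant:",
              return_tensors="pt", truncation=True)
    out = model.generate(**inp, max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    logits = out.scores[0][0]            # vocabulary logits for the first token
    true_id = tok.encode("true")[0]      # id of the "true" token
    false_id = tok.encode("false")[0]    # id of the "false" token
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()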
Pseudocode
import cohere

co = cohere.Client()  # reads COHERE_API_KEY from the environment
# vector_db and embed are placeholders for your store and embedding model
candidates = vector_db.search(embed(query), top_k=100)   # stage 1: broad, cheap
response = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=[c.text for c in candidates], top_n=5)  # stage 2: precise rescoring
top_5 = [candidates[r.index] for r in response.results]  # results arrive sorted by score
Empirical impact
On benchmarks like BEIR, MS MARCO, and FiQA, adding a cross-encoder re-ranker on top of bi-encoder retrieval typically improves NDCG@10 by 8–20 points. On real-world enterprise RAG, that improvement in answer correctness is often the difference between unusable and deployable.
Late interaction (ColBERT)
A middle ground is late interaction models (ColBERT, ColBERTv2, ColPali). Instead of compressing the document to one vector or doing full cross-attention, they keep one vector per token and compute a MaxSim score:
$$\text{score}(q, d) = \sum_{i \in q} \max_{j \in d} \mathbf{e}_i(q) \cdot \mathbf{e}_j(d)$$
This is far cheaper than a cross-encoder because the per-token document vectors are precomputed and indexed offline; accuracy stays close to cross-encoder levels, and it works at full-corpus scale.
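A direct translation of the MaxSim formula, assuming the per-token embedding matrices are already computed and row-normalized:

import numpy as np

def maxsim(Q: np.ndarray, D: np.ndarray) -> float:
    # Q: (n_query_tokens, dim), D: (n_doc_tokens, dim), rows L2-normalized
    sims = Q @ D.T                        # all token-token similarities, (n_q, n_d)
    return float(sims.max(axis=1).sum())  # best doc token per query token, summed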
When not to rerank
- Latency-critical apps (<100 ms budget).
- Already-strong retrieval (rerank ROI is small).
- Very small candidate sets (<10 docs total).
Related terms: Retrieval-Augmented Generation, Agentic RAG, Vector Database, Embeddings APIs
Discussed in:
- Chapter 15: Modern AI