Re-ranking is the standard production trick for high-quality RAG: retrieve broadly, then filter precisely. A fast first-stage retriever returns ~50–100 candidates by vector or BM25 search; a slower cross-encoder then rescores each candidate against the query and keeps the top 3–5.
Why two stages
A bi-encoder (the standard embedding model) encodes query and document independently:
$$\text{score}_{\text{bi}}(q, d) = \cos(\mathbf{e}(q), \mathbf{e}(d))$$
This is fast (you precompute $\mathbf{e}(d)$ for every document) but loses the ability to attend to query–document interactions.
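This asymmetry is what makes the first stage cheap: document embeddings are computed once offline, and only the query is embedded at request time. A minimal numpy sketch, where `embed` stands in for any embedding model call (a placeholder, not a real API):

import numpy as np

# `embed` is a placeholder for any embedding model call.
doc_embs = np.stack([embed(d) for d in docs])                # computed once, offline; shape (N, dim)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)  # normalize so dot product = cosine
q = embed(query)                                             # only this runs at request time
q /= np.linalg.norm(q)
scores = doc_embs @ q                                        # cosine similarity to every document
top_100 = np.argsort(scores)[::-1][:100]                     # candidate set for stage 2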
A cross-encoder concatenates query and document and runs them jointly through a transformer, attending across both:
$$\text{score}_{\text{cross}}(q, d) = f_\theta(\text{[CLS]}\; q\; \text{[SEP]}\; d)$$
Cross-encoders score 5–15 points higher on retrieval benchmarks but cost ~100× more per pair, so they are feasible only on a top-100 shortlist, not across millions of documents.
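A minimal sketch with the open sentence-transformers library; the checkpoint below is one widely used public MS MARCO cross-encoder, not the only choice, and the query and documents are hypothetical stand-ins for stage-1 output:

from sentence_transformers import CrossEncoder

# Hypothetical query and candidate texts; in practice these come from stage 1.
query = "how do cross-encoders differ from bi-encoders?"
docs = ["A bi-encoder encodes query and document separately.",
        "Cross-encoders attend jointly over the concatenated pair."]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, d) for d in docs])   # one joint forward pass per pair
ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)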
The two-stage pipeline
User query
↓
1. First-stage retrieval (bi-encoder vectors and/or BM25, top-100 candidates) ← fast, cheap
↓
2. Cross-encoder re-ranking of those 100 ← slow, accurate
↓
Top-5 chunks fed to the LLM
Major rerankers (2025)
| Model | Type | Notes |
|---|---|---|
| Cohere Rerank 3.5 | Hosted API | Strong default, multilingual |
| Voyage Rerank-2 | Hosted | Domain-tuned variants |
| BGE Reranker v2-m3 | Open weights | SOTA open, multilingual |
| Jina Reranker v2 | Open / hosted | 8k context |
| Mixedbread mxbai-rerank | Open weights | Strong English |
| MonoT5 | Classic | Generates "true"/"false" tokens (sketch after the table) |
| ColBERT v2 / ColPali | Late interaction | Token-level MaxSim scoring |
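MonoT5 works differently from the score-head cross-encoders above: it prompts a seq2seq model and reads relevance off the probability of generating "true". A hedged sketch with Hugging Face transformers and the public castorini/monot5-base-msmarco checkpoint:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

def monot5_score(query: str, doc: str) -> float:
    # MonoT5 prompt format: relevance is read from the first generated token
    inp = tok(f"Query: {query} Document: {doc} Relevant:",
              return_tensors="pt", truncation=True)
    out = model.generate(**inp, max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    logits = out.scores[0][0]            # vocabulary logits for the first token
    true_id = tok.encode("true")[0]      # id of the "true" token
    false_id = tok.encode("false")[0]    # id of the "false" token
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()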
Pseudocode
import cohere

co = cohere.Client()  # reads COHERE_API_KEY from the environment
# vector_db and embed are placeholders for your store and embedding model
candidates = vector_db.search(embed(query), top_k=100)   # stage 1: broad, cheap
response = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=[c.text for c in candidates], top_n=5)  # stage 2: precise rescoring
top_5 = [candidates[r.index] for r in response.results]  # results arrive sorted by score
Empirical impact
On benchmarks like BEIR, MS MARCO, and FiQA, adding a cross-encoder re-ranker on top of bi-encoder retrieval typically improves NDCG@10 by 8–20 points. On real-world enterprise RAG, that improvement in answer correctness is often the difference between unusable and deployable.
Late interaction (ColBERT)
A middle ground is late interaction models (ColBERT, ColBERTv2, ColPali). Instead of compressing the document to one vector or doing full cross-attention, they keep one vector per token and compute a MaxSim score:
$$\text{score}(q, d) = \sum_{i \in q} \max_{j \in d} \mathbf{e}_i(q) \cdot \mathbf{e}_j(d)$$
This is far cheaper than a cross-encoder because the per-token document vectors are precomputed and indexed offline; accuracy stays close to cross-encoder levels, and it works at full-corpus scale.
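A direct translation of the MaxSim formula, assuming the per-token embedding matrices are already computed and row-normalized:

import numpy as np

def maxsim(Q: np.ndarray, D: np.ndarray) -> float:
    # Q: (n_query_tokens, dim), D: (n_doc_tokens, dim), rows L2-normalized
    sims = Q @ D.T                        # all token-token similarities, (n_q, n_d)
    return float(sims.max(axis=1).sum())  # best doc token per query token, summed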
When not to rerank
- Latency-critical apps (<100 ms budget).
- Already-strong retrieval (rerank ROI is small).
- Very small candidate sets (<10 docs total).
Related terms: Retrieval-Augmented Generation, Agentic RAG, Vector Database, Embeddings APIs
Discussed in:
- Chapter 15: Modern AI