Reranking & Hybrid Retrieval
Learn why two-stage retrieval and keyword+vector fusion improve relevance in real-world RAG systems.
What Is Reranking & Hybrid Retrieval?
Think of retrieval like hiring from a huge pile of resumes.
- First pass: a recruiter quickly scans thousands of resumes and picks a short list that looks promising.
- Second pass: a hiring manager reads only that short list carefully and picks the true top candidates.
That is the core idea of two-stage retrieval.
In many AI systems, the first pass is fast search over a large corpus. It often gets you “pretty relevant” results. But when your question is nuanced, “pretty relevant” is not enough. You need a second pass that understands the query-document pair at a deeper level. That second pass is reranking.
Now add hybrid retrieval: instead of relying on only keyword search or only vector search, you combine both. Keyword search (like BM25) is great for exact terms, product codes, and rare names. Vector search is great for semantic similarity and paraphrases. Together, they usually outperform either one alone.
Technical definition:
- Hybrid retrieval combines lexical and semantic retrieval signals to produce a better candidate set.
- Reranking applies a stronger relevance model to that candidate set and reorders it so the final top-k is actually useful.
Why Does It Matter?
If retrieval is weak, your assistant fails even when the generation model is strong. RAG systems do not hallucinate only because of “model weakness” - they also hallucinate when retrieval supplies weak context.
Reranking and hybrid retrieval matter because they improve:
- Answer quality: Better top documents mean better-grounded answers.
- Precision at top-k: The first few chunks matter most because context windows are limited.
- Robustness: Hybrid methods handle both exact lookups and fuzzy semantic questions.
- Trust: Users see fewer wrong citations and fewer “almost right” responses.
In production, this often has a measurable effect on business metrics: higher answer acceptance, fewer support escalations, and less manual correction.
How It Works
A practical pipeline is:

1. Build two retrieval channels
   - Lexical index (BM25 or equivalent) over tokenized text.
   - Vector index over embeddings.
2. Run both channels for each query
   - Lexical retrieval returns documents with exact token overlap.
   - Vector retrieval returns semantically similar chunks.
3. Fuse the candidate lists
   A common method is Reciprocal Rank Fusion (RRF). Intuition: a document is valuable if it ranks well in one or more lists. A simple version is:
   RRF(d) = sum_i 1 / (k + rank_i(d))
   where rank_i(d) is the rank of document d in list i, and k is a smoothing constant (often around 60).
4. Take a wider candidate set
   For example, keep the top 50 or top 100 fused candidates. This keeps recall high before expensive scoring.
5. Rerank with a stronger model
   Use a cross-encoder or instruction-tuned reranker that reads [query, document] jointly and outputs a relevance score.
   - Bi-encoders (embedding search) are fast because query and documents are encoded separately.
   - Cross-encoders are slower but more accurate because they compare tokens across query and document directly.
6. Select final context
   Keep the top N chunks after reranking, then apply context assembly rules (deduplicate, diversify sources, respect metadata filters).
7. Pass to the generation model
   The LLM now receives fewer but stronger chunks.
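The RRF formula above fits in a few lines of Python. This is a minimal sketch; the document IDs are illustrative:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first; a document earns 1 / (k + rank)
    from every list it appears in, and the contributions are summed.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative channel outputs (best-first).
bm25_top = ["rotation-policy", "migration-exception", "sandbox-onboarding"]
vector_top = ["migration-exception", "rotation-policy", "rollover-playbook"]
fused = rrf_fuse([bm25_top, vector_top])
```

Documents that rank well in both channels ("rotation-policy", "migration-exception") float to the top of the fused list, which is exactly the behavior the intuition describes.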
Simple example
Query: “How do I rotate API keys without downtime?”
- Keyword search finds docs with exact phrase “API key rotation” and “zero downtime”.
- Vector search also finds “credential rollover” and “grace-period token migration” docs.
- Fusion combines both sets.
- Reranker pushes documents that specifically discuss migration sequence and rollback checks to the top.
Result: the final context is not just related to security - it is specifically relevant to the safe rotation procedure.
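In production the reranking step would use a learned cross-encoder, but the mechanics can be shown with a toy scorer that reads each (query, document) pair jointly. The term-overlap and phrase-bonus features below are illustrative stand-ins for a real model's score:

```python
def toy_rerank(query, docs, top_n=3):
    """Reorder candidate docs by a joint (query, document) relevance score.

    A real system would replace `score` with a cross-encoder model; here
    we use term overlap plus an exact-phrase bonus so the sketch stays
    self-contained.
    """
    q_terms = set(query.lower().split())

    def score(doc):
        d_terms = set(doc.lower().split())
        overlap = len(q_terms & d_terms) / max(len(q_terms), 1)
        phrase_bonus = 1.0 if query.lower() in doc.lower() else 0.0
        return overlap + phrase_bonus

    return sorted(docs, key=score, reverse=True)[:top_n]

# Illustrative candidates from the fused list.
query = "rotate api keys without downtime"
docs = [
    "guide to rotate api keys without downtime in production",
    "credential rollover playbook for grace-period token migration",
    "api keys billing faq",
]
top = toy_rerank(query, docs, top_n=2)
```

Note what the bi-encoder stage cannot do here: only a scorer that sees query and document together can reward the exact procedural match over the merely topical ones.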
Key Terminology
- Lexical retrieval (BM25): Search that relies on token overlap and term statistics.
- Vector retrieval: Search in embedding space for semantically similar content.
- Hybrid retrieval: Combining lexical and vector signals in one retrieval stack.
- Reranker: A stronger model that reorders candidates by deeper relevance.
- RRF (Reciprocal Rank Fusion): Rank-based fusion method that combines multiple ranked lists.
Real-World Applications
- Customer support copilots: Blend exact policy lookups with semantic FAQ matching, then rerank for issue-specific passages.
- Enterprise search: Combine strict keyword constraints (legal terms, IDs) with semantic retrieval across wiki pages.
- E-commerce search: Match product codes exactly while still understanding intent like “lightweight trail shoes for wet weather.”
- Code assistants: Retrieve by symbol names and function signatures, then rerank by actual implementation relevance.
Common Misconceptions
- “Vector search makes keyword search obsolete.” Not true. Exact matching remains critical for identifiers, formulas, product SKUs, and compliance language.
- “Reranking is optional polish.” In many systems it is a major quality lever, especially when the top few retrieved chunks determine answer quality.
- “Hybrid retrieval is always too slow.” With practical candidate limits and batched reranking, latency is often acceptable for significant relevance gains.
Further Reading
- Cormack, Clarke, and Buettcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
- Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
- Nogueira and Cho (2019), Passage Re-ranking with BERT.
Interactive: Hybrid Retrieval Workbench
Blend lexical and semantic retrieval, inspect candidate fusion, then apply reranking to compare quality, latency, and rerank token cost.
Guided Scenario
Task: find the production policy window and safe migration steps for rotating API keys.
Goal: balance exact policy terms with operational migration guidance in one query.
Workbench controls:
- Channel blend: left favors BM25, right favors vector similarity.
- Candidate pool: the merged set before reranking; pool size affects Recall@k and nDCG@k directly.
- Reranker: uses transparent feature scoring over the fused candidates.
Metrics (current settings)
| Metric | Value | Notes |
|---|---|---|
| Recall@3 | 75% | 3 of 4 relevant docs found. |
| nDCG@3 | 1.000 | Ranking quality using graded relevance (0-3). |
| Estimated latency | 105 ms | Lexical + vector + fusion + optional rerank cost model. |
| Rerank token cost | 358 tok | Query + candidate tokens scored by the reranker. |
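Recall@k and nDCG@k with graded relevance can be computed as below. This is a minimal sketch; the relevance grades (0-3) are assumed to come from a labeled evaluation set:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k with graded relevance; `grades` maps doc id -> grade (0-3)."""
    def dcg(ids):
        return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(ids[:k]))
    ideal = sorted(grades, key=grades.get, reverse=True)
    best = dcg(ideal)
    return dcg(ranked_ids) / best if best > 0 else 0.0
```

A perfect ranking scores nDCG 1.000, as in the card above; putting a low-grade document first drags the score down even when recall is unchanged.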
Channel Retrieval
BM25 Top 5:
- Production API Key Rotation Policy
- Customer Migration Exception Procedure
- Sandbox API Key Onboarding
- Rotation Audit Evidence Checklist
- Credential Rollover Playbook
Vector Top 5:
- Customer Migration Exception Procedure
- Production API Key Rotation Policy
- Auth Incident Triage Runbook
- Credential Rollover Playbook
- Token Lifecycle Standard
Hybrid Fusion
Union from both channels, ranked by fusion score: 0.50*bm25Norm + 0.50*vectorNorm
1. Production API Key Rotation Policy
2. Customer Migration Exception Procedure
3. Rotation Audit Evidence Checklist
4. Sandbox API Key Onboarding
5. Auth Incident Triage Runbook
6. Credential Rollover Playbook
7. Token Lifecycle Standard
8. Legacy Freeze Communication
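The workbench's fusion score (0.50*bm25Norm + 0.50*vectorNorm) is a weighted sum of per-channel scores normalized to [0, 1]. A sketch, with made-up raw scores:

```python
def normalize(scores):
    """Min-max normalize a dict of raw scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fuse(bm25_scores, vector_scores, alpha=0.5):
    """Fused score: alpha*bm25Norm + (1-alpha)*vectorNorm.

    Docs missing from a channel contribute 0 for that channel.
    """
    b = normalize(bm25_scores)
    v = normalize(vector_scores)
    docs = set(b) | set(v)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Illustrative raw scores: BM25 values and cosine similarities live on
# different scales, which is why normalization comes first.
order = weighted_fuse({"policy": 12.0, "sandbox": 4.0},
                      {"policy": 0.9, "playbook": 0.7})
```

Unlike RRF, which uses only rank positions, this fusion keeps score magnitudes, so the alpha weight directly trades lexical precision against semantic coverage.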
Reranked Final
Final order uses the proxy reranker score (per-document feature scores shown):
1. Production API Key Rotation Policy (intent 0.36 | exact 1.00 | metadata 1.00 | contradiction -0.00)
2. Credential Rollover Playbook (intent 0.18 | exact 1.00 | metadata 1.00 | contradiction -0.00)
3. Customer Migration Exception Procedure (intent 0.27 | exact 0.00 | metadata 1.00 | contradiction -0.00)
Comparison modes
| Mode | Recall@3 | nDCG@3 | Latency | Rerank cost |
|---|---|---|---|---|
| BM25 only | 50% | 0.856 | 28 ms | 0 tok |
| Vector only | 50% | 0.714 | 35 ms | 0 tok |
| Hybrid | 75% | 0.904 | 69 ms | 0 tok |
| Hybrid + rerank | 75% | 1.000 | 105 ms | 358 tok |
Interpretation
- BM25 and vector recall are tied here; ranking quality will depend mostly on fusion and reranking.
- Reranking improved nDCG by 9.6 points, but added 36 ms of latency.
- A larger candidate pool improves recall headroom but increases rerank compute and token cost.
- The blend is balanced between lexical precision and semantic coverage.