Reranking & Hybrid Retrieval
Learn why two-stage retrieval and keyword+vector fusion improve relevance in real-world RAG systems.
What Is Reranking & Hybrid Retrieval?
Think of retrieval like hiring from a huge pile of resumes.
- First pass: a recruiter quickly scans thousands of resumes and picks a short list that looks promising.
- Second pass: a hiring manager reads only that short list carefully and picks the true top candidates.
That is the core idea of two-stage retrieval.
In many AI systems, the first pass is fast search over a large corpus. It often gets you “pretty relevant” results. But when your question is nuanced, “pretty relevant” is not enough. You need a second pass that understands the query-document pair at a deeper level. That second pass is reranking.
Now add hybrid retrieval: instead of relying on only keyword search or only vector search, you combine both. Keyword search (like BM25) is great for exact terms, product codes, and rare names. Vector search is great for semantic similarity and paraphrases. Together, they usually outperform either one alone.
Technical definition:
- Hybrid retrieval combines lexical and semantic retrieval signals to produce a better candidate set.
- Reranking applies a stronger relevance model to that candidate set and reorders it so the final top-k is actually useful.
Why Does It Matter?
If retrieval is weak, your assistant fails even when the generation model is strong. RAG systems do not hallucinate only because of “model weakness” - they also hallucinate when retrieval supplies weak context.
Reranking and hybrid retrieval matter because they improve:
- Answer quality: Better top documents mean better-grounded answers.
- Precision at top-k: The first few chunks matter most because context windows are limited.
- Robustness: Hybrid methods handle both exact lookups and fuzzy semantic questions.
- Trust: Users see fewer wrong citations and fewer “almost right” responses.
In production, this often has a measurable effect on business metrics: higher answer acceptance, fewer support escalations, and less manual correction.
How It Works
A practical pipeline is:

1. Build two retrieval channels
   - Lexical index (BM25 or equivalent) over tokenized text.
   - Vector index over embeddings.
2. Run both channels for each query
   - Lexical retrieval returns documents with exact token overlap.
   - Vector retrieval returns semantically similar chunks.
3. Fuse the candidate lists
   A common method is Reciprocal Rank Fusion (RRF). Intuition: a document is valuable if it ranks well in one or more lists. A simple version is:
   RRF(d) = sum_i 1 / (k + rank_i(d))
   where rank_i(d) is the rank of document d in list i, and k is a smoothing constant (often around 60).
4. Take a wider candidate set
   For example, keep the top 50 or top 100 fused candidates. This keeps recall high before expensive scoring.
5. Rerank with a stronger model
   Use a cross-encoder or instruction-tuned reranker that reads [query, document] jointly and outputs a relevance score.
   - Bi-encoders (embedding search) are fast because query and documents are encoded separately.
   - Cross-encoders are slower but more accurate because they compare tokens across query and document directly.
6. Select final context
   Keep the top N chunks after reranking, then apply context assembly rules (deduplicate, diversify sources, respect metadata filters).
7. Pass to the generation model
   The LLM now receives fewer but stronger chunks.
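The RRF formula above fits in a few lines of Python. This is a minimal sketch; the document IDs are illustrative:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first; a document earns 1 / (k + rank)
    from every list it appears in, and the contributions are summed.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative channel outputs (best-first).
bm25_top = ["rotation-policy", "migration-exception", "sandbox-onboarding"]
vector_top = ["migration-exception", "rotation-policy", "rollover-playbook"]
fused = rrf_fuse([bm25_top, vector_top])
```

Documents that rank well in both channels ("rotation-policy", "migration-exception") float to the top of the fused list, which is exactly the behavior the intuition describes.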
Simple example
Query: “How do I rotate API keys without downtime?”
- Keyword search finds docs with exact phrase “API key rotation” and “zero downtime”.
- Vector search also finds “credential rollover” and “grace-period token migration” docs.
- Fusion combines both sets.
- Reranker pushes documents that specifically discuss migration sequence and rollback checks to the top.
Result: the final context is not just related to security - it is specifically relevant to the safe rotation procedure.
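In production the reranking step would use a learned cross-encoder, but the mechanics can be shown with a toy scorer that reads each (query, document) pair jointly. The term-overlap and phrase-bonus features below are illustrative stand-ins for a real model's score:

```python
def toy_rerank(query, docs, top_n=3):
    """Reorder candidate docs by a joint (query, document) relevance score.

    A real system would replace `score` with a cross-encoder model; here
    we use term overlap plus an exact-phrase bonus so the sketch stays
    self-contained.
    """
    q_terms = set(query.lower().split())

    def score(doc):
        d_terms = set(doc.lower().split())
        overlap = len(q_terms & d_terms) / max(len(q_terms), 1)
        phrase_bonus = 1.0 if query.lower() in doc.lower() else 0.0
        return overlap + phrase_bonus

    return sorted(docs, key=score, reverse=True)[:top_n]

# Illustrative candidates from the fused list.
query = "rotate api keys without downtime"
docs = [
    "guide to rotate api keys without downtime in production",
    "credential rollover playbook for grace-period token migration",
    "api keys billing faq",
]
top = toy_rerank(query, docs, top_n=2)
```

Note what the bi-encoder stage cannot do here: only a scorer that sees query and document together can reward the exact procedural match over the merely topical ones.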
Key Terminology
- Lexical retrieval (BM25): Search that relies on token overlap and term statistics.
- Vector retrieval: Search in embedding space for semantically similar content.
- Hybrid retrieval: Combining lexical and vector signals in one retrieval stack.
- Reranker: A stronger model that reorders candidates by deeper relevance.
- RRF (Reciprocal Rank Fusion): Rank-based fusion method that combines multiple ranked lists.
Real-World Applications
- Customer support copilots: Blend exact policy lookups with semantic FAQ matching, then rerank for issue-specific passages.
- Enterprise search: Combine strict keyword constraints (legal terms, IDs) with semantic retrieval across wiki pages.
- E-commerce search: Match product codes exactly while still understanding intent like “lightweight trail shoes for wet weather.”
- Code assistants: Retrieve by symbol names and function signatures, then rerank by actual implementation relevance.
Common Misconceptions
- “Vector search makes keyword search obsolete.” Not true. Exact matching remains critical for identifiers, formulas, product SKUs, and compliance language.
- “Reranking is optional polish.” In many systems it is a major quality lever, especially when the top few retrieved chunks determine answer quality.
- “Hybrid retrieval is always too slow.” With practical candidate limits and batched reranking, latency is often acceptable for significant relevance gains.
Further Reading
- Cormack, Clarke, and Buettcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
- Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
- Nogueira and Cho (2019), Passage Re-ranking with BERT.
Interactive: Hybrid Retrieval Workbench
Blend lexical and semantic retrieval, inspect candidate fusion, then apply reranking to compare quality, latency, and rerank token cost.
Guided Scenario
Task: find the production policy window and safe migration steps for rotating API keys.
Goal: balance exact policy terms with operational migration guidance in one query.
Workbench controls:
- Channel blend: left favors BM25, right favors vector similarity.
- Candidate pool: the merged set before reranking; pool size affects Recall@k and nDCG@k directly.
- Reranker: uses transparent feature scoring over the fused candidates.
Metrics (current settings)
| Metric | Value | Notes |
|---|---|---|
| Recall@3 | 75% | 3 of 4 relevant docs found. |
| nDCG@3 | 1.000 | Ranking quality using graded relevance (0-3). |
| Estimated latency | 105 ms | Lexical + vector + fusion + optional rerank cost model. |
| Rerank token cost | 358 tok | Query + candidate tokens scored by the reranker. |
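Recall@k and nDCG@k with graded relevance can be computed as below. This is a minimal sketch; the relevance grades (0-3) are assumed to come from a labeled evaluation set:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k with graded relevance; `grades` maps doc id -> grade (0-3)."""
    def dcg(ids):
        return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(ids[:k]))
    ideal = sorted(grades, key=grades.get, reverse=True)
    best = dcg(ideal)
    return dcg(ranked_ids) / best if best > 0 else 0.0
```

A perfect ranking scores nDCG 1.000, as in the card above; putting a low-grade document first drags the score down even when recall is unchanged.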
Channel Retrieval
BM25 Top 5:
- Production API Key Rotation Policy
- Customer Migration Exception Procedure
- Sandbox API Key Onboarding
- Rotation Audit Evidence Checklist
- Credential Rollover Playbook
Vector Top 5:
- Customer Migration Exception Procedure
- Production API Key Rotation Policy
- Auth Incident Triage Runbook
- Credential Rollover Playbook
- Token Lifecycle Standard
Hybrid Fusion
Union from both channels, ranked by fusion score: 0.50*bm25Norm + 0.50*vectorNorm
1. Production API Key Rotation Policy
2. Customer Migration Exception Procedure
3. Rotation Audit Evidence Checklist
4. Sandbox API Key Onboarding
5. Auth Incident Triage Runbook
6. Credential Rollover Playbook
7. Token Lifecycle Standard
8. Legacy Freeze Communication
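The workbench's fusion score (0.50*bm25Norm + 0.50*vectorNorm) is a weighted sum of per-channel scores normalized to [0, 1]. A sketch, with made-up raw scores:

```python
def normalize(scores):
    """Min-max normalize a dict of raw scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fuse(bm25_scores, vector_scores, alpha=0.5):
    """Fused score: alpha*bm25Norm + (1-alpha)*vectorNorm.

    Docs missing from a channel contribute 0 for that channel.
    """
    b = normalize(bm25_scores)
    v = normalize(vector_scores)
    docs = set(b) | set(v)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Illustrative raw scores: BM25 values and cosine similarities live on
# different scales, which is why normalization comes first.
order = weighted_fuse({"policy": 12.0, "sandbox": 4.0},
                      {"policy": 0.9, "playbook": 0.7})
```

Unlike RRF, which uses only rank positions, this fusion keeps score magnitudes, so the alpha weight directly trades lexical precision against semantic coverage.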
Reranked Final
Final order uses the proxy reranker score (per-document feature scores shown):
1. Production API Key Rotation Policy (intent 0.36 | exact 1.00 | metadata 1.00 | contradiction -0.00)
2. Credential Rollover Playbook (intent 0.18 | exact 1.00 | metadata 1.00 | contradiction -0.00)
3. Customer Migration Exception Procedure (intent 0.27 | exact 0.00 | metadata 1.00 | contradiction -0.00)
Comparison modes
| Mode | Recall@3 | nDCG@3 | Latency | Rerank cost |
|---|---|---|---|---|
| BM25 only | 50% | 0.856 | 28 ms | 0 tok |
| Vector only | 50% | 0.714 | 35 ms | 0 tok |
| Hybrid | 75% | 0.904 | 69 ms | 0 tok |
| Hybrid + rerank | 75% | 1.000 | 105 ms | 358 tok |
Interpretation
- BM25 and vector recall are tied here; ranking quality will depend mostly on fusion and reranking.
- Reranking improved nDCG by 9.6 points, but added 36 ms of latency.
- A larger candidate pool improves recall headroom but increases rerank compute and token cost.
- The blend is balanced between lexical precision and semantic coverage.