Is reranking the same as RRF?

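No. Reciprocal Rank Fusion (RRF) merges existing rankings from multiple retrievers by summing each document's reciprocal rank across the lists. It never looks at the text again. Reranking discards the first-stage scores and computes new ones with a stronger model that reads the query and each document together.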

What is a typical reranker latency?

Managed APIs like Cohere Rerank land in the 150 to 400 ms range per call, plus network time. Self-hosted BGE or Jina rerankers on a small GPU can be faster, often under 100 ms for a top-50 batch.

Should I rerank if I already use hybrid search?

Maybe. Hybrid search raises top-K recall by combining keyword and vector signals. Reranking refines the order inside that top-K with a deeper model. The two stages compound. Try hybrid first, measure, then add reranking only if precision at top-5 is still weak.

What model architecture do rerankers use?

Most are cross-encoders: the query and document are concatenated and passed through a transformer that outputs a single relevance score. Some pipelines use ColBERT-style late interaction, and a few use small LLMs as judges.

How big a quality lift does reranking give?

Typically 5 to 15 NDCG@10 points on BEIR-style benchmarks, which is a 10 to 30 percent relative lift over the bi-encoder baseline. The gain is largest on noisy corpora, multilingual collections, and specialist domains, and smallest on small, clean FAQ datasets.

What is Reranking? (The 2nd Stage of Modern RAG)

What reranking actually is

Reranking is the second stage of a two-stage retrieval pipeline. The first stage casts a wide net, using a fast model to score millions of documents. The second stage takes the top 50 to 100 candidates and re-scores them with a slower, stronger model that reads the query and each document together. The reordered list is then truncated to the top 5 or 10 and passed to the LLM as context.

The split exists because no single model can do both jobs well. A model fast enough to score every document in a corpus is shallow. A model deep enough to judge relevance precisely is too slow to run on the whole corpus. Reranking is the engineering compromise: use the shallow model on everything, then use the deep model on a short candidate list.

Rerankers do not replace retrieval. They polish it. If the first stage misses a relevant document, no reranker can recover it. Reranking only helps when the right answer is already somewhere in the top-K.

The retrieve-then-rerank pipeline

The standard 2026 RAG stack looks like this:

Stage one, retrieval. A bi-encoder embedding model (sentence-transformers, OpenAI text-embedding-3, or BGE), optionally combined with BM25 in a hybrid setup, returns the top 50 to 100 candidates. Each document is embedded once at index time, so scoring at query time is a fast vector lookup.
Stage two, reranking. A cross-encoder reads the query concatenated with each candidate document and outputs a single relevance score. This is much more accurate than a dot product over independently computed embeddings, because the model can attend to interactions between query terms and document terms.
Stage three, generation. The top 5 to 10 reranked chunks are stuffed into the LLM prompt as context.
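The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `embed` and `cross_score` are toy stand-ins (a bag-of-words vector and a phrase-overlap heuristic) for a real bi-encoder and cross-encoder, and every function name here is hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def embed(text):
    # Toy stand-in for a bi-encoder: a bag-of-words count vector.
    # It is computed per text, with no knowledge of the other side.
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_score(query, doc):
    # Toy stand-in for a cross-encoder: it sees query and document together,
    # so it can reward query phrases that appear intact in the document.
    q, d = tokenize(query), tokenize(doc)
    if not q:
        return 0.0
    d_tokens, d_bigrams = set(d), set(zip(d, d[1:]))
    coverage = sum(t in d_tokens for t in q) / len(q)
    phrase_hits = sum(pair in d_bigrams for pair in zip(q, q[1:]))
    return coverage + phrase_hits

def retrieve(query, index, k=50):
    # Stage one: cheap similarity against precomputed document vectors.
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def rerank(query, candidates, top_n=5):
    # Stage two: expensive scoring, but only over the short candidate list.
    return sorted(candidates, key=lambda doc: cross_score(query, doc), reverse=True)[:top_n]

docs = [
    "reset password reset password reset password",
    "how to reset your password from the login page",
    "billing and invoices overview",
]
index = [(doc, embed(doc)) for doc in docs]   # built once, at index time

query = "reset your password"
candidates = retrieve(query, index, k=2)       # stage one
context = rerank(query, candidates, top_n=1)   # stage two; stage three would prompt the LLM with this
```

In this toy corpus the keyword-stuffed first document wins stage one on vector similarity, while stage two promotes the document that actually contains the query phrase.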

Why the split works: bi-encoders are fast at retrieval time because the document embeddings are precomputed. Cross-encoders cannot precompute anything, because the model needs to see the query and document together. Running a cross-encoder over a million documents at query time would take minutes. Running it over 100 candidates takes a few hundred milliseconds.
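The arithmetic behind that claim is simple to reproduce. The sketch below assumes a hypothetical cost of 3 ms per query-document pair, processed one after another with no batching or parallelism; real figures depend on the model, hardware, and document length.

```python
def cross_encoder_latency_s(num_docs, ms_per_pair=3.0):
    """Estimated wall time in seconds: one forward pass per (query, document) pair.

    ms_per_pair is an assumed figure for illustration, not a measured benchmark.
    """
    return num_docs * ms_per_pair / 1000.0

full_corpus_s = cross_encoder_latency_s(1_000_000)  # whole corpus at query time
shortlist_s = cross_encoder_latency_s(100)          # top 100 candidates only

print(f"1,000,000 docs: {full_corpus_s / 60:.0f} minutes")
print(f"100 candidates: {shortlist_s * 1000:.0f} ms")
```

Under that assumption the full corpus costs about 50 minutes per query and the shortlist about 300 ms, which is why the cross-encoder only ever sees the candidates.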

Some pipelines add late-interaction models like ColBERT as a middle stage, trading some of the cross-encoder's accuracy for much better latency. The pattern is the same: cheap models first, expensive models last.
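For intuition, ColBERT-style late interaction scores a document by matching each query token vector to its best document token vector and summing those maxima (often called MaxSim). The sketch below uses hand-written 2-dimensional vectors as placeholders; in a real system the token vectors come from the model, and the document side can be precomputed at index time.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token vector, take the highest
    dot product against any document token vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Placeholder token vectors, one per token (illustrative values only).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for both query tokens
doc_b = [[1.0, 0.0], [0.5, 0.5]]              # only a partial match for the second token

score_a = maxsim_score(query_vecs, doc_a)  # 1.0 + 1.0
score_b = maxsim_score(query_vecs, doc_b)  # 1.0 + 0.5
```

Because the query and document are still encoded separately, this sits between a single-vector bi-encoder and a full cross-encoder in both cost and accuracy.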

Why reranking matters for AI chatbots

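For a chatbot built on a knowledge base, retrieval quality is the ceiling on answer quality. If the right chunk does not reach the LLM, the model will either hallucinate or refuse. Reranking raises that ceiling by moving the right chunk from somewhere in the candidate list into the few slots that actually reach the prompt.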

The major rerankers in 2026:

Cohere Rerank 3.5 is a managed API with multilingual coverage across more than 100 languages. Latency is typically 150 to 400 ms per call, plus network time. It is the default choice when you do not want to host anything.
BGE reranker (BAAI) is open source under Apache 2.0. The bge-reranker-v2-m3 variant is the most popular multilingual option, with under 600 million parameters, so it runs on a single consumer GPU. The large English variant scored 57.49 on BEIR in published benchmarks.
Jina reranker v2 base multilingual is a 278M-parameter cross-encoder supporting more than 100 languages, with a 1024-token context window. Jina's published numbers show it processing roughly 15x the document throughput of bge-reranker-v2-m3.
LLM-as-judge reranking uses a small generation model like gpt-4o-mini or Claude Haiku to score candidates directly. It is simpler to bolt on, but slower and more expensive per query than a dedicated reranker.

The lift varies by domain. On BEIR and MTEB style retrieval benchmarks, adding a reranker typically gains 5 to 15 NDCG@10 points, which translates to a 10 to 30 percent relative lift over the bi-encoder baseline. Specialist domains, noisy corpora, and multilingual collections tend to gain the most.
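To make those numbers concrete, here is a small sketch of NDCG@10 and of the relative-lift arithmetic. It uses the linear-gain form of DCG and, as a simplification, computes the ideal ordering from the retrieved list itself rather than from every relevant document in the corpus. The baseline of 0.50 is an assumed value, chosen because a 5 to 15 point gain on a baseline near 50 is what yields 10 to 30 percent.

```python
import math

def dcg_at_k(relevances, k=10):
    # Linear gain, log2 position discount; rank starts at 1.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    # Simplification: the ideal ranking is the same list sorted by relevance.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

def relative_lift(baseline, reranked):
    return (reranked - baseline) / baseline

before = ndcg_at_k([0, 1])  # the one relevant chunk sits at rank 2
after = ndcg_at_k([1, 0])   # reranking moved it to rank 1

low = relative_lift(0.50, 0.55)   # +5 points on an assumed 0.50 baseline
high = relative_lift(0.50, 0.65)  # +15 points on the same baseline
```

Note that the same absolute gain is a smaller relative lift on a stronger baseline, so compare like with like when reading published numbers.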

Cost-benefit: when reranking is worth it

Reranking is not free. Every query pays the latency of the second stage, the cost of the API call or GPU cycle, and the engineering of a second component. The honest answer to "should you rerank" depends on your corpus and your latency budget.

You probably want reranking when:

Your corpus is larger than 50,000 chunks and noisy.
Users ask multi-part or comparative questions.
Your content spans multiple languages or relies on specialist jargon.
Your evals show retrieval recall is fine but precision is poor (the right chunk is in the top 50 but not the top 5).

You probably do not want reranking when:

Your latency budget is under 500 ms end to end.
Your corpus is a small FAQ with under a few thousand entries.
You already use hybrid search and your top-5 precision is high.
Your team cannot afford another component to monitor, version, and pay for.
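The eval-based criteria in the two lists above can be checked with a few lines of code. The sketch below assumes a hypothetical eval set of (ranked chunk ids, relevant chunk ids) pairs, one per query, and reports recall at 50 and at 5. A large gap between the two means the right chunks are being retrieved but ranked too low, which is the case reranking addresses. No pass or fail threshold is implied.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def rerank_headroom(eval_set, wide_k=50, narrow_k=5):
    """Return (mean recall@wide_k, mean recall@narrow_k, gap) over the eval set."""
    n = len(eval_set)
    wide = sum(recall_at_k(r, rel, wide_k) for r, rel in eval_set) / n
    narrow = sum(recall_at_k(r, rel, narrow_k) for r, rel in eval_set) / n
    return wide, narrow, wide - narrow

# Two illustrative queries: chunk ids are just integers in retrieval order.
eval_set = [
    (list(range(50)), {30}),  # right chunk retrieved, but at rank 31
    (list(range(50)), {2}),   # right chunk already at rank 3
]
wide, narrow, gap = rerank_headroom(eval_set)
```

If the wide recall is itself low, the fix belongs in the first stage (chunking, embeddings, hybrid search), since a reranker cannot surface chunks that were never retrieved.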

Reranking is also not the same as Reciprocal Rank Fusion. RRF combines multiple rankings (for example, BM25 + dense) by summing each document's reciprocal rank across the lists. It does not rescore anything; it only merges existing rankings and never reads the text again. Reranking replaces the first-stage ordering with new, stronger scores.
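A minimal sketch of RRF shows how little it does compared with a reranker. In the common formulation each list contributes 1 / (k + rank) per document, with k = 60 as the conventional smoothing constant; the function name and sample ids below are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each list adds 1 / (k + rank) to a document's score, with rank starting
    at 1. Only positions are used; document text is never read.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here "b" comes out on top because it ranks well in both lists. The fused list is a natural input to a reranker, not a substitute for one.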

ChatRaj's hybrid search produces strong top-K candidates without a separate reranker for the workloads most ChatRaj sites face: a few hundred to a few thousand chunks, mostly English, predictable question shapes. A managed reranker is on the Tier-C roadmap for customers with larger or noisier corpora where evals show the precision gap is worth the added latency.

The mental model: retrieval gets the right shelf, reranking picks the right book on that shelf, the LLM reads the page. Skip reranking when the shelf is small. Add it when the shelf gets crowded.

