What reranking actually is
Reranking is the second stage of a two-stage retrieval pipeline. The first stage casts a wide net using a fast model that scores millions of documents. The second stage takes the top 50 to 100 candidates and re-scores them with a slower, stronger model that reads the query and each document together. The reordered list is then truncated to the top 5 or 10 and passed to the LLM as context.
The split exists because no single model can do both jobs well. A model fast enough to score every document in a corpus is shallow. A model deep enough to judge relevance precisely is too slow to run on the whole corpus. Reranking is the engineering compromise: use the shallow model on everything, then use the deep model on a short candidate list.
Rerankers do not replace retrieval. They polish it. If the first stage misses a relevant document, no reranker can recover it. Reranking only helps when the right answer is already somewhere in the top-K.
The retrieve-then-rerank pipeline
The standard 2026 RAG stack looks like this:
- Stage one, retrieval. A bi-encoder embedding model (sentence-transformers, OpenAI text-embedding-3, BGE), optionally combined with BM25 in a hybrid setup, returns the top 50 to 100 candidates. Each document was embedded once at index time, so at query time only the query is embedded and matched against the index with a fast vector search.
- Stage two, reranking. A cross-encoder reads the query concatenated with each candidate document and outputs a single relevance score. This is much more accurate than dot-product on independent embeddings, because the model can attend to interactions between query terms and document terms.
- Stage three, generation. The top 5 to 10 reranked chunks are stuffed into the LLM prompt as context.
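The three stages above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real bi-encoder, `cross_score` stands in for a real cross-encoder, and the corpus, function names, and scoring weights are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real index would hold thousands of chunks.
DOCS = [
    "password reset link expires after one hour",
    "reset your router by holding the power button",
    "how to reset password from the login page",
    "billing invoices are sent monthly",
]

def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words vector built from one text alone."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Index time: every document is embedded once, before any query arrives.
INDEX = [(doc, embed(doc)) for doc in DOCS]

def retrieve(query, index, k):
    """Stage one: compare the query vector against the precomputed document vectors."""
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def cross_score(query, doc):
    """Toy stand-in for a cross-encoder: looks at query and document together.

    Shared words count once; query word pairs that appear in the same order
    in the document earn a bonus, which independent vectors cannot see.
    """
    q, d = query.lower().split(), doc.lower().split()
    shared_words = len(set(q) & set(d))
    shared_pairs = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    return shared_words + 2.0 * shared_pairs

def rerank(query, candidates, top_n):
    """Stage two: re-score only the short candidate list."""
    return sorted(candidates, key=lambda doc: cross_score(query, doc), reverse=True)[:top_n]

def build_prompt(query, chunks):
    """Stage three: place the surviving chunks into the LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

query = "reset password"
candidates = retrieve(query, INDEX, k=3)
top_chunks = rerank(query, candidates, top_n=2)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

In this toy run the first stage ranks the "link expires" chunk first, and the second stage promotes the chunk in which "reset password" appears as a phrase. The billing chunk never reaches the reranker at all, which is the point of the short candidate list.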
Why the split works: bi-encoders are fast at retrieval time because the document embeddings are precomputed. Cross-encoders cannot precompute anything, because the model needs to see the query and document together. Running a cross-encoder over a million documents at query time would take minutes. Running it over 100 candidates takes a few hundred milliseconds.
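A back-of-envelope calculation makes the gap concrete. The per-pair cost of 3 ms below is an assumed figure chosen for illustration, not a measurement; real numbers depend on the model, the hardware, and batching.

```python
# Assumed cost of scoring one (query, document) pair with a cross-encoder.
PER_PAIR_MS = 3.0

def rerank_latency_ms(num_docs, per_pair_ms=PER_PAIR_MS):
    """Sequential cross-encoder latency: one forward pass per document."""
    return num_docs * per_pair_ms

full_corpus_minutes = rerank_latency_ms(1_000_000) / 1000 / 60
shortlist_ms = rerank_latency_ms(100)

print(f"1,000,000 documents: {full_corpus_minutes:.0f} minutes")
print(f"100 candidates: {shortlist_ms:.0f} ms")
```

Under that assumption the full corpus takes 50 minutes per query, while the 100-candidate shortlist takes 300 ms.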
Some pipelines add late-interaction models like ColBERT as a middle stage, trading some of the cross-encoder's accuracy for much better latency. The pattern is the same: cheap models first, expensive models last.
Why reranking matters for AI chatbots
For a chatbot built on a knowledge base, retrieval quality is the ceiling on answer quality. If the right chunk does not reach the LLM, the model will either hallucinate or refuse. Reranking raises that ceiling.
The major rerankers in 2026:
- Cohere Rerank 3.5 is a managed API. Multilingual coverage across more than 100 languages. Latency typically 150 to 400 ms per call plus network. The default choice when you do not want to host anything.
- BGE reranker (BAAI) is open source under Apache 2.0. The bge-reranker-v2-m3 variant is the most popular multilingual option, with under 600 million parameters so it runs on a single consumer GPU. The large English variant scored 57.49 on BEIR in published benchmarks.
- Jina reranker v2 base multilingual is a 278M parameter cross-encoder supporting more than 100 languages, with a 1024 token context window. Jina's published numbers show it processing roughly 15x the document throughput of bge-reranker-v2-m3.
- LLM-as-judge reranking uses a small generative model like gpt-4o-mini or Claude Haiku to score candidates directly. It is simpler to bolt on, but slower and more expensive per query than a dedicated reranker.
The lift varies by domain. On BEIR and MTEB-style retrieval benchmarks, adding a reranker typically gains 5 to 15 NDCG@10 points on a 0 to 100 scale, which translates to a 10 to 30 percent relative lift when the bi-encoder baseline sits near 50. Specialist domains, noisy corpora, and multilingual collections tend to gain the most.
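For readers who want to measure this on their own data, here is a minimal sketch of NDCG@10 and relative lift. It uses the linear-gain form of DCG and builds the ideal ordering from the same result list, which assumes every relevant document was retrieved. The relevance lists and the baseline of 50 points are hypothetical values chosen for illustration.

```python
import math

def dcg_at_k(relevances, k=10):
    # Linear gain: each relevance grade is discounted by log2(position + 1).
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    # Simplification: the ideal ordering is built from the same list, so this
    # assumes every relevant document appears somewhere in the list.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

def relative_lift(baseline, reranked):
    return (reranked - baseline) / baseline

# Hypothetical relevance labels for one query, in ranked order (1 = relevant).
before = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]  # bi-encoder ordering
after = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # same documents after reranking

print(f"NDCG@10 before: {ndcg_at_k(before):.3f}")
print(f"NDCG@10 after:  {ndcg_at_k(after):.3f}")

# Absolute points versus relative lift, for an assumed baseline of 50 points.
print(f"+5 points:  {relative_lift(50, 55):.0%} relative lift")
print(f"+15 points: {relative_lift(50, 65):.0%} relative lift")
```

The same absolute gain is a larger relative lift on a weaker baseline, so it helps to report both numbers alongside the baseline score.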
Cost-benefit: when reranking is worth it
Reranking is not free. Every query pays the latency of the second stage, the cost of the API call or GPU cycle, and the engineering of a second component. The honest answer to "should you rerank" depends on your corpus and your latency budget.
You probably want reranking when:
- Your corpus is larger than 50,000 chunks and noisy.
- Users ask multi-part or comparative questions.
- You operate in multiple languages or specialist jargon.
- Your evals show retrieval recall is fine but precision is poor (the right chunk is in the top 50 but not the top 5).
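The last check in that list can be run with a very small eval script. The sketch below assumes a hypothetical eval set where each query has exactly one known relevant chunk id; the data and the function name are illustrative.

```python
def hit_rate_at_k(results, k):
    """Fraction of queries whose relevant chunk id appears in the top k results."""
    if not results:
        return 0.0
    hits = sum(1 for ranked_ids, relevant_id in results if relevant_id in ranked_ids[:k])
    return hits / len(results)

# Hypothetical eval set: (first-stage ranked chunk ids, id of the relevant chunk).
eval_results = [
    (list(range(50)), 0),   # relevant chunk ranked first
    (list(range(50)), 17),  # in the top 50, but not the top 5
    (list(range(50)), 32),  # in the top 50, but not the top 5
    (list(range(50)), 99),  # missed by retrieval entirely
]

hit_at_50 = hit_rate_at_k(eval_results, 50)
hit_at_5 = hit_rate_at_k(eval_results, 5)
print(f"hit rate @50: {hit_at_50:.2f}, hit rate @5: {hit_at_5:.2f}")
```

A large gap between the two numbers (0.75 versus 0.25 here) is the signal that a reranker has something to work with. The fourth query is out of reach for any reranker, because the relevant chunk never made the candidate list.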
You probably do not want reranking when:
- Your latency budget is under 500 ms end to end.
- Your corpus is a small FAQ of no more than a few thousand entries.
- You already use hybrid search and your top-5 precision is high.
- Your team cannot afford another component to monitor, version, and pay for.
Reranking is also not the same as Reciprocal Rank Fusion (RRF). RRF combines multiple rankings (for example, BM25 and dense) by summing the reciprocal of each document's rank position across the lists. It does not rescore anything; it ignores the underlying scores and merges rank positions only. Reranking computes new, stronger relevance scores from the query and document text.
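A minimal sketch of RRF shows how little it looks at. The function takes ranked lists of document ids only, never the text or the raw scores. The smoothing constant `k=60` is the value commonly used with RRF; the document ids and input lists are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids by summing 1 / (k + rank) for each document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of two first-stage retrievers for the same query.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists ("b" and "c") rise above documents that appear in only one, even though "a" was ranked first by BM25. No model reads the query or the documents at this step, which is why RRF is cheap and why it cannot fix a chunk that both retrievers ranked poorly.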
ChatRaj's hybrid search produces strong top-K candidates without a separate reranker for the workloads most ChatRaj sites face: a few hundred to a few thousand chunks, mostly English, predictable question shapes. A managed reranker is on the Tier-C roadmap for customers with larger or noisier corpora where evals show the precision gap is worth the added latency.
The mental model: retrieval gets the right shelf, reranking picks the right book on that shelf, the LLM reads the page. Skip reranking when the shelf is small. Add it when the shelf gets crowded.