Reranking

Bottom line
Reranking is a second-stage scoring pass that re-orders the top 50 to 100 candidates returned by initial retrieval. A stronger model, usually a cross-encoder, scores each query-document pair jointly, and the top 5 to 10 results are sent to the LLM. The typical lift is 10 to 30 percent relative NDCG@10, at a cost of roughly 100 to 300 ms per call.

What reranking actually is

Reranking is the second stage of a two-stage retrieval pipeline. The first stage casts a wide net using a fast model that scores millions of documents. The second stage takes the top 50 to 100 candidates and re-scores them with a slower, stronger model that reads the query and each document together. The reordered list is then truncated to the top 5 or 10 and passed to the LLM as context.

The split exists because no single model can do both jobs well. A model fast enough to score every document in a corpus is shallow. A model deep enough to judge relevance precisely is too slow to run on the whole corpus. Reranking is the engineering compromise: use the shallow model on everything, then use the deep model on a short candidate list.

Rerankers do not replace retrieval. They polish it. If the first stage misses a relevant document, no reranker can recover it. Reranking only helps when the right answer is already somewhere in the top-K.

The retrieve-then-rerank pipeline

The standard 2026 RAG stack looks like this:

  1. Stage one, retrieval. A bi-encoder embedding model (for example a sentence-transformers model, OpenAI text-embedding-3, or BGE), often combined with BM25 in a hybrid setup, returns the top 50 to 100 candidates. Each document was embedded once at index time, so scoring at query time is a fast nearest-neighbor search over precomputed vectors.
  2. Stage two, reranking. A cross-encoder reads the query concatenated with each candidate document and outputs a single relevance score. This is typically much more accurate than a dot product over independently computed embeddings, because the model can attend to interactions between query terms and document terms.
  3. Stage three, generation. The top 5 to 10 reranked chunks are stuffed into the LLM prompt as context.
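
The three stages can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than a production pipeline: `cheap_score` and `expensive_score` are hypothetical toy stand-ins for a bi-encoder and a cross-encoder, and the function names, corpus, and cutoffs are assumptions chosen for the example.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def _terms(text: str) -> set:
    """Lowercased whitespace tokens, as a set."""
    return set(text.lower().split())


def cheap_score(query: str, doc: str) -> float:
    """Toy stand-in for stage one: number of query terms found in the doc."""
    return float(len(_terms(query) & _terms(doc)))


def expensive_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: share of the doc's distinct terms
    that match the query, so focused docs beat incidental matches."""
    doc_terms = _terms(doc)
    if not doc_terms:
        return 0.0
    return len(_terms(query) & doc_terms) / len(doc_terms)


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Scorer = cheap_score,
    second_stage: Scorer = expensive_score,
    candidates_k: int = 100,
    final_k: int = 5,
) -> List[str]:
    # Stage one: score everything cheaply and keep a wide candidate list.
    candidates = sorted(
        corpus, key=lambda d: first_stage(query, d), reverse=True
    )[:candidates_k]
    # Stage two: re-score only the candidates with the stronger scorer.
    reranked = sorted(
        candidates, key=lambda d: second_stage(query, d), reverse=True
    )
    # Stage three input: truncate to what will be placed in the prompt.
    return reranked[:final_k]


corpus = [
    "reset your password from the account settings page",
    "password policy and account lockout and reset rules "
    "for every admin team member in the org",
    "billing and invoices",
    "how to reset a password",
]
context = retrieve_then_rerank("reset password", corpus, candidates_k=3, final_k=2)
```

In this toy run the first stage cannot tell the three password documents apart (each shares two terms with the query), while the second stage promotes the most focused one to the top.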

Why the split works: bi-encoders are fast at retrieval time because the document embeddings are precomputed. Cross-encoders cannot precompute anything, because the model needs to see the query and document together. Running a cross-encoder over a million documents at query time would take minutes. Running it over 100 candidates takes a few hundred milliseconds.
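
A back-of-envelope calculation makes the gap concrete. The 2 ms per query-document pair used below is an assumed, illustrative cost, not a measurement; real numbers depend on model size, hardware, document length, and batching.

```python
def cross_encoder_latency_s(num_docs: int, ms_per_pair: float = 2.0) -> float:
    """Estimated seconds to score num_docs query-document pairs one by one.

    ms_per_pair is an assumed cost per pair, not a benchmark result.
    """
    return num_docs * ms_per_pair / 1000.0


whole_corpus_s = cross_encoder_latency_s(1_000_000)  # about 2000 s, over half an hour
candidates_s = cross_encoder_latency_s(100)          # about 0.2 s
```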

Some pipelines add late-interaction models like ColBERT as a middle stage, trading some of the cross-encoder's accuracy for much better latency. The pattern is the same: cheap models first, expensive models last.
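
For intuition, late interaction can be sketched as a MaxSim score: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. Because the document token embeddings do not depend on the query, they can be computed at index time. The tiny hand-written vectors below are invented for illustration (a real ColBERT model produces the per-token embeddings), and the name `maxsim_score` is ours.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens, each with a made-up 2-dimensional embedding.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.6, 0.0]]              # covers only the first

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
```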

Why reranking matters for AI chatbots

For a chatbot built on a knowledge base, retrieval quality is the ceiling on answer quality. If the right chunk does not reach the LLM, the model will either hallucinate or refuse. Reranking raises that ceiling.

The major rerankers in 2026:

  • Cohere Rerank 3.5 is a managed API. Multilingual coverage across more than 100 languages. Latency typically 150 to 400 ms per call plus network. The default choice when you do not want to host anything.
  • BGE reranker (BAAI) is open source under Apache 2.0. The bge-reranker-v2-m3 variant is the most popular multilingual option, with under 600 million parameters, so it runs on a single consumer GPU. The large English variant scored 57.49 on BEIR in published benchmarks.
  • Jina reranker v2 base multilingual is a 278M parameter cross-encoder supporting more than 100 languages, with a 1024 token context window. Jina's published numbers show it processing roughly 15x the document throughput of bge-reranker-v2-m3.
  • LLM-as-judge reranking uses a small generation model like gpt-4o-mini or Claude Haiku to score candidates directly. Simpler to bolt on, but slower and more expensive per query than a dedicated reranker.

The lift varies by domain. On BEIR and MTEB-style retrieval benchmarks, adding a reranker typically gains 5 to 15 NDCG@10 points. Against a bi-encoder baseline of around 50 points, that is a 10 to 30 percent relative lift. Specialist domains, noisy corpora, and multilingual collections tend to gain the most.
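
To see how absolute points map to a relative lift, here is a minimal NDCG@k sketch together with the arithmetic. It uses the linear-gain form of DCG. The relevance labels are invented, and the baseline of 50 NDCG@10 points is an assumed figure picked so the numbers line up with the ranges quoted above.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_lift(baseline: float, improved: float) -> float:
    return (improved - baseline) / baseline


# Relevance labels in ranked order, before and after a hypothetical rerank.
before = [0, 0, 1, 0, 1]
after = [1, 1, 0, 0, 0]
ndcg_before = ndcg_at_k(before)
ndcg_after = ndcg_at_k(after)

# Assumed baseline of 50 points: +5 points is 10 percent, +15 is 30 percent.
low_lift = relative_lift(50.0, 55.0)
high_lift = relative_lift(50.0, 65.0)
```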

Cost-benefit: when reranking is worth it

Reranking is not free. Every query pays the latency of the second stage, the cost of the API call or GPU cycle, and the engineering of a second component. The honest answer to "should you rerank" depends on your corpus and your latency budget.

You probably want reranking when:

  • Your corpus is larger than 50,000 chunks and noisy.
  • Users ask multi-part or comparative questions.
  • You operate in multiple languages or specialist jargon.
  • Your evals show retrieval recall is fine but precision is poor (the right chunk is in the top 50 but not the top 5).

You probably do not want reranking when:

  • Your latency budget is under 500 ms end to end.
  • Your corpus is a small FAQ with under a few thousand entries.
  • You already use hybrid search and your top-5 precision is high.
  • Your team cannot afford another component to monitor, version, and pay for.
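
The recall-versus-precision criterion above is the one you can measure directly. A minimal sketch, assuming you have a labeled eval set with the ranked chunk IDs from your retriever and the set of IDs judged relevant for each query; the data shapes and chunk IDs here are illustrative assumptions.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k positions filled by relevant chunks."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant) / k


# One invented query: 50 retrieved chunk IDs, three of them relevant.
ranked = [f"c{i}" for i in range(50)]
relevant = {"c3", "c19", "c33"}

recall_50 = recall_at_k(ranked, relevant, 50)
precision_5 = precision_at_k(ranked, relevant, 5)
recall_5 = recall_at_k(ranked, relevant, 5)
```

High recall at 50 combined with low precision at 5, averaged over a realistic query set, suggests the right chunks are being retrieved but ranked too low, which is the gap a reranker is meant to close.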

Reranking is also not the same as Reciprocal Rank Fusion. RRF combines multiple rankings (for example, BM25 + dense) by summing reciprocal ranks. It does not rescore anything; it only merges the positions each retriever already assigned, without looking at the query or the document text. Reranking replaces the original scores with new, stronger ones.
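
For contrast, here is a minimal RRF sketch. It uses the common formulation in which each document earns 1 / (k + rank) from every list it appears in, with k = 60 as a conventional default. The document IDs are invented. Note that the function only sees positions; it never reads the query or the documents.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists by summing 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here `b` comes out ahead of `a` because it sits near the top of both lists, while `c` and `d` each appear in only one.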

ChatRaj's hybrid search produces strong top-K candidates without a separate reranker for the workloads most ChatRaj sites face: a few hundred to a few thousand chunks, mostly English, predictable question shapes. A managed reranker is on the Tier-C roadmap for customers with larger or noisier corpora where evals show the precision gap is worth the added latency.

The mental model: retrieval gets the right shelf, reranking picks the right book on that shelf, the LLM reads the page. Skip reranking when the shelf is small. Add it when the shelf gets crowded.

FAQ

Common reranking questions

Is reranking the same as Reciprocal Rank Fusion?

No. Reciprocal Rank Fusion combines existing rankings from multiple retrievers by summing reciprocal ranks. Reranking throws away the original scores and produces new ones with a stronger model.
