Why not use a cross-encoder for retrieval directly?

Too slow. A cross-encoder scores one query-document pair per forward pass and cannot precompute document vectors. Ranking a million-document corpus would mean a million forward passes per query, which is why cross-encoders sit downstream of a faster retriever instead of replacing it.

What's the difference between a cross-encoder and a bi-encoder?

A bi-encoder encodes the query and the document independently into separate vectors, then compares them with cosine similarity or a dot product. A cross-encoder concatenates them into one sequence and runs a single transformer over the pair to produce one relevance score. Bi-encoders win on speed; cross-encoders win on accuracy.

Is ColBERT a cross-encoder?

No. ColBERT is a late-interaction model. It encodes query and document separately into per-token vectors, like a bi-encoder, but scores them at query time with MaxSim: each query token is matched to its most similar document token, and those maxima are summed. It is a deliberate third architecture sitting between bi-encoders and cross-encoders.

What model is used for the popular ms-marco-MiniLM cross-encoder?

A 6-layer MiniLM transformer, distilled from BERT and fine-tuned on MS MARCO query and passage pairs. The L-6-v2 checkpoint reaches around 74 NDCG@10 on TREC DL 19 at roughly 1800 documents per second on a V100 GPU, which is why it remains the default reranker for many RAG stacks.

Are commercial rerankers like Cohere Rerank cross-encoders?

Effectively yes. The Cohere Rerank API and similar managed services expose cross-encoder style scoring behind an HTTP endpoint. You send a query plus a candidate list, you receive a reranked list with relevance scores, and the underlying model reads each pair jointly.

What is a Cross-Encoder? (vs Bi-Encoder, Explained)

What a cross-encoder actually is

A cross-encoder is a single transformer model that scores how relevant a document is to a query by reading both at the same time. You feed it one sequence shaped like [CLS] query [SEP] document [SEP], run a forward pass, and read a single number off a regression head. That number is the relevance score.
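As a rough illustration of that input shape, the sketch below joins a query and a document into one token sequence. The whitespace tokenizer and the build_pair_input name are assumptions made for this example; real checkpoints use a subword tokenizer. The segment ids follow the BERT convention of 0 for the query side and 1 for the document side.

```python
def build_pair_input(query: str, document: str) -> tuple[list[str], list[int]]:
    """Join query and document into one [CLS] q [SEP] d [SEP] sequence.

    Hypothetical helper: whitespace splitting stands in for a real
    subword tokenizer.
    """
    q_tokens = query.lower().split()
    d_tokens = document.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("cancel auto-renewal", "Pro plan downgrade steps")
print(tokens)
print(segment_ids)
```

The transformer then attends over this whole sequence at once, and the relevance score is read from the head attached to the [CLS] position.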

The "cross" in cross-encoder refers to attention. Every query token can attend to every document token, and vice versa, inside the same transformer stack. The model sees the pair jointly, so it can recognize that "apple" in the query is related to "fruit basket" in one document but not to "Apple Inc." in another. That joint attention is what gives the architecture its accuracy edge over independent encoders.

Architecturally, cross-encoders are usually BERT-style encoder-only transformers with a regression head bolted on top. The popular open-source family is hosted under the cross-encoder/ namespace on Hugging Face, with checkpoints distilled from BERT and fine-tuned on MS MARCO query and passage pairs.

Cross-encoder vs bi-encoder vs late-interaction

Three architectures dominate modern neural retrieval. They differ in where and how query and document representations meet.

A bi-encoder runs query and document through the encoder separately and produces one vector per side. Similarity is a dot product or cosine. Document vectors can be precomputed and indexed in a vector database, so search is fast: one query encoding plus a nearest-neighbour lookup. The cost is accuracy. The model never sees the two sides together and has to compress everything that might matter into a single vector. This is the engine behind most dense retrieval systems and most embedding model APIs.
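The bi-encoder search path can be sketched in a few lines. This is a minimal illustration, not a production design: the three-dimensional vectors, the document ids, and the search function are made up for the example, and a full scan stands in for the approximate nearest-neighbour index a vector database would use.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k document ids whose vectors are most similar to query_vec."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]


# Toy vectors (assumed values). A real encoder would produce these
# offline, once per document, before any query arrives.
index = {
    "billing-faq": [0.9, 0.1, 0.0],
    "downgrade-steps": [0.7, 0.6, 0.1],
    "api-reference": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.5, 0.0]  # the only encoding done at query time
print(search(query_vec, index))
```

The point of the sketch is the split in work: document vectors exist before the query does, so the per-query cost is one encoding plus a similarity lookup.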

A cross-encoder runs query and document through the encoder together and produces a single relevance score. Document vectors cannot be precomputed because the score depends on the query, so retrieval over a million-document corpus would mean a million forward passes per query. Accuracy is high. Throughput is low.

A late-interaction model like ColBERT sits in the middle. It encodes query and document separately into per-token vectors, then computes a MaxSim score at query time: each query token takes its best match among the document tokens, and those maxima are summed. You get most of the cross-encoder accuracy with index-friendly precomputation, at the cost of a larger index.
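MaxSim itself is a small computation. The sketch below uses made-up two-dimensional token vectors and plain dot products; it illustrates the scoring rule only and is not ColBERT's actual implementation.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token vector, take its best dot product against any
    document token vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token vectors (assumed values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # has a good match for the first only

print(maxsim(query_tokens, doc_a))
print(maxsim(query_tokens, doc_b))
```

Because the document token vectors are computed ahead of time, only the max-and-sum step runs per query. Storing one vector per token rather than one per document is where the larger index comes from.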

A common shorthand: bi-encoders are fast and dumb, cross-encoders are slow and smart, late-interaction is a compromise that buys part of the smart with a bigger index.

Why cross-encoders matter for AI chatbots

A retrieval-augmented chatbot lives or dies on what it puts in the LLM's context window. If the top three chunks are off-topic, the model either hallucinates a confident answer or politely refuses. A bi-encoder alone often misses the right chunk in the top three even when it finds it in the top fifty. The query "how do I cancel auto-renewal on the Pro plan" might pull in a chunk about "auto-renewal billing FAQ" but rank a chunk titled "Pro plan downgrade steps" lower, because cosine similarity in 384 dimensions is a blunt instrument.

A cross-encoder reads both sides together and notices the missing "cancel" verb, the "Pro" qualifier, and the procedural intent. It typically reorders those fifty candidates so the right answer surfaces at position one or two. On TREC DL 19, the ms-marco-MiniLM-L-6-v2 reranker reaches around 74 NDCG@10 at roughly 1800 documents per second on a V100. That throughput is fine for fifty candidates per query. It would be ruinous for a million.
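A back-of-the-envelope calculation makes the gap concrete. It assumes the quoted throughput of roughly 1800 documents per second holds at any batch size and ignores tokenization and transfer overhead, so treat the numbers as orders of magnitude rather than benchmarks.

```python
DOCS_PER_SECOND = 1800  # throughput figure quoted above (V100)


def rerank_seconds(num_candidates: int, docs_per_second: float = DOCS_PER_SECOND) -> float:
    """Seconds needed to score num_candidates pairs at a fixed throughput."""
    return num_candidates / docs_per_second


print(f"50 candidates:        {rerank_seconds(50) * 1000:.0f} ms")
print(f"1,000,000 candidates: {rerank_seconds(1_000_000) / 60:.1f} min")
```

Under these assumptions, fifty candidates cost about 28 ms per query, while a million would take a little over nine minutes.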

ChatRaj's reranking option uses a cross-encoder pattern, scoring the top 50 candidates from hybrid retrieval and keeping the top 6 for the LLM. The hybrid retriever (BM25 plus a dense bi-encoder) is the wide funnel; the cross-encoder is the precision filter at the bottom.

When to use a cross-encoder (almost always: reranking)

The rule of thumb is simple. Use a cross-encoder when the candidate set is already small. Use a bi-encoder when it is not.

Concrete shapes:

Reranking the top K of a faster retriever. This is the default and the reason every production RAG stack ends up with one. The retriever returns 25 to 100 candidates, the cross-encoder rescores them, and the top 5 to 10 go to the LLM.
Pairwise question similarity at small scale. Deduplicating a hundred-question FAQ, or matching a user query against a fixed list of canned answers.
Last-mile relevance for high-stakes queries. Legal, medical, or compliance searches where missing the right paragraph is expensive and an extra 200 ms is not.
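The first shape, reranking the top K of a faster retriever, can be sketched as a small function that takes any pairwise scorer. The rerank and overlap_scorer names are hypothetical, and the word-overlap scorer is a stand-in so the example runs without a model; it is not a cross-encoder.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    keep: int = 5,
) -> list[tuple[str, float]]:
    """Rescore retriever candidates with a pairwise scorer and keep the best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one score per (query, document) pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:keep]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words
    that also appear in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / max(len(q_words), 1))
    return scores


candidates = [
    "auto-renewal billing FAQ",
    "Pro plan downgrade steps",
    "how to cancel auto-renewal on the Pro plan",
]
top = rerank("cancel auto-renewal on the Pro plan", candidates, overlap_scorer, keep=2)
print(top)
```

In a real pipeline the score_pairs argument would be a model call. With the sentence-transformers library, for example, the predict method of a CrossEncoder loaded from cross-encoder/ms-marco-MiniLM-L-6-v2 accepts a list of (query, document) pairs and should slot in here; that wiring is an assumption of this sketch and is not tested above.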

Where a cross-encoder is the wrong tool: anywhere you would otherwise scan the full corpus. You cannot precompute scores, you cannot index them, and you cannot encode queries and documents independently of each other. Compute per query scales linearly with corpus size, which is exactly the cost that vector search exists to avoid.

Modern checkpoints worth knowing: cross-encoder/ms-marco-MiniLM-L-6-v2 for a fast, well-supported default; the BGE reranker family for stronger multilingual coverage; and the Cohere Rerank API for a managed cross-encoder behind an HTTP endpoint. The BEIR benchmark paper from Thakur and colleagues showed cross-encoder reranking achieving the highest empirical NDCG@10 across most tasks, with the well-known caveat about computational cost. That trade-off is the whole reason this architecture has one specific job in the pipeline rather than handling retrieval end to end.

