Cross-encoder

A cross-encoder is a transformer that takes a query and a document concatenated together as a single input and outputs one relevance score.

Bottom line
Because query tokens can attend directly to document tokens, a cross-encoder is highly accurate. The catch is speed: it needs one forward pass per query-document pair, so cross-encoders are used to rerank a small candidate set, not to search a corpus.

What a cross-encoder actually is

A cross-encoder is a single transformer model that scores how relevant a document is to a query by reading both at the same time. You feed it one sequence shaped like [CLS] query [SEP] document [SEP], run a forward pass, and read a single number off a regression head. That number is the relevance score.
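To make the input shape concrete, here is a minimal sketch of how a query and a document could be packed into one sequence. The function name build_pair_input and the whitespace tokenizer are assumptions for illustration; a real model uses its own subword tokenizer and vocabulary.

```python
def build_pair_input(query, document, max_tokens=512):
    """Pack a query and a document into one [CLS] q [SEP] d [SEP] sequence."""
    # Whitespace splitting stands in for a real subword tokenizer (assumption).
    q_tokens = query.lower().split()
    d_tokens = document.lower().split()

    # Reserve three slots for [CLS] and the two [SEP] markers, then truncate
    # the document side so the pair fits the model's input limit.
    budget = max_tokens - len(q_tokens) - 3
    d_tokens = d_tokens[:max(budget, 0)]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment ids mark which side each token belongs to: 0 = query, 1 = document.
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


tokens, segments = build_pair_input("cancel auto-renewal", "Pro plan downgrade steps")
print(tokens)
print(segments)
```

Truncating only the document side is one common choice, since the query is usually short and carries the intent. The transformer then runs over the whole list at once, which is what lets attention cross the [SEP] boundary.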

The "cross" in cross-encoder refers to attention. Every query token can attend to every document token, and vice versa, inside the same transformer stack. The model sees the pair jointly, so it can notice that "apple" in the query matches "fruit basket" in the document but not "Apple Inc." in a different document. That joint attention is what gives the architecture its accuracy edge over independent encoders.

Architecturally, cross-encoders are usually BERT-style encoder-only transformers with a regression head bolted on top. The popular open-source family is hosted under the cross-encoder/ namespace on Hugging Face, with checkpoints (many of them compact distilled models such as MiniLM) fine-tuned on MS MARCO query-passage pairs.

Cross-encoder vs bi-encoder vs late-interaction

Three architectures dominate modern neural retrieval. They differ in where and how query and document representations meet.

A bi-encoder runs query and document through the encoder separately and produces one vector per side. Similarity is a dot product or cosine. Document vectors can be precomputed and indexed in a vector database, so search is fast: one query encoding plus a nearest-neighbour lookup. The cost is accuracy. The model never sees the two sides together and has to compress everything that might matter into a single vector. This is the engine behind most dense retrieval systems and most embedding model APIs.
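A small sketch of the bi-encoder search path, assuming the document vectors were computed ahead of time. The three-dimensional vectors, the DOC_INDEX dictionary, and the search function are all made up for illustration; a real system would use an embedding model and an approximate nearest-neighbour index instead of a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical precomputed document vectors (the "index").
DOC_INDEX = {
    "auto-renewal billing FAQ": [0.9, 0.1, 0.2],
    "Pro plan downgrade steps": [0.6, 0.7, 0.1],
    "Webhook setup guide": [0.0, 0.2, 0.9],
}


def search(query_vector, index, top_k=2):
    """Rank documents by cosine similarity to an already-encoded query."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


# Stand-in for the one query encoding done at search time (assumed values).
query_vector = [0.8, 0.3, 0.1]
print(search(query_vector, DOC_INDEX))
```

Note that no document text is read at query time. Everything the model knows about a document has to already be in its vector, which is where the accuracy cost comes from.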

A cross-encoder runs query and document through the encoder together and produces a single relevance score. Nothing on the document side can be precomputed, because every score depends on the specific query, so retrieval over a million-document corpus would mean a million forward passes per query. Accuracy is high. Throughput is low.

A late-interaction model like ColBERT sits in the middle. It encodes query and document separately into per-token vectors, then computes a MaxSim score at query time: for each query token it takes the highest similarity against any document token, then sums those maxima. You get most of the cross-encoder accuracy with index-friendly precomputation, at the cost of a larger index.
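MaxSim is easier to see in code than in prose. The sketch below uses hand-written two-dimensional token vectors as stand-ins for real per-token embeddings, and assumes every document has at least one token vector; the function names are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy per-token vectors (assumed values, not real embeddings).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.7, 0.3]]              # matches only the first query token well

print(maxsim_score(query_vecs, doc_a))
print(maxsim_score(query_vecs, doc_b))
```

Here doc_a scores higher because each query token finds a close partner in it. The document token vectors are the part that gets precomputed and stored, which is why the index grows with the number of tokens rather than the number of documents.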

A common shorthand: bi-encoders are fast and dumb, cross-encoders are slow and smart, late-interaction is a compromise that buys part of the smart with a bigger index.

Why cross-encoders matter for AI chatbots

A retrieval-augmented chatbot lives or dies on what it puts in the LLM's context window. If the top three chunks are off-topic, the model either hallucinates a confident answer or politely refuses. A bi-encoder alone often misses the right chunk in the top three even when it finds it in the top fifty. The query "how do I cancel auto-renewal on the Pro plan" might pull in a chunk about "auto-renewal billing FAQ" but rank a chunk titled "Pro plan downgrade steps" lower, because cosine similarity in 384 dimensions is a blunt instrument.

A cross-encoder reads both sides together, so it can weigh the "cancel" verb, the "Pro" qualifier, and the procedural intent of the query against what each chunk actually says. It typically reorders those fifty candidates so the right answer surfaces at position one or two. On TREC DL 19, the ms-marco-MiniLM-L-6-v2 reranker reaches an nDCG@10 of around 74 at roughly 1800 documents per second on a V100. That throughput is fine for fifty candidates per query. It would be ruinous for a million.
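The arithmetic behind "fine for fifty, ruinous for a million" is short enough to write out. This sketch assumes the roughly 1800 documents per second figure holds steadily and counts model time only, ignoring tokenization, batching overhead, and network latency.

```python
def rerank_seconds(num_candidates, docs_per_second=1800):
    """Estimated scoring time at a fixed throughput (model time only)."""
    return num_candidates / docs_per_second


print(f"50 candidates: {rerank_seconds(50) * 1000:.0f} ms")                        # 28 ms
print(f"1,000,000 documents: {rerank_seconds(1_000_000) / 60:.1f} minutes")        # 9.3 minutes
```

About 28 ms per query disappears inside a chat response. More than nine minutes per query does not.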

ChatRaj's reranking option uses a cross-encoder pattern, scoring the top 50 hybrid-retrieved candidates and keeping the top 6 for the LLM. The hybrid retriever (BM25 plus a dense bi-encoder) is the wide funnel; the cross-encoder is the precision filter at the bottom.
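The retrieve-then-rerank shape can be sketched in a few lines. This is not ChatRaj's implementation: rerank and overlap_score are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real cross-encoder so the example runs without a model download.

```python
def rerank(query, candidates, score_fn, keep=6):
    """Score each (query, candidate) pair and return the best `keep` candidates."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Pretend these came back from the hybrid retriever (normally around 50 of them).
candidates = [
    "auto-renewal billing FAQ",
    "How to cancel auto-renewal on the Pro plan",
    "Webhook setup guide",
]
top = rerank("how do I cancel auto-renewal on the Pro plan", candidates, overlap_score, keep=2)
print(top)
```

The scorer is deliberately pluggable. With the sentence-transformers library, for example, one option would be to wrap CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict, which takes a list of (query, document) pairs and scores them in batches rather than one call per pair.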

When to use a cross-encoder (almost always: reranking)

The rule of thumb is simple. Use a cross-encoder when the candidate set is already small. Use a bi-encoder when it is not.

Concrete shapes:

  • Reranking the top K of a faster retriever. This is the default, and the reason so many production RAG stacks end up with one. The retriever returns 25 to 100 candidates, the cross-encoder rescores them, and the top 5 to 10 go to the LLM.
  • Pairwise question similarity at small scale. Deduplicating a hundred-question FAQ, or matching a user query against a fixed list of canned answers.
  • Last-mile relevance for high-stakes queries. Legal, medical, or compliance searches where missing the right paragraph is expensive and an extra 200 ms is not.
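The small-scale pairwise case from the list above can be sketched as an all-pairs comparison. The names find_duplicates and jaccard are illustrative, the 0.8 threshold is an arbitrary assumption, and the word-overlap scorer again stands in for a real cross-encoder.

```python
from itertools import combinations


def find_duplicates(questions, score_fn, threshold=0.8):
    """Return pairs of questions whose pairwise score meets the threshold."""
    pairs = []
    for a, b in combinations(questions, 2):
        if score_fn(a, b) >= threshold:
            pairs.append((a, b))
    return pairs


def jaccard(a, b):
    """Toy stand-in scorer: Jaccard overlap of the two word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


faq = [
    "How do I cancel my plan",
    "How do I cancel my plan today",
    "Where is my invoice",
]
print(find_duplicates(faq, jaccard))

# A hundred questions means 100 * 99 / 2 pairs to score.
print(len(list(combinations(range(100), 2))))  # 4950
```

The pair count grows quadratically, which is why this only works at small scale: a hundred questions is a few thousand forward passes, while a hundred thousand questions would be roughly five billion.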

Where a cross-encoder is the wrong tool: anywhere you would otherwise scan the full corpus. You cannot precompute scores, you cannot index them, and you cannot shard the work across query and document independently. The compute per query scales linearly with corpus size, which is exactly the cost that vector search exists to avoid.

Modern checkpoints worth knowing: cross-encoder/ms-marco-MiniLM-L-6-v2 for a fast, well-supported default; the BGE reranker family for stronger multilingual coverage; and the Cohere Rerank API for a managed cross-encoder behind an HTTP endpoint. The BEIR benchmark paper from Thakur and colleagues showed cross-encoder reranking achieving the highest empirical nDCG@10 across most tasks, with the well-known caveat about computational cost. That trade-off is the whole reason this architecture has one specific job in the pipeline rather than running the whole pipeline.

FAQ

Common cross-encoder questions

A cross-encoder is too slow to act as the first-stage retriever. It scores one query-document pair per forward pass and cannot precompute document vectors. Ranking a million-document corpus would mean a million forward passes per query, which is why cross-encoders sit downstream of a faster retriever instead of replacing it.
