How is ColBERT different from a regular bi-encoder?

A bi-encoder pools every document into a single vector and compares it to a single query vector. ColBERT stores one vector per token in the document and one vector per token in the query, then scores via MaxSim. The result is that token-level signal is preserved instead of averaged away.

MaxSim is ColBERT's late interaction scoring function. For each query token, it finds the document token with the highest cosine similarity and keeps that maximum. The sum of those per-query-token maxes is the document's relevance score.

ColBERT is slower than a bi-encoder but much faster than a cross-encoder. With the PLAID engine, ColBERTv2 returns results in roughly 38 ms on a single GPU and around 100 ms on 8 CPUs, even at scales of 140 million passages, which makes it viable in production.

Should I use ColBERT instead of OpenAI embeddings?

Maybe. If recall on hard, multi-aspect queries matters more than storage cost and operational simplicity, ColBERT often wins. The tradeoff is that it usually requires self-hosted infrastructure since the major embedding APIs are bi-encoders, and the index is larger.

ColBERTv2 is a 2022 follow-up that adds centroid-based vector compression and residual quantization on top of the original ColBERT design. It reduces the late interaction index footprint by 6x to 10x while maintaining state-of-the-art retrieval quality.

What is ColBERT? (Late Interaction Retrieval)

What ColBERT actually is

ColBERT, short for Contextualized Late Interaction over BERT, is a neural retrieval model introduced by Omar Khattab and Matei Zaharia in their 2020 SIGIR paper "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" out of Stanford. It belongs to a family of approaches called late interaction retrieval, and it sits between two more familiar architectures that most teams already use.

The first familiar architecture is the bi-encoder, also known as a single-vector dense retrieval model. A bi-encoder runs BERT (or a similar model) over the document, pools the token outputs into one fixed-size vector, and stores that vector. At query time it encodes the query the same way and ranks documents by cosine similarity between the query vector and each stored document vector. This is fast and cheap, but the pooling step throws away a lot of token-level signal.
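To make the pooling step concrete, here is a minimal sketch in plain Python. It assumes mean pooling (one common choice; some bi-encoders use the [CLS] vector instead) and uses hand-written 2-dimensional toy vectors rather than real model output. The helper names `mean_pool` and `cosine` are made up for this illustration.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[k] for vec in token_vectors) / count for k in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy token embeddings (assumption: real encoders emit hundreds of dimensions).
doc_tokens = [[1.0, 0.0], [0.0, 1.0]]
query_tokens = [[1.0, 0.0]]

doc_vec = mean_pool(doc_tokens)      # one vector for the whole document
query_vec = mean_pool(query_tokens)  # one vector for the whole query
score = cosine(query_vec, doc_vec)
print(doc_vec, round(score, 3))
```

In this toy case the query token is identical to the first document token, yet the pooled score comes out around 0.707 rather than 1.0, because the second document token has been averaged into the same vector.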

The second familiar architecture is the cross-encoder. A cross-encoder concatenates the query and the document and runs them through BERT together, so every query token can attend to every document token. The relevance score is highly accurate. The cost is brutal: the model has to re-encode every candidate document for every query, so cross-encoders are usually relegated to re-ranking the top 50 to 200 candidates.

ColBERT is the third way. It encodes documents independently from queries, like a bi-encoder, so document embeddings can be precomputed and indexed offline. But instead of pooling, it keeps the contextualized embedding of every token in the document. A 200-token passage produces 200 vectors. Then at query time it encodes the query into per-token vectors as well, and scores the document with a late interaction operator called MaxSim. The "late" part is the key: the query and document tokens interact, but only at scoring time, not during encoding.

How late interaction works (MaxSim explained)

MaxSim is the scoring function. For a query with tokens q_1 through q_m and a document with tokens d_1 through d_n, the relevance score is:

score(q, d) = Σ_i max_j (q_i · d_j)

In plain English: for each query token, find the document token it matches best (the one with the highest cosine similarity; ColBERT normalizes its token embeddings to unit length, so the dot product in the formula is the cosine similarity). Take that best-match score. Then add up the best-match scores across all query tokens. That sum is the document's relevance.
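The formula translates almost line for line into code. The sketch below is plain Python over hand-written unit-length toy vectors, not the ColBERT library's implementation; the names `dot` and `maxsim` are chosen for this example, and a real system would run the same computation as a batched matrix operation.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """For each query token keep its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length token vectors (assumption: real ones come from the encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]

score = maxsim(query, doc)
print(round(score, 2))
```

Here the first query token's best match is the second document token (similarity 1.0), the second query token's best match is the first document token (similarity 0.8), and the score is their sum.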

Two things are interesting about this. First, the maxes are local. A query token "refund" can match the document token "refund" with a score of 0.91, while another query token "policy" pairs best with "terms" at 0.62. Each query token finds its own anchor. Second, the operator is differentiable and embarrassingly parallel, so GPUs eat it for breakfast.

Two big follow-ups shaped how ColBERT is used in production. ColBERTv2 (Santhanam et al. 2022, NAACL) added centroid-based vector compression and residual quantization. Each document token vector is assigned to its nearest centroid, and only the residual (the difference between the vector and that centroid) is stored at low precision. The result is a 6x to 10x smaller index at similar accuracy. PLAID (also Santhanam et al. 2022) is the production engine that uses centroid pruning to skip low-scoring passages early, pushing ColBERTv2 to roughly 38 ms on a single GPU and around 100 ms on 8 CPUs even on 140 million passages.
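The centroid-plus-residual idea can be sketched in a few lines. This is a simplified illustration, not ColBERTv2's actual codec: the centroids are hand-picked rather than learned by clustering, the residual is quantized into uniform buckets over an assumed range of -0.5 to 0.5, and the function names are invented for the example.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(vec, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def compress(vec, centroids, bits=2, limit=0.5):
    """Return (centroid id, one small integer code per dimension)."""
    cid = nearest_centroid(vec, centroids)
    step = 2 * limit / (2 ** bits)  # width of one quantization bucket
    codes = []
    for x, c in zip(vec, centroids[cid]):
        # Clamp the residual into [-limit, limit) so the code fits in `bits` bits.
        residual = max(-limit, min(limit - 1e-9, x - c))
        codes.append(int((residual + limit) / step))
    return cid, codes

def decompress(cid, codes, centroids, bits=2, limit=0.5):
    """Rebuild an approximation: centroid plus the centre of each bucket."""
    step = 2 * limit / (2 ** bits)
    return [c + (-limit + (code + 0.5) * step)
            for c, code in zip(centroids[cid], codes)]

# Hand-picked toy centroids (assumption: real ones are learned from the corpus).
centroids = [[1.0, 0.0], [0.0, 1.0]]
vec = [0.9, 0.2]

cid, codes = compress(vec, centroids)
approx = decompress(cid, codes, centroids)
print(cid, codes, approx)
```

Only the centroid id and the small integer codes need to be stored per token. The reconstruction is off by at most half a bucket width in each dimension, which is the precision traded away for the smaller index.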

Why ColBERT matters for AI chatbots

For a support or knowledge chatbot, retrieval quality is everything. If the retriever misses the right passage, the language model cannot rescue the answer. The classic failure case for single-vector dense retrieval is multi-aspect queries: "how do I cancel my subscription if I am still inside the trial". A pooled embedding blurs "cancel", "subscription", and "trial" into one direction, and the closest stored document might match on "cancel" alone.

Late interaction handles this naturally. Each query token can fire on a different region of the document. That is why ColBERT-style models tend to win on harder, longer, more compositional queries, while bi-encoders are perfectly fine on short factoid lookups.
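A toy experiment shows the effect. Everything here is an assumption made for illustration: each word gets its own hand-built orthogonal vector (real contextual embeddings are never this tidy), the pooled model uses mean pooling, and the two documents are invented. One document only talks about cancelling; the other covers all three query aspects but is mostly filler.

```python
import math

def mean_pool(vectors):
    """Average per-token vectors into a single vector."""
    count = len(vectors)
    return [sum(vec[k] for vec in vectors) / count for k in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hand-built orthogonal "embeddings", one direction per word (toy assumption).
CANCEL, SUBSCRIPTION, TRIAL, FILLER = (
    [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
)

query = [CANCEL, SUBSCRIPTION, TRIAL]
doc_cancel_only = [CANCEL, CANCEL]
doc_full = [CANCEL, SUBSCRIPTION, TRIAL] + [FILLER] * 6

pooled_cancel = cosine(mean_pool(query), mean_pool(doc_cancel_only))
pooled_full = cosine(mean_pool(query), mean_pool(doc_full))
late_cancel = maxsim(query, doc_cancel_only)
late_full = maxsim(query, doc_full)

print("pooled :", round(pooled_cancel, 3), round(pooled_full, 3))
print("maxsim :", late_cancel, late_full)
```

With these vectors the pooled score prefers the cancel-only document, because the filler drags the fuller document's average away from the query. MaxSim prefers the document that covers all three aspects, since each query token finds its own match and the filler is simply ignored.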

The catch is storage. Storing one vector per token, even compressed, costs more than one vector per document. For a small FAQ index the difference is irrelevant. For tens of millions of long documents it matters. ColBERT also usually means self-hosting, because the popular OpenAI and Voyage embedding APIs are bi-encoders, not late interaction models. RAGatouille and the official Stanford NLP ColBERT repo are the most common starting points for adopters.
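A back-of-the-envelope calculation gives a feel for the gap. Every number below is an assumption for illustration: a hypothetical corpus of one million 200-token passages, a 768-dimension float32 bi-encoder, 128-dimension ColBERT token vectors at 16-bit precision, and a ColBERTv2-style layout of 2 bits per dimension plus a 4-byte centroid id. Index overhead such as the centroid table is ignored.

```python
# Hypothetical corpus (assumption, not a benchmark).
NUM_PASSAGES = 1_000_000
TOKENS_PER_PASSAGE = 200

# Assumed vector layouts.
BI_ENCODER_DIM = 768
BI_ENCODER_BYTES_PER_DIM = 4   # float32
COLBERT_DIM = 128
COLBERT_BYTES_PER_DIM = 2      # 16-bit floats
# 4-byte centroid id plus 2 bits per dimension for the residual.
COLBERTV2_BYTES_PER_VECTOR = 4 + COLBERT_DIM * 2 // 8

bi_encoder_bytes = NUM_PASSAGES * BI_ENCODER_DIM * BI_ENCODER_BYTES_PER_DIM
colbert_bytes = (NUM_PASSAGES * TOKENS_PER_PASSAGE
                 * COLBERT_DIM * COLBERT_BYTES_PER_DIM)
colbertv2_bytes = NUM_PASSAGES * TOKENS_PER_PASSAGE * COLBERTV2_BYTES_PER_VECTOR

for name, size in [
    ("bi-encoder", bi_encoder_bytes),
    ("ColBERT, uncompressed", colbert_bytes),
    ("ColBERTv2-style, compressed", colbertv2_bytes),
]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Under these assumptions the totals come to about 3.1 GB, 51.2 GB, and 7.2 GB. Compression shrinks the per-token index roughly 7x, but it still ends up larger than the single-vector index.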

ChatRaj uses bi-encoder dense retrieval today, paired with reranking for hard cases. ColBERT-style late interaction is a candidate upgrade for harder semantic queries where bi-encoder recall starts to slip and a full cross-encoder is too slow to run as the first stage.

ColBERT vs dense retrieval vs cross-encoders

The three architectures map cleanly to a tradeoff table:

Bi-encoder dense retrieval stores one vector per document. Query-time cost is one vector dot product per candidate. Accuracy is the lowest of the three. Storage is the cheapest.
ColBERT stores one vector per document token (compressed in v2). Query-time cost is a MaxSim sum across query and document tokens. Accuracy is close to a cross-encoder on hard queries. Storage is the highest among first-stage retrievers.
Cross-encoder jointly encodes query and document. There is no precomputed index, because the encoding depends on the query. Accuracy is the highest. Query-time cost is impractical for first-stage retrieval over millions of docs, so it is used to rerank a candidate set.

A common production pattern: bi-encoder or BM25 to fetch the top 1000, ColBERT to rerank to top 50, cross-encoder to rerank to top 5. ColBERT can also serve as the first stage retriever on its own, which is what PLAID enables.
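The cascade pattern is easy to sketch once each stage is treated as a scoring function. The code below is a hypothetical outline: `cascade` and `overlap` are names invented here, and a trivial word-overlap scorer stands in for all three stages so the example runs without any model. In practice each stage would call BM25 or a bi-encoder, a ColBERT scorer, and a cross-encoder respectively.

```python
def cascade(query, corpus, first_stage, late_interaction, cross_encoder,
            k1=1000, k2=50, k3=5):
    """Narrow the candidate set with progressively more expensive scorers."""
    def top_k(candidates, scorer, k):
        ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
        return ranked[:k]

    stage1 = top_k(corpus, first_stage, k1)       # cheap, wide net
    stage2 = top_k(stage1, late_interaction, k2)  # token-level rerank
    return top_k(stage2, cross_encoder, k3)       # most expensive, fewest docs

def overlap(query, doc):
    """Stand-in scorer (assumption): number of words shared by query and doc."""
    return len(set(query.split()) & set(doc.split()))

corpus = [
    "cancel subscription during trial",
    "cancel order",
    "shipping times",
    "trial length",
]

result = cascade("cancel trial", corpus, overlap, overlap, overlap,
                 k1=3, k2=2, k3=1)
print(result)
```

Each stage only scores what survived the previous one, which is why the most expensive model can be reserved for a handful of candidates.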

If you remember one thing: ColBERT trades index size for retrieval accuracy by keeping the per-token signal that bi-encoders pool away, then scoring against it cheaply via MaxSim at query time. That is the entire idea.
