Why is dense retrieval better than BM25?

It generalizes across paraphrases and synonyms, so semantically related queries match even when no keywords overlap. The tradeoff is weaker performance on rare exact match terms like SKU codes, where BM25 still beats it.

What model is used to embed text for dense retrieval?

Common choices are OpenAI text-embedding-3-small or large, Cohere Embed v3, BGE and E5 from open source, and proprietary in-house models. The right pick depends on language coverage, dimension budget, and self-hosting needs.

How much faster is dense retrieval than BM25?

It usually is not faster. At query time the two are comparable when the dense index uses HNSW, whose search cost grows roughly logarithmically with corpus size. Indexing is much heavier for dense retrieval because every chunk has to be embedded, whereas BM25 only has to tokenize and count terms.

Does dense retrieval require a GPU?

Not at query time when you call a hosted embeddings API over HTTP. For self-hosted models a GPU speeds up indexing of large corpora substantially, and is effectively required if you are embedding millions of documents.

What dimension should the embeddings have?

Modern models commonly produce 768, 1024, or 1536 dimensions, and text-embedding-3-large defaults to 3072. Some, like text-embedding-3-large, support Matryoshka representations, which let you truncate to a smaller dimension and trade a little quality for a lot of storage savings.
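As a rough illustration of that truncation step, the sketch below keeps the first few components of a vector and rescales the result to unit length so cosine and dot product scores stay comparable. The function name and the toy 8-dimensional vector are made up for this example, and the trick only preserves quality for models that were trained with Matryoshka representations.

```python
import math

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy vector for illustration only; real embeddings have hundreds
# or thousands of dimensions.
full = [0.6, 0.3, 0.1, 0.2, 0.05, 0.02, 0.01, 0.01]
short = truncate_and_renormalize(full, 4)

# Halving the dimension halves the raw storage per vector.
bytes_full = len(full) * 4    # assuming 4-byte floats
bytes_short = len(short) * 4
```

Hosted APIs that support this usually expose it as a request parameter, so you would only truncate by hand when working with vectors you have already stored.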

What is Dense Retrieval? (Embedding Based Search Explained)

What dense retrieval actually is

Dense retrieval is the search pattern where every piece of text, your documents and the user's question, gets compressed into a fixed length vector of floating point numbers. Matching is then a geometry problem: find the document vectors closest to the query vector in that high dimensional space. "Dense" refers to the encoding. Every dimension carries a small piece of meaning, so the vectors have very few zeros, in contrast to sparse retrieval where most dimensions are zero because each one tracks a specific vocabulary word.

The shift dense retrieval enabled is from lexical matching to semantic matching. A user typing "how do I get a refund" can match a help center chunk titled "return policy" because the two phrases sit near each other in vector space, even though they share zero keywords. Classical BM25 would miss that pairing unless the document also literally used the word refund.
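A toy comparison makes the contrast concrete. The sparse vectors below are real bag-of-words counts, while the two dense vectors are hand-written stand-ins for what an embedding model might emit, not output from any actual model.

```python
import math

def bag_of_words(text, vocab):
    """Sparse encoding: one dimension per vocabulary word, mostly zeros."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "how do i get a refund"
doc = "return policy"
vocab = sorted(set(query.split()) | set(doc.split()))

# Lexical view: the two texts share no words, so the score is zero.
lexical_score = cosine(bag_of_words(query, vocab), bag_of_words(doc, vocab))

# Semantic view: ASSUMED vectors, invented so that the two phrases
# sit close together, as a trained embedding model would place them.
dense_query = [0.62, -0.10, 0.48, 0.33]
dense_doc = [0.58, -0.05, 0.52, 0.29]
semantic_score = cosine(dense_query, dense_doc)
```

The point is the shape of the result rather than the numbers: the sparse score is exactly zero, while the dense score can be high even with no shared keywords.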

How dense retrieval works end to end

The pipeline has two phases, indexing and querying, and they mirror each other.

During indexing, you take each document chunk, usually a few hundred tokens, and pass it through an embedding model. The model emits a fixed-dimension vector, commonly 768, 1024, or 1536 dimensions depending on the model. You then write each vector, along with a pointer back to the original chunk, into a vector database. At scale, the database does not scan every stored vector linearly at query time. Instead it builds an approximate nearest neighbor index. The two structures you will see most often are HNSW, a layered proximity graph, and IVF, which partitions vectors into clusters and only searches the nearest few. HNSW gives roughly logarithmic search complexity in practice. It is the default vector index in Weaviate and one of the index types pgvector supports, alongside IVFFlat.
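The IVF idea is simple enough to sketch in a few lines. This is a toy version under stated assumptions: the centroids are hand-picked rather than learned with k-means, the vectors are two-dimensional, and the names build_ivf, ivf_search, and nprobe are illustrative rather than any particular database's API.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k=2, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]

vectors = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8],
           [0.7, 0.95], [0.15, 0.15], [0.85, 0.85]]
centroids = [[0.15, 0.15], [0.85, 0.85]]  # assumed; normally learned by k-means

lists = build_ivf(vectors, centroids)
hits = ivf_search([0.8, 0.8], vectors, centroids, lists, k=2, nprobe=1)
```

With nprobe=1 the search touches three of the six vectors. Raising nprobe scans more clusters, which trades speed for recall, and that is why the result is approximate rather than exact.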

At query time you embed the user's question using the same model, then ask the index for the top k nearest neighbors. Similarity is usually measured with cosine similarity or dot product, and for unit-normalized embeddings the two produce identical rankings. The database returns the k chunks with the highest similarity scores, your application passes them to the LLM as context, and the LLM writes an answer.
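Stripped of the index, the query step is just a scored sort. The sketch below does an exact brute-force scan, which is what an approximate index tries to imitate at lower cost. The chunk ids and vectors are invented for illustration, and top_k is a hypothetical helper, not a vector database call.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query_vec, index, k):
    """index is a list of (chunk_id, vector) pairs; returns the k best matches."""
    scored = sorted(((dot(query_vec, vec), chunk_id) for chunk_id, vec in index),
                    reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# ASSUMED toy vectors standing in for real embeddings.
raw_chunks = {
    "lost-shipment-claims": [0.9, 0.1, 0.3],
    "delivery-investigation": [0.8, 0.2, 0.4],
    "gift-cards": [0.1, 0.9, 0.2],
}
raw_query = [0.85, 0.15, 0.35]

index = [(cid, normalize(vec)) for cid, vec in raw_chunks.items()]
results = top_k(normalize(raw_query), index, k=2)
```

Because every vector is normalized before it goes into the index, the plain dot product used in top_k gives the same scores as cosine similarity on the raw vectors.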

The technique was popularized by the Dense Passage Retrieval paper, Karpukhin et al. 2020 at Facebook AI Research. They trained a dual BERT encoder, one tower for questions and one for passages, with in-batch negatives, and showed that the resulting dense retriever beat a strong Lucene BM25 baseline by nine to nineteen absolute points on top-20 retrieval accuracy across a battery of open-domain QA datasets. That result is the reason most modern retrieval-augmented generation stacks default to dense retrieval as a starting point.
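The in-batch negatives idea can be sketched without a deep learning framework. For each question, its own passage is the positive and every other passage in the batch is treated as a negative; the loss is the negative log-likelihood of the positive under a softmax over dot product scores. This is a plain Python illustration of the objective with made-up vectors, not the paper's training code, which would compute the same quantity with autograd over BERT outputs.

```python
import math

def in_batch_negative_loss(question_vecs, passage_vecs):
    """Mean negative log-likelihood of each question's own passage.

    passage_vecs[i] is the positive for question_vecs[i]; all other
    passages in the batch act as negatives for that question.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)

# ASSUMED toy batch of two question/passage pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[2.0, 0.0], [0.0, 2.0]]
shuffled_passages = [[0.0, 2.0], [2.0, 0.0]]

good = in_batch_negative_loss(questions, aligned_passages)
bad = in_batch_negative_loss(questions, shuffled_passages)
```

The loss is low when each question scores highest against its own passage and high when the pairing is wrong, which is the signal that pulls matching pairs together during training.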

Concrete example. Imagine a 200 page e-commerce help center. You chunk it into 800 passages, embed each one with text-embedding-3-small at 1536 dimensions, and load the vectors into pgvector with an HNSW index. A user asks "the package never showed up, what do I do." The query vector lands near chunks about "lost shipment claims" and "delivery investigation," neither of which contains the word package. The retriever returns those, the LLM grounds its answer in them, and the customer gets a useful response instead of "I do not have information about that."
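It helps to put a number on how small this index is. The sketch below assumes each dimension is stored as a 4-byte float and ignores per-row and index overhead, so treat it as a lower bound rather than what any database will report.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_float=4):
    """Raw storage for the vectors alone, excluding any index overhead."""
    return num_vectors * dims * bytes_per_float

size_bytes = raw_vector_bytes(800, 1536)   # 4,915,200 bytes
size_mib = size_bytes / (1024 * 1024)      # about 4.7 MiB
```

At this scale even a linear scan would be fast. The HNSW index starts to matter as the corpus grows by several orders of magnitude.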

Why dense retrieval matters for AI chatbots

Most chatbot questions are paraphrases of something already documented. Customers rarely use the exact wording from your help center. Dense retrieval is the mechanism that lets the bot recognize the paraphrase as a known question. Without it, recall collapses every time the user's vocabulary diverges from the documentation, which is most of the time.

There are real failure modes you have to plan around. Rare named entities like SKU codes, model numbers, and proper nouns often get poor representations because the embedding model has not seen them often enough during training. Out of domain terms behave similarly. And anything that depends on exact string matching, such as an order ID, a coupon code, or a phone number, will not reliably survive the round trip through an embedding. The vector is an approximation of meaning, and meaning blurs precise identifiers.

Cost is the other constraint. Dense retrieval is not free. You pay for embedding inference twice: once when you build the index, which is a per-token cost across the entire corpus, and again on every query. For self-hosted models a GPU drastically speeds up indexing a large corpus, but at query time the embedding call is fast enough that hosted APIs serve it over HTTP without GPU access on the application side.

ChatRaj pairs dense retrieval with BM25 in a hybrid search pipeline so neither blind spot dominates. The dense side supplies semantic recall, the sparse side supplies exact token precision, and reciprocal rank fusion merges the two ranked lists.
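Reciprocal rank fusion is short enough to show in full. The sketch below is a generic version of the technique, not ChatRaj's actual implementation: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k=60 is the value commonly used since the original RRF paper, and the document ids are invented for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# ASSUMED example rankings from a dense and a sparse retriever.
dense_ranking = ["lost-shipment", "delivery-investigation", "return-policy"]
sparse_ranking = ["order-12345-status", "lost-shipment", "return-policy"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF only looks at ranks, never at raw scores, so you do not have to calibrate cosine similarities against BM25 scores before merging. Documents that both retrievers like rise to the top.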

Dense retrieval vs sparse retrieval: when each wins

The two methods are not rivals so much as complements. They fail in opposite directions.

Sparse retrieval, BM25 and TF-IDF, wins on rare exact match terms, technical identifiers, person names, and any query where the user typed the same string that appears in the document. It also wins on cost. There is no embedding model in the loop, no GPU bill at indexing time, and the math is decades old and extremely well optimized.
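For comparison, here is a minimal BM25 scorer in plain Python. It is a sketch under simple assumptions: whitespace tokenization, the Lucene-style IDF variant, and the common defaults k1=1.2 and b=0.75. The three documents are invented, and a production engine would add stemming, stop words, and an inverted index.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "replacement filter for model XJ-400",
    "return policy for opened items",
    "lost shipment claims",
]
exact_scores = bm25_scores("XJ-400 filter", docs)
paraphrase_scores = bm25_scores("refund", docs)
```

The model number query scores only the document that contains that exact string, while the "refund" query scores nothing because no document uses the word. Those are the two behaviors this section describes.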

Dense retrieval wins on paraphrase, synonymy, and cross-lingual matching. It generalizes better when the user's vocabulary differs from the document's, which is the common case in customer support. It loses on rare tokens, on numerical reasoning, and on any query that hinges on a specific string the user typed.

The practical recommendation in production systems, and the one most retrieval research has converged on, is to run both and fuse the rankings. That is what hybrid search does, and it is why pure dense retrieval is rarely the final architecture in a serious RAG stack.
