
BM25

BM25 is a ranking function that scores how relevant a document is to a keyword query.

Bottom line
BM25 combines term frequency with diminishing returns, inverse document frequency for rare-word weighting, and length normalization. Born from 1990s TREC experiments, it remains the default scorer in Lucene, Elasticsearch, OpenSearch, and Solr.

What BM25 actually is

BM25, short for Okapi BM25 (the "BM" stands for Best Matching, and 25 was the iteration number that finally worked), is a probabilistic ranking function that scores documents against a keyword query. It was developed by Stephen Robertson, Karen Spärck Jones, and collaborators at City University London during the TREC information retrieval experiments of the 1990s. The canonical reference is Robertson and Zaragoza's 2009 monograph, "The Probabilistic Relevance Framework: BM25 and Beyond."

Despite being roughly three decades old, BM25 is still the default ranking function in Apache Lucene, Elasticsearch (since version 5.0 in 2016), OpenSearch, and Solr. Postgres is the exception: its built-in ts_rank function is a different scorer, and BM25 is available there only through extensions. Most production search systems today either use BM25 directly or treat it as the baseline a new model has to beat.

The reason it endures is simple. BM25 captures three things that matter for keyword relevance, and it does so with a closed-form formula that runs in microseconds per document.

How BM25 scoring works

For a query Q against a document D, the BM25 score is a sum over each query term t:

score(D, Q) = sum over t in Q of  IDF(t) * (f(t, D) * (k1 + 1)) / (f(t, D) + k1 * (1 - b + b * |D| / avgdl))

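Here |D| is the length of document D in words and avgdl is the average document length across the corpus. Three ideas are baked in: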

  1. Term frequency with saturation. f(t, D) is how often term t appears in D. The k1 parameter (1.2 by default in Elasticsearch and Lucene) controls how fast the contribution of repeated terms saturates. A document that mentions a word 10 times is more relevant than one that mentions it once, but not 10x more, and by 50 mentions the curve is nearly flat. This is the famous diminishing returns curve.
  2. Inverse document frequency. IDF(t) downweights terms that appear in many documents (like "the" or "and") and upweights rare terms (like "BM25" or a specific SKU).
  3. Length normalization. The b parameter (commonly 0.75) penalizes long documents so a 10,000-word page does not automatically beat a focused 500-word answer just by accumulating term occurrences.
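
The three ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular engine: the function name bm25_scores, the toy documents, and the whitespace tokenization are made up for the example, and the IDF variant is the Lucene-style one that stays non-negative, which is an assumption since the formula above leaves IDF unspecified.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: above 1 for long documents, below 1 for short ones.
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for t in query_terms:
            f = tf[t]
            if f == 0:
                continue
            # Assumed IDF variant (Lucene-style); rare terms get larger weights.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Saturating term-frequency factor, bounded above by k1 + 1.
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "espresso machine cleaning guide".split(),
    "how to brew filter coffee at home".split(),
    "espresso espresso grind size for espresso".split(),
]
scores = bm25_scores(["espresso"], docs)
```

The second document never mentions the query term, so it scores zero. The third mentions it three times and outranks the first, but by much less than a factor of three.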

A concrete example

Imagine a three-document corpus about coffee:

  • D1 (50 words): mentions "espresso" twice.
  • D2 (50 words): mentions "espresso" ten times.
  • D3 (5,000 words): mentions "espresso" twenty times.

A naive TF-IDF score would rank D3 highest because it has the most raw occurrences. But intuitively D2 is the most "about espresso" document. BM25 corrects for both raw counts and document length. The saturation curve means D2's ten mentions are worth almost as much as D3's twenty, and the length normalization term (b * |D| / avgdl) heavily penalizes D3 for being a 5,000-word page where "espresso" is diluted. With the common defaults (k1 = 1.2, b = 0.75), BM25 ranks D2 first, while D3's twenty mentions score only slightly above D1's two. The focused document comes out on top, which matches human judgment.
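
The arithmetic for this corpus is easy to check by hand or in code. The sketch below assumes k1 = 1.2 and b = 0.75 and computes only the term-frequency and length part of the formula, since "espresso" appears in all three documents and its IDF is therefore the same constant for each. The helper name tf_component is hypothetical.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency and length-normalization part of BM25 for one term."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))

# name -> (occurrences of "espresso", document length in words)
corpus = {"D1": (2, 50), "D2": (10, 50), "D3": (20, 5000)}
avgdl = sum(length for _, length in corpus.values()) / len(corpus)  # 1700.0

components = {
    name: tf_component(f, length, avgdl) for name, (f, length) in corpus.items()
}
ranking = sorted(components, key=components.get, reverse=True)
print(ranking)
```

Under these assumptions the components come out to roughly 2.13 for D2, 1.92 for D3, and 1.89 for D1. D3 has ten times D1's raw count but lands almost level with it.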

This is the core reason BM25 replaced classical TF-IDF as the default similarity in Lucene 6.0. The math is barely more complex, but the rankings are noticeably better.

Why BM25 matters for AI chatbots

Modern retrieval-augmented generation systems often default to dense vector search using embeddings, but production teams quickly learn that pure vector retrieval has blind spots. Embeddings are great at paraphrase and semantic similarity, and they are terrible at exact keyword recall.

A user typing "error code E_AUTH_401" expects the chatbot to retrieve the doc page that literally contains "E_AUTH_401." An embedding model may map that string to a generic "authentication error" cluster and miss the exact page. BM25, on the other hand, treats E_AUTH_401 as a rare term with high IDF and a strong exact-match signal. It nails the lookup.

That is why ChatRaj's hybrid search runs BM25 alongside dense vector retrieval, then fuses the two ranked lists with Reciprocal Rank Fusion. BM25 catches SKUs, error codes, product names, version numbers, and rare technical terms. Dense retrieval catches paraphrases and conceptual queries. Together they cover both ends of the query distribution.
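
Reciprocal Rank Fusion itself is small enough to sketch. The version below is a generic illustration, not ChatRaj's implementation: the document IDs are invented, and k = 60 is the constant commonly used in the RRF literature, taken here as an assumption. Each document receives 1 / (k + rank) from every list it appears in, and the sums decide the final order.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; the highest fused score comes first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical result lists from the two retrievers, best hit first.
bm25_hits = ["doc_error_codes", "doc_auth_guide", "doc_changelog"]
dense_hits = ["doc_auth_guide", "doc_login_faq", "doc_error_codes"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it never has to reconcile BM25 scores with cosine similarities, which live on unrelated scales. Documents that both retrievers rank well rise to the top.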

BM25 also has practical operational virtues for chatbots. It is deterministic, debuggable, and cheap. You can explain to a customer exactly why a document ranked where it did. You cannot do that with a 1,024-dimensional embedding distance.

BM25 vs TF-IDF vs dense retrieval

BM25 vs TF-IDF. Classical TF-IDF multiplies raw term frequency by inverse document frequency. In that raw form it has no saturation, so a term appearing 100 times contributes 100x as much as a term appearing once, which is rarely correct. It also has no built-in length normalization. BM25 fixes both with the k1 and b parameters. Think of BM25 as TF-IDF with two well-chosen knobs.
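
The saturation difference is easy to see numerically. The sketch below compares the raw term-frequency multiplier with the BM25 factor for a document of exactly average length, so the b term drops out. It assumes k1 = 1.2, and bm25_tf is a name made up for the illustration.

```python
def bm25_tf(f, k1=1.2):
    """BM25 term-frequency factor for a document of average length."""
    return f * (k1 + 1) / (f + k1)

for f in (1, 10, 100):
    # Columns: occurrences, raw TF multiplier, BM25 factor
    print(f, f, round(bm25_tf(f), 2))
```

The raw multiplier grows from 1 to 100, while the BM25 factor climbs from 1.0 to about 2.17 and can never exceed k1 + 1 = 2.2.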

BM25 vs sparse retrieval. "Sparse retrieval" is the umbrella term for any retrieval method based on a sparse term-document matrix. BM25 is the most widely used sparse algorithm, but the category also includes TF-IDF, language model scoring with Dirichlet smoothing, and learned-sparse methods like SPLADE. When people say "sparse retrieval" in 2026 without further qualification, they usually mean BM25.

BM25 vs dense retrieval. Dense retrieval encodes queries and documents as fixed-dimensional vectors using a neural model, then ranks by cosine similarity or dot product. It excels at semantic similarity, multilingual search, and paraphrase. It struggles with rare terms, out-of-vocabulary words, and exact-match requirements. BM25 is the opposite. This is why almost every serious production system in 2026 runs hybrid search rather than picking one.
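
For contrast, the ranking step of dense retrieval looks like this. The three-dimensional vectors are invented stand-ins; in a real system they would come from a neural embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings for one query and two documents.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.3, 0.9]}

ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranked)
```

Nothing in this computation looks at the literal query string, which is why an exact token such as an error code can get lost.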

The short version: BM25 is the floor and the baseline. Beat it with a dense model if you can, combine with it if you cannot, but never skip it.

FAQ

Common BM25 questions

BM25 is still in wide use. Lucene, Elasticsearch, OpenSearch, and Solr all ship it as the default scorer, and it is available for Postgres through extensions. Pure dense-vector systems often underperform BM25 on rare keywords, SKUs, and exact-match queries, which is why production teams routinely run hybrid search with BM25 as one component.
