Is BM25 still relevant in 2026?

Yes. Apache Lucene and the engines built on it (Elasticsearch, OpenSearch, and Solr) still ship BM25 as the default scorer. Postgres full-text search is the exception: its built-in ts_rank function is not BM25, so BM25 scoring there typically comes from an extension. Pure dense vector systems often underperform BM25 on rare keywords, SKUs, and exact-match queries, which is why production teams routinely run hybrid search with BM25 as one component.

What is the difference between BM25 and TF-IDF?

TF-IDF multiplies raw term frequency by inverse document frequency with no upper bound and no adjustment for document length. BM25 adds two corrections: term frequency saturates via the k1 parameter (so repeated terms hit diminishing returns), and document length is normalized via the b parameter (so long pages do not win just by being long).

Can BM25 handle typos?

Not on its own. BM25 matches exact tokens after analysis (lowercasing, stemming, stopword removal), so a misspelled query term simply fails to match. To handle typos you need to pair it with fuzzy matching, edit-distance query expansion, or a phonetic algorithm like Double Metaphone. Most search engines expose these as separate query types.
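
As a rough illustration of edit-distance query expansion, the sketch below rewrites a misspelled query term into nearby terms from the index vocabulary before BM25 scoring. The vocabulary, the helper names, and the one-edit threshold are assumptions made for this example, not the behavior of any particular engine.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expand_term(term: str, vocabulary: set[str], max_edits: int = 1) -> list[str]:
    """Return the index terms within max_edits of the query term, sorted."""
    return sorted(v for v in vocabulary if edit_distance(term, v) <= max_edits)


# Hypothetical index vocabulary.
vocabulary = {"espresso", "express", "latte", "mocha"}
print(expand_term("expresso", vocabulary))
```

Note that "expresso" is one edit away from both "espresso" and "express", so the expansion returns both. Fuzzy matching trades precision for recall, which is one reason engines keep it separate from plain BM25 matching.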

Does Elasticsearch use BM25 by default?

Yes, since Elasticsearch 5.0 released in October 2016, which adopted Lucene 6's switch from classical TF-IDF to BM25Similarity. The defaults are k1 = 1.2 and b = 0.75, and both are tunable per field.
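
Per-field tuning is done through index settings and mappings. The sketch below builds an index body as a Python dict; the similarity name "tuned_bm25" and the field name "title" are made up for illustration, and the exact layout should be checked against the similarity documentation for your Elasticsearch version.

```python
import json

# Index body defining a custom BM25 similarity and applying it to one field.
# "tuned_bm25" and "title" are hypothetical names chosen for this example.
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75}
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
}

# Print the JSON that would be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```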

What are k1 and b in BM25?

k1 controls how quickly term frequency saturates. Higher k1 means more credit for repeated terms; lower k1 means earlier saturation. b controls how strongly document length is normalized: b = 0 disables length normalization entirely, and b = 1 normalizes fully against the average document length. The Lucene and Elasticsearch defaults of k1 = 1.2 and b = 0.75 work well for most English corpora.
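
To get a feel for the two knobs, the sketch below evaluates only the term-frequency factor of BM25 for a few parameter values. The function name and the sample values are assumptions chosen for this example.

```python
def tf_component(f: float, k1: float, b: float, length_ratio: float) -> float:
    """BM25 term-frequency factor; length_ratio is |D| / avgdl."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * length_ratio))


# Effect of k1: how much more 10 occurrences count than 1,
# for a document of exactly average length.
for k1 in (0.5, 1.2, 2.0):
    ratio = tf_component(10, k1, 0.75, 1.0) / tf_component(1, k1, 0.75, 1.0)
    print(f"k1={k1}: 10 occurrences count {ratio:.2f}x as much as 1")

# Effect of b: a document four times the average length with 5 occurrences.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: factor={tf_component(5, 1.2, b, 4.0):.3f}")
```

The ratio grows with k1 but stays far below 10x, and the factor for the long document shrinks as b moves from 0 to 1.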

What is BM25? (Ranking Function for Keyword Search)

What BM25 actually is

BM25, short for Okapi BM25 (the "BM" stands for Best Matching, and 25 comes from the numbering of the successive weighting schemes that were tried), is a probabilistic ranking function that scores documents against a keyword query. It grew out of probabilistic relevance work by Stephen Robertson, Karen Spärck Jones, and collaborators, and took shape in the Okapi system at City University London during the TREC information retrieval experiments of the 1990s. The canonical reference is Robertson and Zaragoza's 2009 monograph, "The Probabilistic Relevance Framework: BM25 and Beyond."

Despite being roughly three decades old, BM25 is still the default ranking function in Apache Lucene, Elasticsearch (since version 5.0 in 2016), OpenSearch, and Solr. Postgres full-text search is a notable exception: its built-in ts_rank function is not BM25, and BM25 scoring there typically comes from extensions. Most search systems in production today either use BM25 directly or treat it as the baseline a new model has to beat.

The reason it endures is simple. BM25 captures three things that matter for keyword relevance, and it does so with a closed-form formula that runs in microseconds per document.

How BM25 scoring works

For a query Q against a document D, the BM25 score is a sum over each query term t:

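    score(D, Q) = sum over t in Q of
        IDF(t) * (f(t, D) * (k1 + 1)) / (f(t, D) + k1 * (1 - b + b * |D| / avgdl))

Here |D| is the length of document D in words and avgdl is the average document length across the corpus.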

Three ideas are baked in:

Term frequency with saturation. f(t, D) is how often term t appears in D. The k1 parameter (commonly 1.2 in Elasticsearch and Lucene) controls how fast the contribution of repeated terms saturates. A word appearing 10 times is more relevant than 1 time, but not 10x more, and at 50 times the curve is nearly flat. This is the famous diminishing returns curve.
Inverse document frequency. IDF(t) downweights terms that appear in many documents (like "the" or "and") and upweights rare terms (like "BM25" or a specific SKU).
Length normalization. The b parameter (commonly 0.75) penalizes long documents so a 10,000-word page does not automatically beat a focused 500-word answer just by accumulating term occurrences.
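
The three ideas above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene implements it: the source formula does not spell out IDF, so the sketch assumes the commonly used non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)), and it uses whitespace splitting in place of a real analyzer.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for t in query:
            f = tf[t]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant (always non-negative).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace for simplicity.
docs = [
    "espresso is a concentrated coffee".split(),
    "latte is espresso with steamed milk".split(),
    "green tea has less caffeine than coffee".split(),
]
print(bm25_scores(["espresso", "milk"], docs))
```

The second document matches both query terms and scores highest; the third matches neither and scores zero.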

A concrete example

Imagine a three-document corpus about coffee:

D1 (50 words): mentions "espresso" twice.
D2 (50 words): mentions "espresso" ten times.
D3 (5,000 words): mentions "espresso" twenty times.

A naive TF-IDF score would rank D3 highest because it has the most raw occurrences. But intuitively D2 is the most "about espresso" document. BM25 fixes this in two ways. The saturation curve means D3's twenty mentions are worth only slightly more than D2's ten, and the length normalization term (b * |D| / avgdl) heavily penalizes D3 for being a 5,000-word page where "espresso" is diluted. With the default k1 = 1.2 and b = 0.75, BM25 ranks D2 clearly first, which matches human judgment. D3 and D1 land within about 2% of each other behind it, with D3 narrowly ahead; with stronger length normalization (b around 0.8 or higher) D1 moves ahead of D3.
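
The example is small enough to compute directly. The sketch below plugs the three documents into the scoring formula with k1 = 1.2 and b = 0.75; the IDF variant is an assumption (the same for all three documents, so it does not affect the order). It is worth running because D1 and D3 end up very close, so their relative order depends on the parameter values.

```python
import math

K1, B = 1.2, 0.75

# (name, length in words, occurrences of "espresso") from the example.
corpus = [("D1", 50, 2), ("D2", 50, 10), ("D3", 5000, 20)]

n_docs = len(corpus)
avgdl = sum(length for _, length, _ in corpus) / n_docs
df = sum(1 for _, _, f in corpus if f > 0)  # all three contain the term
idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed IDF variant


def bm25(f: int, length: int) -> float:
    """BM25 score of a single-term query for one document."""
    return idf * f * (K1 + 1) / (f + K1 * (1 - B + B * length / avgdl))


def raw_tfidf(f: int) -> float:
    """Unsaturated, unnormalized TF-IDF for comparison."""
    return idf * f


by_tfidf = sorted(corpus, key=lambda d: raw_tfidf(d[2]), reverse=True)
by_bm25 = sorted(corpus, key=lambda d: bm25(d[2], d[1]), reverse=True)

print("raw TF-IDF order:", [name for name, _, _ in by_tfidf])
print("BM25 order:      ", [name for name, _, _ in by_bm25])
for name, length, f in by_bm25:
    print(name, round(bm25(f, length), 4))
```

Raw TF-IDF puts D3 first, while BM25 puts D2 first, followed by D3 and then D1 by a narrow margin.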

This is the core reason BM25 replaced classical TF-IDF in Lucene 6. The math is barely more complex, but the rankings are noticeably better.

Why BM25 matters for AI chatbots

Modern retrieval-augmented generation systems often default to dense vector search using embeddings, but production teams quickly learn that pure vector retrieval has blind spots. Embeddings are great at paraphrase and semantic similarity, and they are unreliable at exact keyword recall.

A user typing "error code E_AUTH_401" expects the chatbot to retrieve the doc page that literally contains "E_AUTH_401." An embedding model may map that string to a generic "authentication error" cluster and miss the exact page. BM25, on the other hand, treats E_AUTH_401 as a rare term with high IDF and a strong exact match signal. It nails the lookup.

That is why ChatRaj's hybrid search runs BM25 alongside dense vector retrieval, then fuses the two ranked lists with Reciprocal Rank Fusion. BM25 catches SKUs, error codes, product names, version numbers, and rare technical terms. Dense retrieval catches paraphrases and conceptual queries. Together they cover both ends of the query distribution.
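
Reciprocal Rank Fusion itself is short enough to sketch. The version below is a generic illustration, not ChatRaj's actual implementation: each document earns 1 / (k + rank) from every list it appears in, with k = 60 as a commonly used constant, and the document IDs are invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) per list it appears in."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["doc_e_auth_401", "doc_login", "doc_tokens"]
dense_hits = ["doc_login", "doc_sso", "doc_e_auth_401"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and embedding similarities live on incomparable scales. Documents that appear in both lists rise to the top.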

BM25 also has practical operational virtues for chatbots. It is deterministic, debuggable, and cheap. You can explain to a customer exactly why a document ranked where it did. You cannot do that with a 1,024-dimensional embedding distance.

BM25 vs TF-IDF vs dense retrieval

BM25 vs TF-IDF. Classical TF-IDF multiplies raw term frequency by inverse document frequency. It has no saturation, so a term appearing 100 times scores 100x more than 1 time, which is rarely correct. It also has no built-in length normalization. BM25 fixes both with the k1 and b parameters. Think of BM25 as TF-IDF with two well-chosen knobs.
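
The saturation difference is easy to see numerically. The sketch below compares how much more a term counts at 10 and 100 occurrences than at 1, under raw term frequency and under the BM25 term-frequency factor, assuming a document of exactly average length and the default k1 and b.

```python
def raw_tf(f: int) -> float:
    """Classical TF-IDF term weight before IDF: just the count."""
    return float(f)


def bm25_tf(f: int, k1: float = 1.2, b: float = 0.75,
            length_ratio: float = 1.0) -> float:
    """BM25 term-frequency factor; length_ratio is |D| / avgdl."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * length_ratio))


for f in (1, 10, 100):
    raw_gain = raw_tf(f) / raw_tf(1)
    bm25_gain = bm25_tf(f) / bm25_tf(1)
    print(f"f={f}: raw TF gain {raw_gain:.0f}x, BM25 gain {bm25_gain:.2f}x")
```

For an average-length document the BM25 factor can never exceed k1 + 1, so with k1 = 1.2 even 100 occurrences count only a little more than twice as much as one.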

BM25 vs sparse retrieval. "Sparse retrieval" is the umbrella term for any retrieval method based on a sparse term-document matrix. BM25 is the most widely used sparse algorithm, but the category also includes TF-IDF, language model scoring with Dirichlet smoothing, and learned sparse methods like SPLADE. When people say "sparse retrieval" in 2026 they usually mean BM25 specifically, unless they say otherwise.

BM25 vs dense retrieval. Dense retrieval encodes queries and documents as fixed-dimensional vectors using a neural model, then ranks by cosine similarity or dot product. It excels at semantic similarity, multilingual search, and paraphrase. It struggles with rare terms, out-of-vocabulary words, and exact-match requirements. BM25 has the opposite profile: strong on rare terms and exact matches, blind to paraphrase. This is why almost every serious production system in 2026 runs hybrid search rather than picking one.
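
For contrast with the BM25 formula, the ranking step of dense retrieval looks roughly like the sketch below. The three-dimensional vectors are made-up stand-ins for real model embeddings, which typically have hundreds or thousands of dimensions, and the document names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors standing in for embeddings (invented values).
doc_vectors = {
    "doc_auth": [0.9, 0.1, 0.0],
    "doc_billing": [0.1, 0.9, 0.2],
    "doc_sso": [0.5, 0.5, 0.1],
}
query_vector = [0.8, 0.2, 0.0]

ranked = sorted(doc_vectors,
                key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked)
```

Nothing in this ranking depends on whether a specific token appears in the document, which is both the strength and the blind spot of the approach.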

The short version: BM25 is the floor and the baseline. Beat it with a dense model if you can, combine with it if you cannot, but never skip it.
