Is TF-IDF still used in production?

Less often for ranking. BM25 has replaced TF-IDF as the default scorer in Lucene and the engines built on it, including Elasticsearch, OpenSearch, and Solr. But scikit-learn's TfidfVectorizer remains a common default vectorizer for text classification, clustering, and small corpus search pipelines in Python.

What is the formula for TF-IDF?

The textbook formula is TF(t, d) * log(N / df(t)), where N is the number of documents in the corpus and df(t) is how many of those contain term t. Scikit-learn's default smoothed variant is TF(t, d) * (log((1 + N) / (1 + df(t))) + 1).

Why does TF-IDF use a logarithm?

The log compresses the IDF range. Without it, a term that appears in 1 of 1,000,000 documents would get a weight a million times higher than a term that appears in every document. With the log, that gap shrinks to a difference of about 13.8 (the natural log of 1,000,000), so rare words still score higher but do not dominate everything.
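To make the compression concrete, here is a small sketch in plain Python. The helper names raw_idf and log_idf are made up for this illustration, and the natural log is an assumption (any base gives the same shape).

```python
import math

N = 1_000_000  # corpus size from the example above

def raw_idf(n_docs: int, df: int) -> float:
    """Inverse document frequency without a logarithm."""
    return n_docs / df

def log_idf(n_docs: int, df: int) -> float:
    """Textbook IDF: natural log of N / df."""
    return math.log(n_docs / df)

# Compare both versions for a very rare, a moderately rare, and a universal term.
for df in (1, 1_000, N):
    print(f"df={df:>9,}  raw={raw_idf(N, df):>12,.0f}  log={log_idf(N, df):6.2f}")
```

The raw column spans six orders of magnitude, while the log column stays within a range of roughly 0 to 14.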

Can TF-IDF handle synonyms?

No. TF-IDF only sees literal tokens, so 'car' and 'automobile' look completely unrelated to it. Handling synonyms typically calls for dense embeddings produced by a neural embedding model, which is why most modern chatbot retrieval combines TF-IDF or BM25 with embeddings in a hybrid setup.
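A quick way to see this is to compare token vectors directly. The sketch below uses raw counts rather than full TF-IDF weights, which is a simplification: IDF only rescales each token's weight, so it cannot create overlap where no tokens are shared. The cosine helper is hypothetical, written just for this example.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine("car", "automobile"))          # no shared tokens, so similarity is zero
print(cosine("red car", "red automobile"))  # only "red" overlaps
```

Whatever similarity the second pair gets comes entirely from the shared word "red"; the synonym pair itself contributes nothing.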

Is TF-IDF the same as bag of words?

Not quite. Bag of words is the representation: each document becomes a vector of token counts with no order. TF-IDF is a reweighting on top of that representation, replacing raw counts with TF * IDF scores so distinctive terms get higher weights than common ones.
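The two steps can be sketched separately in plain Python. This assumes whitespace tokenization and the textbook IDF, log(N / df); the three sample documents and the tfidf helper are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "the model tunes the hyperparameter",
    "the cat sat on the mat",
    "the dog chased the cat",
]
tokenized = [d.split() for d in docs]

# Step 1, bag of words: unordered token counts per document.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: how many documents contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf(bag: Counter) -> dict[str, float]:
    """Step 2, reweighting: replace each raw count with count * log(N / df)."""
    return {t: count * math.log(n_docs / df[t]) for t, count in bag.items()}

weights = tfidf(bags[0])
print(bags[0])   # raw counts: "the" has the highest count
print(weights)   # reweighted: "the" drops to zero, rarer terms rise
```

In the bag of words, "the" has the largest count in the first document. After reweighting, it has the smallest weight because it appears in every document.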

What is TF-IDF? (Term Frequency, Inverse Document Frequency)

What TF-IDF actually is

TF-IDF is a way to answer a simple question: which words in this document carry real information, and which ones are just filler? The intuition is that a word like "the" appears in almost every document, so it tells you nothing about what a document is about. A word like "hyperparameter" appears in only a few, so when it shows up, it is a strong signal.

The score has two halves. Term frequency (TF) counts how often a word shows up in one document. Inverse document frequency (IDF) measures how rare that word is across the entire collection. Multiply them together and you get a weight per (word, document) pair: high when a word is used a lot in this document but rarely elsewhere, low otherwise.

The IDF idea was introduced by Karen Spärck Jones in her 1972 paper "A Statistical Interpretation of Term Specificity and Its Application in Retrieval" (Journal of Documentation, 28(1), pages 11 to 21). The paper argued that terms should be weighted by collection frequency, not by meaning. That single insight underpinned roughly the next 35 years of information retrieval and is still the conceptual core of modern keyword rankers, including BM25.

How TF-IDF scoring works

The classic formulation is:

TF-IDF(t, d) = TF(t, d) * IDF(t)
IDF(t)       = log( N / df(t) )

where N is the number of documents in the corpus and df(t) is the number of documents that contain term t. The logarithm compresses the IDF range so that doubling the corpus does not double a term's weight, and so common words do not crowd out rare ones.

In practice you rarely see the textbook formula unmodified. Scikit-learn's TfidfVectorizer, one of the most widely used implementations, smooths the IDF by default: it adds 1 to both N and df(t), as if one extra document contained every term, which avoids division by zero. With smooth_idf=True, the default, the formula is:

IDF(t) = log( (1 + N) / (1 + df(t)) ) + 1

The trailing "+1", outside the logarithm, is a deliberate scikit-learn choice: it makes sure a term that appears in every document still gets a small positive weight rather than being silently dropped.

Concrete example, using natural logs throughout. Suppose your corpus has 5 documents. The word "the" appears in all 5, so df = 5 and the textbook IDF is log(5/5) = 0. With scikit-learn smoothing it is log(6/6) + 1 = 1, the minimum possible IDF. The word "hyperparameter" appears in only 1 of the 5 documents, so the textbook IDF is log(5/1) ≈ 1.61 and the smoothed scikit-learn IDF is log(6/2) + 1 ≈ 2.10. Multiply by term frequency in a given document and "hyperparameter" dominates the ranking signal, exactly as you would want.
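The numbers above can be checked in a few lines of Python. This is a sketch that reimplements the two formulas by hand rather than calling scikit-learn; the function names textbook_idf and smoothed_idf are made up for the example.

```python
import math

def textbook_idf(n_docs: int, df: int) -> float:
    """Classic IDF: log(N / df), natural log."""
    return math.log(n_docs / df)

def smoothed_idf(n_docs: int, df: int) -> float:
    """Smoothed form quoted above: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Five-document corpus: "the" is in all 5, "hyperparameter" in only 1.
for term, df in [("the", 5), ("hyperparameter", 1)]:
    print(f"{term:>15}: textbook={textbook_idf(5, df):.2f}  smoothed={smoothed_idf(5, df):.2f}")
```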

After scoring, the per-document vector is usually L2 normalized so document length does not skew comparisons, and then queries are matched against documents using cosine similarity. That whole pipeline (tokenize, count, weight, normalize, dot product) is one of the most shipped algorithms in computing history.
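The five steps fit in a short, dependency-free sketch. It assumes a simple lowercase alphanumeric tokenizer and the smoothed IDF described earlier; the function names and the three sample documents are invented for illustration, and a real project would more likely use scikit-learn's TfidfVectorizer.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Step 1: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(docs: list[str]) -> dict[str, float]:
    """Smoothed IDF per term: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def vectorize(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Steps 2 to 4: count, weight by IDF, then L2 normalize."""
    counts = Counter(t for t in tokenize(text) if t in idf)  # unseen terms are ignored
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Step 5: both vectors are unit length, so the dot product is the cosine."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    "Tuning a hyperparameter changes how the model learns",
    "The cat sat on the mat",
    "The dog chased the cat around the garden",
]
idf = fit_idf(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query = vectorize("hyperparameter tuning", idf)
scores = [cosine(query, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Sparse dictionaries keep the sketch short. At any real scale you would switch to a sparse matrix so that all documents can be scored in one multiplication.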

Why TF-IDF matters for AI chatbots

For an AI chatbot doing retrieval-augmented generation, TF-IDF is a useful piece of vocabulary even when you are not using it directly. It explains why some retrieval systems still miss obvious matches: a dense retrieval system using an embedding model sees text in semantic space, but it can blur exact terms like product codes, error messages, or proper nouns. A keyword scorer that weights those rare tokens highly tends to find them every time.

That is why most production chatbot retrieval is hybrid: a dense vector arm for "what does this mean" and a sparse keyword arm for "did the user mention this specific token." ChatRaj uses BM25 as the keyword arm of its hybrid retrieval (BM25 is the modern descendant of TF-IDF). Understanding TF-IDF makes it easier to reason about why hybrid setups work and when keyword recall matters more than semantic recall.

TF-IDF is also still the right answer for plenty of standalone use cases: classifying support tickets with logistic regression, deduping documents in a small corpus, ranking log lines, building a quick search box over a knowledge base of a few thousand pages. Scikit-learn's TfidfVectorizer remains a common starting point for text classification pipelines in Python.

TF-IDF vs BM25: the upgrade you didn't know happened

The honest summary is that BM25 has replaced TF-IDF as the default in the Lucene family of search engines. Lucene, Elasticsearch, OpenSearch, and Solr all use BM25 by default. Postgres full text search is a different case: its built-in ts_rank functions are neither BM25 nor classic TF-IDF. If you are running a Lucene based search engine today with default settings, you are not running TF-IDF.

BM25 fixes two real weaknesses of TF-IDF. First, TF in classic TF-IDF grows linearly with term count, so a document that mentions a query word 50 times scores roughly 10 times higher than one that mentions it 5 times, which is rarely what you want. BM25 saturates term frequency through a tunable k1 parameter, so the 50th mention barely adds to the score. Second, the raw TF-IDF score has no length normalization of its own, so unless vectors are normalized in a separate step, long documents tend to dominate just by having more words to match. BM25 builds normalization into the score, scaling by document length relative to the corpus average through a tunable b parameter.
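Both effects show up in the term-frequency part of the commonly cited BM25 formula. The sketch below is not any particular engine's implementation. The function name bm25_tf is made up, and k1 = 1.2 and b = 0.75 are assumed here as typical default values.

```python
def bm25_tf(tf: int, doc_len: int, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency component of BM25 (multiplied by IDF in the full score)."""
    # b controls how strongly document length, relative to the average, matters.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls saturation: the result can never exceed k1 + 1.
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: going from 5 to 50 mentions adds little in an average-length document.
for tf in (1, 5, 50):
    print(tf, round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: the same 5 mentions count for less in a document 4x the average.
print(round(bm25_tf(5, doc_len=400, avg_doc_len=100), 3))
```

With linear TF, 50 mentions would score 10 times higher than 5. Here the ratio is closer to 1.2, and the long document is penalized without any separate normalization step.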

So when should you reach for TF-IDF today? Small static corpora where simplicity matters. Baseline implementations where you want a transparent score. Scikit-learn pipelines feeding a downstream classifier. Log analysis tools. Anywhere you need a vectorizer in 20 lines of Python without a search index. For production search and RAG, use BM25 (or hybrid) and treat TF-IDF as the algorithm that taught the field how to think about term importance.
