
TF-IDF

Bottom line
TF-IDF (Term Frequency, Inverse Document Frequency) is a statistical weighting that scores how distinctive a word is to a document. It multiplies how often the word appears in the document by the logarithm of how rare the word is across the full corpus. It powered the first generation of keyword search engines.

What TF-IDF actually is

TF-IDF is a way to answer a simple question: which words in this document carry real information, and which ones are just filler? The intuition is that a word like "the" appears in almost every document, so it tells you nothing about what a document is about. A word like "hyperparameter" appears in only a few, so when it shows up, it is a strong signal.

The score has two halves. Term frequency (TF) counts how often a word shows up in one document. Inverse document frequency (IDF) measures how rare that word is across the entire collection. Multiply them together and you get a weight per (word, document) pair: high when a word is used a lot in this document but rarely elsewhere, low otherwise.

The IDF idea was introduced by Karen Spärck Jones in her 1972 paper "A Statistical Interpretation of Term Specificity and Its Application in Retrieval" (Journal of Documentation, 28(1), pages 11 to 21). The paper argued that terms should be weighted by collection frequency, not by meaning. That insight underpinned decades of information retrieval work and is still at the conceptual core of lexical rankers such as BM25, which keeps an IDF component.

How TF-IDF scoring works

The classic formulation is:

code
TF-IDF(t, d) = TF(t, d) * IDF(t)
IDF(t)       = log( N / df(t) )

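where N is the number of documents in the corpus and df(t) is the number of documents that contain term t. The logarithm dampens the IDF: a term that appears in ten times fewer documents does not get ten times the weight. Instead, the weight grows by a fixed amount each time a term becomes rarer by a constant factor, so a handful of very rare terms cannot swamp everything else in the score.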
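
The textbook formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name textbook_tfidf, the whitespace tokenizer, the use of raw counts for TF, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def textbook_tfidf(docs):
    """Return one {term: weight} dict per document, using TF * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # df(t): number of documents that contain term t at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

# Toy corpus, invented for illustration.
docs = [
    "the model has a hyperparameter",
    "the cat sat on the mat",
    "the dog chased the cat",
]
scores = textbook_tfidf(docs)
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while "hyperparameter" appears in one document of three and gets log(3) in that document.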

In practice you almost never see the textbook formula. Scikit-learn's TfidfVectorizer, the most widely used implementation, smooths the IDF by default. Adding 1 to both N and df(t) avoids division by zero, and adding 1 to the result keeps terms that appear in every document from getting an IDF of zero. With smooth_idf=True, the default, the formula is:

code
IDF(t) = log( (1 + N) / (1 + df(t)) ) + 1

The trailing "+1" outside the logarithm is a deliberate scikit-learn choice, and it is applied even when smooth_idf=False: it makes sure a term that appears in every document still gets a small positive weight rather than being zeroed out.

Concrete example, using the natural logarithm. Suppose your corpus has 5 documents. The word "the" appears in all 5, so df = 5 and the textbook IDF is log(5/5) = 0. With scikit-learn smoothing it is log(6/6) + 1 = 1, the minimum possible IDF. The word "hyperparameter" appears in only 1 of the 5 documents, so the textbook IDF is log(5/1) ≈ 1.61 and the smoothed scikit-learn IDF is log(6/2) + 1 ≈ 2.10. Multiply by term frequency in a given document and "hyperparameter" dominates the ranking signal, exactly as you would want.
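
The arithmetic in this example is easy to check with the standard library alone. The sketch below recomputes both formulas for the 5-document corpus. The helper names textbook_idf and smoothed_idf are made up for illustration, and nothing here calls scikit-learn itself.

```python
import math

def textbook_idf(n_docs, df):
    # IDF(t) = log(N / df(t)), natural logarithm
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    # IDF(t) = log((1 + N) / (1 + df(t))) + 1, the smooth_idf=True formula quoted above
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 5
idf_table = {
    "the": (textbook_idf(n_docs, 5), smoothed_idf(n_docs, 5)),
    "hyperparameter": (textbook_idf(n_docs, 1), smoothed_idf(n_docs, 1)),
}

for term, (plain, smooth) in idf_table.items():
    print(f"{term}: textbook={plain:.2f} smoothed={smooth:.2f}")
```

Rounded to two decimals, this gives 0.00 and 1.00 for "the", and 1.61 and 2.10 for "hyperparameter", matching the figures in the text.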

After scoring, the per-document vector is usually L2-normalized so document length does not skew comparisons, and then queries are matched against documents using cosine similarity. That whole pipeline (tokenize, count, weight, normalize, dot product) is one of the most widely deployed text-ranking recipes in software.
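
The pipeline (tokenize, count, weight, normalize, dot product) can be sketched end to end without any dependencies. This is a minimal version under stated assumptions: a whitespace tokenizer, the smoothed IDF formula from earlier in this section, and query terms dropped if they never appear in the corpus. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed IDF per term: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict, L2-normalized to unit length."""
    counts = Counter(t for t in tokenize(text) if t in idf)  # skip unseen terms
    vec = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(term, 0.0) for term, w in a.items())

docs = [
    "tuning a hyperparameter for the model",
    "the cat sat on the mat",
    "the dog chased the cat",
]
idf = fit_idf(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize("hyperparameter tuning", idf)

# Document indices, best match first.
ranking = sorted(
    range(len(docs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)
```

Storing vectors as dicts keeps the example short. A real system would use sparse matrices or an inverted index, but the scoring logic is the same.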

Why TF-IDF matters for AI chatbots

For an AI chatbot doing retrieval-augmented generation, TF-IDF is a useful piece of vocabulary even when you are not using it directly. It explains why some retrieval systems still miss obvious matches: a dense retrieval system using an embedding model sees text in semantic space, but it can blur exact terms like product codes, error messages, or proper nouns. A keyword scorer that weights those rare tokens highly finds them far more reliably.

That is why most production chatbot retrieval is hybrid: a dense vector arm for "what does this mean" and a sparse keyword arm for "did the user mention this specific token." ChatRaj uses BM25 as the keyword arm of its hybrid retrieval (BM25 is the modern descendant of TF-IDF). Understanding TF-IDF makes it easier to reason about why hybrid setups work and when keyword recall matters more than semantic recall.
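
One way to picture how the two arms get combined is reciprocal rank fusion (RRF), a common technique for merging ranked lists. The source does not say how ChatRaj fuses its results, so treat this purely as an illustrative sketch: the function name, the document ids, and the constant k=60 (a conventional choice) are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both arms rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for a query that mentions an exact error code.
dense_ranking = ["faq-refunds", "faq-shipping", "err-4012"]   # semantic arm
sparse_ranking = ["err-4012", "faq-refunds"]                  # keyword arm

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Here the keyword arm pulls the exact-match document "err-4012" from third place in the dense list up to second in the fused list, which is the kind of rescue the sparse arm exists for.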

TF-IDF is also still the right answer for plenty of standalone use cases: classifying support tickets with logistic regression, deduping documents in a small corpus, ranking log lines, building a quick search box over a knowledge base of a few thousand pages. Scikit-learn's TfidfVectorizer remains a common starting point for text classification pipelines in Python.

TF-IDF vs BM25: the upgrade you didn't know happened

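The honest summary is that BM25 has replaced TF-IDF as the default in the major Lucene-based search engines. Lucene, Elasticsearch, OpenSearch, and Solr all use BM25 by default. Postgres full-text search is a different case: its built-in ts_rank functions do not use corpus-wide statistics such as IDF, so they are neither TF-IDF nor BM25. If you are running a Lucene-based search engine today with default settings, you are not running classic TF-IDF.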

BM25 fixes two real weaknesses of TF-IDF. First, TF in classic TF-IDF grows linearly with term count, so a document that mentions a query word 50 times scores 10 times higher than one that mentions it 5 times, which is rarely what you want. BM25 saturates term frequency through a tunable k1 parameter, so the 50th mention barely adds to the score. Second, the textbook TF-IDF formula has no built-in length normalization (cosine normalization is a separate step added afterward), so on raw scores long documents tend to dominate just by having more words to match. BM25 normalizes by document length relative to the corpus average through a tunable b parameter.
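
Both fixes are visible in the term-frequency part of the BM25 formula. The sketch below compares it with linear TF. It leaves out the IDF factor, which multiplies both sides equally, and uses k1=1.2 and b=0.75, which are commonly used defaults assumed here rather than taken from the source. The function names and document lengths are illustrative.

```python
def linear_tf(tf):
    # Classic TF-IDF: the term-frequency factor is just the raw count.
    return float(tf)

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency factor of BM25 (IDF omitted).

    k1 controls how quickly repeated mentions saturate;
    b controls how strongly longer-than-average documents are penalized.
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# 50 mentions vs 5 mentions, in documents of average length.
ratio_linear = linear_tf(50) / linear_tf(5)
ratio_bm25 = bm25_tf(50, 100, 100) / bm25_tf(5, 100, 100)

# Same 5 mentions, but in a document four times the average length.
avg_doc = bm25_tf(5, 100, 100)
long_doc = bm25_tf(5, 400, 100)
```

With these settings the linear ratio is 10, while the BM25 ratio is about 1.2, and the long document's factor is lower than the average-length document's for the same number of mentions.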

So when should you reach for TF-IDF today? Small static corpora where simplicity matters. Baseline implementations where you want a transparent score. Scikit-learn pipelines feeding a downstream classifier. Log analysis tools. Anywhere you need a vectorizer in 20 lines of Python without a search index. For production search and RAG, use BM25 (or hybrid) and treat TF-IDF as the algorithm that taught the field how to think about term importance.

FAQ

Common TF-IDF questions

TF-IDF is used less often for ranking than it once was. BM25 has replaced it as the default in Lucene, Elasticsearch, OpenSearch, and Solr. But scikit-learn's TfidfVectorizer remains a common vectorizer for text classification, clustering, and small-corpus search pipelines in Python.
