ChatRaj
Definition

What is RAG (Retrieval-Augmented Generation)?

The plain-English explanation of the architecture behind almost every production AI chatbot in 2026.

Bottom line
Retrieval-Augmented Generation (RAG) is a technique that gives a large language model access to outside text at query time. The pipeline is three steps: (1) retrieve a small set of relevant passages from a knowledge base using vector search, keyword search, or both; (2) augment the user prompt by stuffing those passages into the context window; (3) generate the answer with the LLM, grounded in the retrieved text. RAG was introduced by Lewis et al. in a 2020 Facebook AI paper and has become the default way to make a general-purpose LLM answer questions about private or recent data.

The plain-English RAG definition

Retrieval-Augmented Generation, almost always called RAG, is a way to make a large language model answer questions using a specific body of text that the model was never trained on. That body of text might be your help center, your product docs, six months of support tickets, or the archive of a research lab. RAG lets the LLM cite that material in its answer without anyone retraining the model.

The name describes the architecture exactly. At query time, a retrieval step pulls relevant passages out of your knowledge base. Those passages are augmented onto the user's prompt. The combined prompt is sent to a generation model (the LLM), which writes the final answer. Three steps: retrieve, augment, generate.

RAG exists because base LLMs have two problems that show up the minute you try to use them for real work. The first is hallucination: a model trained on the public internet has learned the statistical shape of confident-sounding answers, and will produce one even when it has no idea what the truth is. The second is the knowledge cutoff: every model is trained on a snapshot of text that ends on a specific date and has no idea what happened after, or what is true inside your private systems. RAG addresses both by giving the LLM the exact passages to use as evidence.

The technique was formalized in a 2020 NeurIPS paper by Patrick Lewis and colleagues at what was then Facebook AI Research, titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." That paper showed that combining a parametric model (the LLM) with a non-parametric memory (an external index of text) produced more factual and specific answers than the LLM alone. Six years later, RAG is the default pattern for building production AI assistants over private data.

The 3-step RAG architecture

Step one is retrieval. You take the user's question and run it against an index of your knowledge base. The index returns a small ranked list of passages, typically five to twenty. The retrieval mechanism is usually a vector database (semantic similarity), a keyword search engine (exact-term overlap), or both.

Step two is augmentation. You paste the retrieved passages into the prompt sent to the LLM. A typical augmented prompt: "Here are passages from our knowledge base. [PASSAGE 1] [PASSAGE 2] [PASSAGE 3]. Using only these passages, answer: [QUESTION]." That preamble tells the LLM what counts as ground truth (the passages) and what does not (its own pretraining memory).
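
The augmentation step can be sketched in a few lines of Python. This is a minimal illustration, not ChatRaj's actual template: the function name `build_augmented_prompt` and the exact wording of the instructions are assumptions made for the example.

```python
def build_augmented_prompt(question: str, passages: list[str]) -> str:
    """Number each retrieved passage and wrap the user's question with them."""
    # Number passages from 1 so the model can refer to them by index.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Here are passages from our knowledge base.\n"
        f"{numbered}\n"
        "Using only these passages, answer the question. "
        "If the passages do not contain the answer, say you do not know.\n"
        f"Question: {question}"
    )


prompt = build_augmented_prompt(
    "What is your refund policy?",
    ["We offer a full refund within 30 days of purchase."],
)
print(prompt)
```

Numbering the passages is what later makes per-claim citations possible.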

Step three is generation. The LLM produces an answer. Production RAG systems typically ask the LLM to cite which passage each claim came from, so the user can verify the answer rather than trust it blindly. To the user, an answer from a RAG system that cannot show its sources is indistinguishable from a hallucination.

Production systems wrap more machinery around this loop: rewriting the question into a better retrieval query, deduplicating passages, reranking with a stronger model, and post-processing citations. But the three-step skeleton is always present.

How retrieval works in practice

The retrieval step is the part of RAG that gets the most engineering attention because it determines answer quality. If retrieval surfaces the wrong passages, generation cannot recover.

Retrieval starts with chunking. You cannot index a thousand-page document as one giant blob; the retrieval step needs to return short focused passages, not entire books. So you split each document into chunks of a few hundred tokens, with some overlap between adjacent chunks so a sentence that straddles a boundary does not get split in half. Choosing a good chunk size is more art than science. Too small and chunks lose context; too large and the LLM sees too much irrelevant text along with the relevant bit.
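
A minimal sketch of overlapping chunking, assuming the document has already been split into a list of tokens. Whitespace-separated words stand in for real tokenizer output here, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Toy document of 1,200 "tokens" (an assumption for illustration).
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])
```

The last 50 tokens of each chunk reappear as the first 50 of the next, so text near a boundary shows up in both neighbours.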

Each chunk gets turned into an embedding: a list of numbers (usually 384 to 3072 dimensions) that represents the meaning of the chunk. Embeddings are produced by a small neural network trained for this job. Open-source models like the BGE family from BAAI (Xiao et al., "C-Pack: Packed Resources For General Chinese Embeddings", 2023) and proprietary models from OpenAI, Cohere, and Voyage all produce vectors with the property that semantically similar passages end up near each other in vector space. The query gets embedded the same way; vector search finds the chunks whose embeddings are nearest by cosine similarity.
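
The nearest-neighbour lookup can be illustrated with plain Python. The three-dimensional vectors below are made up for the example (real embeddings come from an embedding model and have hundreds of dimensions), and the brute-force scan is a stand-in for the approximate index a vector database would use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]], top_k: int = 2):
    """Return the top_k (chunk_id, similarity) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index: chunk id -> embedding.
toy_index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "error-codes": [0.0, 0.2, 0.9],
}
results = nearest_chunks([0.8, 0.2, 0.1], toy_index)
print(results)
```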

Pure vector search has a well-known failure mode: it is bad at exact-term matching. If a user asks "what is error code E-204" and the docs contain that literal string in one specific paragraph, vector search might return passages about general error handling instead. Keyword search handles this case naturally. BM25, described in "The Probabilistic Relevance Framework: BM25 and Beyond" (Robertson and Zaragoza, 2009), ranks passages by how often each query term appears in the passage and how rare that term is across the whole corpus, with a correction for passage length. It is decades old and still the strongest baseline for exact-term retrieval.
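
To make the scoring concrete, here is a small sketch of one common BM25 variant. The parameter values `k1=1.5` and `b=0.75` and the smoothed IDF formula are conventional choices assumed for this example, not values taken from any particular search engine, and the three toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents each term appears in (document frequency).
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error code e-204 means the upload timed out".split(),
    "general error handling and retry guidance".split(),
    "how to reset your password".split(),
]
scores = bm25_scores("what is error code e-204".split(), docs)
print(scores)
```

The first document wins because it contains the rare literal term `e-204`, which is the case pure vector search tends to miss.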

Hybrid retrieval runs both systems and fuses the rankings. Reciprocal Rank Fusion, introduced by Cormack, Clarke, and Buttcher in a 2009 SIGIR paper, is the standard fusion algorithm: each document gets a score of 1 divided by (k plus its rank in each system), and the scores are summed across systems. RRF is one line of math and has the useful property of not caring about how each system's scores are calibrated. It just cares about ranks. Hybrid retrieval reliably outperforms either system alone on real-world queries.
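
Reciprocal Rank Fusion fits in a few lines. This sketch assumes each retriever returns a list of document ids ordered best-first; `k=60` is the constant commonly used with RRF, and the function name and `top_n` parameter are choices made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse several best-first rankings into one list of document ids."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters; the retriever's raw score is never used.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)[:top_n]


vector_ranking = ["a", "b", "c"]
bm25_ranking = ["c", "a", "d"]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))
```

Document `a` comes out on top because it ranks near the top of both lists, while `b` and `d` each appear in only one.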

A separate strand of retrieval research, late-interaction models like ColBERT (Khattab and Zaharia, SIGIR 2020), encodes the query and the document separately into one embedding per token and computes similarity at the token level rather than the document level. ColBERT accepts a larger index in exchange for better retrieval quality and is increasingly common in high-stakes RAG systems.
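
The token-level scoring used by late-interaction models is often described as "MaxSim": each query token is matched to its most similar document token, and those maxima are summed. The sketch below shows only that scoring step, on hand-made two-dimensional unit vectors; producing the per-token embeddings requires the trained model and is out of scope here.

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    if not doc_vecs:
        return 0.0

    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token embeddings (assumed to be unit-normalized).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Storing one vector per token instead of one per passage is what makes the index larger.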

A worked example using ChatRaj

To make the pipeline concrete, here is what happens when a visitor asks a ChatRaj-powered chatbot "what is your refund policy."

When the operator first set up the bot, ChatRaj crawled the website. Each page was chunked into roughly 500-token passages with 50 tokens of overlap. Each chunk was embedded into a 768-dimensional vector and stored in a per-bot vector index, alongside a BM25 inverted index. The refund-policy page produced a chunk containing the literal sentence "We offer a full refund within 30 days of purchase."

When the visitor sends the question, the backend embeds the query into the same vector space. Vector search retrieves the top twenty chunks whose embeddings are most similar. In parallel, BM25 retrieves the top twenty chunks containing the exact terms "refund" and "policy." The two ranked lists are fused with Reciprocal Rank Fusion and the top five are kept.

Those five passages are pasted into a prompt template like: "Answer using only the passages below. Cite each claim with [Source N]. Passages: [1] We offer a full refund within 30 days of purchase. [2] ... [QUESTION] What is your refund policy?" The LLM generates: "We offer a full refund within 30 days of purchase [Source 1]. Refunds are processed to your original payment method within 5 business days [Source 2]." The visitor can click each [Source N] to verify the passage. If the question is outside the knowledge base, the LLM says it does not know rather than making up an answer. The pipeline runs in under a second.

RAG vs fine-tuning vs prompt engineering

RAG is one of three common techniques for getting an LLM to use specific information. The other two are fine-tuning (continuing the model's training on your data) and prompt engineering (stuffing the information directly into every prompt).

Fine-tuning bakes knowledge into the model's weights. It is the right tool for teaching a style, a tone, a domain-specific format, or a task the base model is not good at. It is the wrong tool for "give the model access to my changing knowledge base," because every update requires another fine-tuning run.

Prompt engineering stuffs everything into the context window. It works when the entire knowledge base fits in a single prompt and when you do not mind paying for that context on every call. It does not scale to knowledge bases larger than the context window, and it gets expensive fast.

RAG is the choice when the knowledge base is large, when it changes frequently, when you want citations, and when you want each query to pay only for the passages it actually needs. Almost every production AI chatbot in 2026 uses RAG for these reasons.

Common RAG failures (and how production systems mitigate them)

Three failure modes show up over and over. Knowing what they look like is most of the battle.

Retrieval irrelevance is when the retriever returns passages that are not actually relevant, and the LLM either makes something up or refuses to answer. Mitigations: hybrid retrieval, query rewriting, and a reranker that re-scores the top-N candidates with a stronger cross-encoder before passing them to the LLM.
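
The reranking step has a simple shape: score every (query, passage) pair, sort, and keep the best few. In the sketch below the scorer is a pluggable function, and `toy_overlap_score` is only a word-overlap stand-in so the example runs without a model; a production system would pass in a trained cross-encoder instead.

```python
from typing import Callable


def rerank(query: str, candidates: list[str], score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-order candidate passages by a pairwise relevance score, best first."""
    ranked = sorted(candidates, key=lambda passage: score_pair(query, passage), reverse=True)
    return ranked[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "Our offices are closed on public holidays.",
    "Annual plans follow the same refund policy as monthly plans.",
    "The refund policy covers 30 days.",
]
reranked = rerank("refund policy for annual plans", candidates, toy_overlap_score, top_n=2)
print(reranked)
```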

Stale content is when the knowledge base has not been re-indexed since a document changed. Mitigations: automatic recrawling on a schedule, webhook-triggered reindex when the source CMS publishes, and a freshness signal so newer passages are preferred when two contradict.
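
One simple way to decide what to reindex during a scheduled recrawl is to compare a hash of each page's current text with the hash stored at indexing time. This is an assumed approach for illustration, not a description of any specific product; the function names and the toy documents are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hashes recorded when the index was last built.
indexed = {"refunds": content_hash("30 days"), "shipping": content_hash("3-5 days")}
# What the crawler sees now: one edited page, one unchanged, one new.
current = {"refunds": "14 days", "shipping": "3-5 days", "returns": "new page"}
stale = find_stale_documents(current, indexed)
print(stale)
```

Only the documents in `stale` need to be re-chunked and re-embedded.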

Citation drift is when the LLM cites a passage but the cited passage does not actually support the claim. Mitigations: post-generation citation verification (a second model checks whether each cited passage entails the cited claim) and falling back to "I don't know" when verification fails.
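
A rough sketch of post-generation citation checking, assuming answers use the `[Source N]` format and one claim per sentence. The `supports` argument is where a second model would judge entailment; `toy_supports` is a plain substring check used only so the example runs, and all function names are hypothetical.

```python
import re
from typing import Callable

CITATION = re.compile(r"\[Source (\d+)\]")


def verify_citations(answer: str, passages: list[str], supports: Callable[[str, str], bool]) -> bool:
    """True only if every sentence cites a valid passage that supports it."""
    sentences = [s for s in re.split(r"(?<=\.)\s+", answer.strip()) if s]
    if not sentences:
        return False
    for sentence in sentences:
        cited = [int(n) for n in CITATION.findall(sentence)]
        if not cited:
            return False  # an uncited claim cannot be verified
        claim = CITATION.sub("", sentence).strip().rstrip(".").strip()
        for n in cited:
            if not 1 <= n <= len(passages):
                return False  # citation points at a passage that does not exist
            if not supports(claim, passages[n - 1]):
                return False
    return True


def answer_or_fallback(answer: str, passages: list[str], supports: Callable[[str, str], bool]) -> str:
    """Return the answer if its citations check out, else decline."""
    return answer if verify_citations(answer, passages, supports) else "I don't know."


def toy_supports(claim: str, passage: str) -> bool:
    """Stand-in for an entailment model: the claim must appear verbatim."""
    return claim.lower() in passage.lower()


passages = [
    "We offer a full refund within 30 days of purchase.",
    "Refunds are processed to your original payment method within 5 business days.",
]
good = (
    "We offer a full refund within 30 days of purchase [Source 1]. "
    "Refunds are processed to your original payment method within 5 business days [Source 2]."
)
drifted = "We offer a full refund within 90 days of purchase [Source 1]."
print(answer_or_fallback(good, passages, toy_supports))
print(answer_or_fallback(drifted, passages, toy_supports))
```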

A RAG system without these mitigations works in demos and fails in production. A RAG system with them is the boring, reliable architecture behind most of the AI assistants you actually trust.

RAG in 2026

The three-step pipeline has not changed since the 2020 Lewis paper, but the components have evolved fast. Hybrid retrieval is now the default; pure-vector retrieval is a starter setup. Serious systems combine BM25, dense vectors, and often a late-interaction model like ColBERTv2 (Santhanam, Khattab, Saad-Falcon, Potts, and Zaharia, 2022), fused via RRF. Rerankers are standard: after retrieval returns 50 or 100 candidates, a cross-encoder reranks them down to the top 5 that go into the prompt.

Multi-modal RAG retrieves images, tables, and charts alongside text. Agentic RAG lets the LLM decide for itself when to retrieve, what to retrieve, and whether to refine after the first batch. Instead of a fixed loop, the model has a "search" tool it can call multiple times. Agentic RAG is more expensive per query but handles open-ended research better than the classical pipeline. ChatRaj's production stack uses hybrid retrieval with a reranker on top, which is enough for almost every customer-facing chatbot use case.
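
The control flow of agentic RAG can be sketched as a loop in which the model chooses between searching again and answering. Everything here is a stand-in: `decide` plays the role of the LLM, `toy_search` plays the role of the retriever, and the `max_searches` budget is an assumed safeguard against endless searching.

```python
from typing import Callable

Action = tuple[str, str]  # ("search", query) or ("answer", text)


def agentic_answer(
    question: str,
    decide: Callable[[str, list[str]], Action],
    search: Callable[[str], list[str]],
    max_searches: int = 3,
) -> tuple[str, list[str]]:
    """Let `decide` call `search` repeatedly until it answers or the budget runs out."""
    evidence: list[str] = []
    searches_used = 0
    while True:
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, evidence
        if searches_used >= max_searches:
            return "I do not know.", evidence
        evidence.extend(search(payload))
        searches_used += 1


TOY_CORPUS = {"refund": "We offer a full refund within 30 days of purchase."}


def toy_search(query: str) -> list[str]:
    """Stand-in retriever: return passages whose key appears in the query."""
    return [text for key, text in TOY_CORPUS.items() if key in query.lower()]


def scripted_decide(question: str, evidence: list[str]) -> Action:
    """Stand-in for the LLM: search until some evidence exists, then answer."""
    if not evidence:
        return ("search", question)
    return ("answer", f"{evidence[0]} [Source 1]")


answer, evidence = agentic_answer("What is your refund policy?", scripted_decide, toy_search)
print(answer)
```

Each extra search is another retrieval plus another model call, which is where the higher per-query cost comes from.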

What RAG is not

A few common misconceptions are worth clearing up.

RAG does not make the LLM smarter. It gives the model access to text it did not previously know about. The model's reasoning ability and general intelligence are unchanged.

RAG is not the same as a search engine. A search engine returns a list of documents and lets the user read them. RAG returns a synthesized answer with citations. A chatbot that just returns search results is not what most users want when they ask a question.

RAG is not a substitute for clean source content. If your knowledge base is contradictory or out of date, RAG will surface those contradictions in answers. The first step in any RAG project is getting the content to a state where you would be comfortable having a human read it aloud.

RAG is not a replacement for fine-tuning when fine-tuning is the right tool. For teaching a specific voice, a domain-specific output format, or a task the base model is bad at, fine-tuning is the right answer. RAG and fine-tuning are complementary; the most sophisticated systems use both.

Install guide

How to apply RAG to your own data

7 steps. Most operators finish in 60 seconds.

  1. Pick a source corpus and clean it

    Decide which body of text the LLM should answer from. A help center, product docs, a knowledge base. Then clean it: remove duplicates, archive outdated pages, and confirm the content is accurate. RAG surfaces whatever is in the corpus, so contradictions and stale content become user-visible mistakes.

  2. Chunk the corpus into retrieval-sized passages

    Split each document into chunks of roughly 300 to 800 tokens with 10 to 20 percent overlap between adjacent chunks. Smaller chunks improve retrieval precision; larger chunks preserve more context. Most production systems start at 500 tokens with 50 tokens of overlap and tune from there.

  3. Embed every chunk into a vector index

    Run each chunk through an embedding model (BGE, OpenAI text-embedding-3, Cohere embed, Voyage, etc.) and store the resulting vectors in a vector database. Pinecone, Weaviate, Qdrant, and Postgres with the pgvector extension are all reasonable choices. Pick whichever you already operate.

  4. Build a BM25 index alongside the vector index

    Index the same chunks in a keyword search engine (Elasticsearch, Meilisearch, Postgres full-text, Tantivy). Vector-only retrieval is bad at exact-term matching; BM25 fills that gap. Hybrid retrieval consistently beats either system alone on real-world queries.

  5. Implement the hybrid retrieval pipeline

    On each query, embed the question and retrieve the top 20 from the vector index. In parallel, retrieve the top 20 from BM25. Fuse the two ranked lists with Reciprocal Rank Fusion: score = sum(1 / (k + rank_in_system)) across both systems with k=60. Keep the top 5 to 10 after fusion.

  6. Augment the prompt and call the LLM

    Paste the retrieved passages into a prompt template that instructs the model to answer using only those passages, to cite each claim by passage number, and to say it does not know when the passages do not contain the answer. Send the prompt to the LLM; stream the response to the user.

  7. Verify citations and monitor the failure surface

    Log every query, every retrieved passage set, and every generated answer. Sample regularly to confirm citations actually support the claims they cite. Track an Unanswered metric (queries where the system declined to answer or where the user thumbs-downed the response) and feed that signal back into content gaps in your source corpus.
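
The logging and Unanswered metric from step 7 might look like the sketch below. The `QueryLog` fields and function names are assumptions for illustration; a real system would write these records to a database rather than hold them in a list.

```python
from dataclasses import dataclass


@dataclass
class QueryLog:
    """One logged interaction: the question, what was retrieved, and the outcome."""
    question: str
    passage_ids: list[str]
    answer: str
    declined: bool = False      # the system said it did not know
    thumbs_down: bool = False   # the user rated the answer negatively


def is_unanswered(entry: QueryLog) -> bool:
    return entry.declined or entry.thumbs_down


def unanswered_rate(logs: list[QueryLog]) -> float:
    """Share of queries that were declined or thumbs-downed."""
    if not logs:
        return 0.0
    return sum(1 for entry in logs if is_unanswered(entry)) / len(logs)


def content_gap_candidates(logs: list[QueryLog]) -> list[str]:
    """Questions worth reviewing for missing or weak source content."""
    return [entry.question for entry in logs if is_unanswered(entry)]


logs = [
    QueryLog("What is your refund policy?", ["refunds#1"], "30 days [Source 1]."),
    QueryLog("Do you ship to Norway?", [], "I do not know.", declined=True),
    QueryLog("How do I reset my password?", ["account#3"], "Use the link [Source 1].", thumbs_down=True),
    QueryLog("What are your hours?", ["contact#1"], "9 to 5 [Source 1]."),
]
print(unanswered_rate(logs), content_gap_candidates(logs))
```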

ChatRaj on RAG

Three ways to make an LLM use your data

When each approach makes sense.

The alternatives

Fine-tuning and prompt engineering

How the two non-retrieval approaches compare on each criterion.

  • Updates knowledge at query time: Fine-tuning: no, must retrain. Prompt engineering: only if you change every prompt.
  • Works for knowledge bases larger than the context window: Fine-tuning: yes, but training cost grows. Prompt engineering: no.
  • Per-query cost: Fine-tuning: low (model already knows). Prompt engineering: high (long context every time).
  • Setup cost: Fine-tuning: high (training run + eval). Prompt engineering: very low.
  • Produces verifiable citations: Fine-tuning: no, knowledge baked into weights. Prompt engineering: yes, by quoting the prompt.
  • Teaches the model a new style or format: Fine-tuning: yes. Prompt engineering: partially via few-shot examples.
  • Handles real-time data (news, prices, status): Fine-tuning: no. Prompt engineering: yes, if you fetch and inline.
  • Lets non-engineers update knowledge: Fine-tuning: no, requires ML pipeline. Prompt engineering: yes, by editing the prompt.
  • Failure mode when knowledge is missing: Fine-tuning: silently hallucinates. Prompt engineering: silently hallucinates.
  • Best for: Fine-tuning: style, format, narrow tasks. Prompt engineering: small static knowledge, prototypes.
  • Common combination: Fine-tune for style + RAG for facts is a common pattern in production.
  • Maintenance burden: Fine-tuning: retrain when knowledge changes. Prompt engineering: edit prompt by hand.
The ChatRaj approach

One script tag. Everything bundled.

Hosted, configured, and maintained by us. You add a single line to your site.

  • Updates knowledge at query time: RAG: yes, just reindex changed documents.
  • Works for knowledge bases larger than the context window: RAG: yes, retrieval scales independently of context window.
  • Per-query cost: RAG: moderate (only retrieved passages in context).
  • Setup cost: RAG: moderate (chunk, embed, index, retrieve).
  • Produces verifiable citations: RAG: yes, citations link to retrieved passages.
  • Teaches the model a new style or format: RAG: no, RAG is about facts not behavior.
  • Handles real-time data (news, prices, status): RAG: yes, retrieval can hit live indexes.
  • Lets non-engineers update knowledge: RAG: yes, just edit source documents.
  • Failure mode when knowledge is missing: RAG: can be configured to say 'I do not know'.
  • Best for: RAG: large and changing knowledge bases with citations.
  • Common combination: Most serious systems use RAG as the primary technique.
  • Maintenance burden: RAG: automatic reindex on source change.
FAQ: RAG

Common RAG questions

What does RAG stand for?

Retrieval-Augmented Generation. The name is a literal description of the three steps: retrieve relevant passages from a knowledge base, augment the user's prompt with those passages, and generate the answer with an LLM.
