The plain-English RAG definition
Retrieval-Augmented Generation, almost always called RAG, is a way to make a large language model answer questions using a specific body of text that the model was never trained on. That body of text might be your help center, your product docs, six months of support tickets, or the archive of a research lab. RAG lets the LLM cite that material in its answer without anyone retraining the model.
The name describes the architecture exactly. At query time, a retrieval step pulls relevant passages out of your knowledge base. Those passages are added to the user's prompt, augmenting it with evidence. The combined prompt is sent to a generation model (the LLM), which writes the final answer. Three steps: retrieve, augment, generate.
RAG exists because base LLMs have two problems that show up the minute you try to use them for real work. The first is hallucination: a model trained on the public internet has learned the statistical shape of confident-sounding answers, and will produce one even when it has no idea what the truth is. The second is the knowledge cutoff: every model is trained on a snapshot of text that ends on a specific date and has no idea what happened after, or what is true inside your private systems. RAG addresses both by giving the LLM the exact passages to use as evidence.
The technique was formalized in a 2020 NeurIPS paper by Patrick Lewis and colleagues at what was then Facebook AI Research, titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." That paper showed that combining a parametric model (the LLM) with a non-parametric memory (an external index of text) produced more factual and specific answers than the LLM alone. Six years later, RAG is the default pattern for building production AI assistants over private data.
The 3-step RAG architecture
Step one is retrieval. You take the user's question and run it against an index of your knowledge base. The index returns a small ranked list of passages, typically five to twenty. The retrieval mechanism is usually a vector database (semantic similarity), a keyword search engine (exact-term overlap), or both.
Step two is augmentation. You paste the retrieved passages into the prompt sent to the LLM. A typical augmented prompt: "Here are passages from our knowledge base. [PASSAGE 1] [PASSAGE 2] [PASSAGE 3]. Using only these passages, answer: [QUESTION]." That preamble tells the LLM what counts as ground truth (the passages) and what does not (its own pretraining memory).
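As a rough illustration of the augmentation step, here is a minimal sketch that builds the prompt quoted above. The function name build_augmented_prompt is made up for this example, and numbering the passages is an assumption that makes them easier to cite later.

```python
def build_augmented_prompt(question: str, passages: list[str]) -> str:
    """Paste retrieved passages ahead of the question, numbered for citation."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Here are passages from our knowledge base.\n"
        f"{numbered}\n"
        f"Using only these passages, answer: {question}"
    )


if __name__ == "__main__":
    print(build_augmented_prompt(
        "What is your refund policy?",
        ["We offer a full refund within 30 days of purchase."],
    ))
```

The instruction goes last so the question is the final thing the model reads. Whether that ordering matters depends on the model, so treat it as a starting point.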
Step three is generation. The LLM produces an answer. Production RAG systems ask the LLM to cite which passage each claim came from, so the user can verify the answer rather than trust it blindly. From the user's side, an answer that cannot show its sources is indistinguishable from a hallucination.
Production systems wrap more machinery around this loop: rewriting the question into a better retrieval query, deduplicating passages, reranking with a stronger model, and post-processing citations. But the three-step skeleton is always present.
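The three-step skeleton can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: answer_with_rag, toy_retrieve, and stub_generate are hypothetical names, the two-sentence knowledge base is invented, and the "LLM" is a stub that returns the first passage so the example runs without any external service.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]
Generator = Callable[[str], str]


def answer_with_rag(question: str, retrieve: Retriever,
                    generate: Generator, top_k: int = 5) -> str:
    passages = retrieve(question, top_k)                      # 1. retrieve
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    prompt = (                                                # 2. augment
        "Answer using only the passages below.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    return generate(prompt)                                   # 3. generate


# Toy stand-ins (assumptions for illustration only).
KNOWLEDGE_BASE = [
    "We offer a full refund within 30 days of purchase.",
    "Support can be reached by email on weekdays.",
]


def toy_retrieve(question: str, top_k: int) -> list[str]:
    """Rank passages by how many question words they share."""
    words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def stub_generate(prompt: str) -> str:
    """A real system would call an LLM here; this returns the first passage line."""
    return prompt.splitlines()[1]


if __name__ == "__main__":
    print(answer_with_rag("what is your refund policy", toy_retrieve, stub_generate))
```

Passing the retriever and generator in as functions keeps the skeleton fixed while the extra machinery (query rewriting, reranking, citation post-processing) is swapped in around it.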
How retrieval works in practice
The retrieval step is the part of RAG that gets the most engineering attention because it determines answer quality. If retrieval surfaces the wrong passages, generation cannot recover.
Retrieval starts with chunking. You cannot index a thousand-page document as one giant blob; the retrieval step needs to return short focused passages, not entire books. So you split each document into chunks of a few hundred tokens, with some overlap between adjacent chunks so that a sentence straddling a boundary still appears whole in at least one chunk. Choosing a good chunk size is more art than science. Too small and chunks lose context; too large and the LLM sees too much irrelevant text along with the relevant bit.
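A sliding window is one simple way to chunk with overlap. The sketch below assumes the document has already been tokenized (whitespace splitting stands in for a real tokenizer here), and the default sizes are illustrative values, not recommendations.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 300,
                 overlap: int = 30) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    words = "one two three four five six seven eight nine ten".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(chunk)
```

Real pipelines often split on sentence or heading boundaries first and fall back to fixed windows only for long sections. The fixed window shown here is the simplest variant.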
Each chunk gets turned into an embedding: a list of numbers (usually 384 to 3072 dimensions) that represents the meaning of the chunk. Embeddings are produced by a small neural network trained for this job. Open-source models like the BGE family from BAAI (Xiao et al., "C-Pack: Packed Resources For General Chinese Embeddings", 2023) and proprietary models from OpenAI, Cohere, and Voyage all produce vectors with the property that semantically similar passages end up near each other in vector space. The query gets embedded the same way; vector search finds the chunks whose embeddings are nearest by cosine similarity.
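To show what "nearest by cosine similarity" means, here is a brute-force sketch. The 3-dimensional vectors are hand-made stand-ins for real embeddings, and the chunk names are hypothetical. A vector database does the same comparison with an approximate index instead of scanning every chunk.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]],
                   top_k: int = 2) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the top_k most similar."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
CHUNK_VECS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(nearest_chunks([0.8, 0.2, 0.0], CHUNK_VECS))
```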
Pure vector search has a well-known failure mode: it is bad at exact-term matching. If a user asks "what is error code E-204" and the docs contain that literal string in one specific paragraph, vector search might return passages about general error handling instead. Keyword search handles this case naturally. BM25, described in "The Probabilistic Relevance Framework: BM25 and Beyond" (Robertson and Zaragoza, 2009), ranks passages by how often each query term appears in the passage, weighted by how rare that term is across the whole collection and adjusted for passage length. It is decades old and still the strongest baseline for exact-term retrieval.
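For readers who want to see the scoring, here is a compact sketch of Okapi BM25. The parameter values k1 = 1.5 and b = 0.75 are commonly used defaults rather than anything prescribed by this article, the IDF uses the non-negative "1 +" variant, tokenization is plain whitespace splitting, and the three documents are invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration.
DOCS = [
    "error code e-204 means the upload exceeded the size limit",
    "general guidance on error handling and retries",
    "how to reset your password",
]

if __name__ == "__main__":
    print(bm25_scores("error code e-204", DOCS))
```

The document containing the literal string scores highest because "e-204" and "code" are rare in the collection, which is exactly the case where vector search tends to struggle.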
Hybrid retrieval runs both systems and fuses the rankings. Reciprocal Rank Fusion, introduced by Cormack, Clarke, and Büttcher in a 2009 SIGIR paper, is the standard fusion algorithm: in each system a document earns a score of 1 divided by (k plus its rank in that system), where k is a small constant, and the scores are summed across systems. RRF is one line of math and has the useful property of not caring about how each system's scores are calibrated. It just cares about ranks. Hybrid retrieval typically outperforms either system alone on real-world queries.
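The fusion rule is short enough to write out. In this sketch the chunk IDs are invented, and k = 60 is the constant commonly used with RRF; treat it as a tunable assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    vector_ranking = ["c7", "c2", "c9"]   # hypothetical vector search result
    bm25_ranking = ["c2", "c4", "c7"]     # hypothetical BM25 result
    for doc_id, score in reciprocal_rank_fusion([vector_ranking, bm25_ranking]):
        print(doc_id, round(score, 5))
```

A chunk that appears in both lists (c2, c7) beats one that appears in only one list, even when that one was ranked near the top. Only positions are used, never the raw similarity or BM25 scores.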
A separate strand of retrieval research, late-interaction models like ColBERT (Khattab and Zaharia, SIGIR 2020), encodes the query and the document independently, keeps one embedding per token instead of one per passage, and computes similarity at the token level rather than the document level. Because it stores a vector for every token, ColBERT accepts a larger index in exchange for better retrieval quality, and it is increasingly common in high-stakes RAG systems.
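The token-level scoring used by late-interaction models is often called MaxSim: for each query token, take its best match among the document's tokens, then sum those maxima. The sketch below shows only that scoring rule on hand-made 2-dimensional vectors. It does not include the encoder, and the vectors are assumptions for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs: list[list[float]],
                 doc_token_vecs: list[list[float]]) -> float:
    """Sum over query tokens of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]                 # two query tokens
    doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]     # matches both tokens exactly
    doc_b = [[0.6, 0.8], [0.8, 0.6]]                 # only partial matches
    print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```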
A worked example using ChatRaj
To make the pipeline concrete, here is what happens when a visitor asks a ChatRaj-powered chatbot "what is your refund policy."
When the operator first set up the bot, ChatRaj crawled the website. Each page was chunked into roughly 500-token passages with 50 tokens of overlap. Each chunk was embedded into a 768-dimensional vector and stored in a per-bot vector index, alongside a BM25 inverted index. The refund-policy page produced a chunk containing the literal sentence "We offer a full refund within 30 days of purchase."
When the visitor sends the question, the backend embeds the query into the same vector space. Vector search retrieves the top twenty chunks whose embeddings are most similar. In parallel, BM25 retrieves the top twenty chunks containing the exact terms "refund" and "policy." The two ranked lists are fused with Reciprocal Rank Fusion and the top five are kept.
Those five passages are pasted into a prompt template like: "Answer using only the passages below. Cite each claim with [Source N]. Passages: [1] We offer a full refund within 30 days of purchase. [2] ... [QUESTION] What is your refund policy?" The LLM generates: "We offer a full refund within 30 days of purchase [Source 1]. Refunds are processed to your original payment method within 5 business days [Source 2]." The visitor can click each [Source N] to verify the passage. If the question is outside the knowledge base, the LLM says it does not know rather than making up an answer. The pipeline runs in under a second.
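Making the [Source N] markers clickable requires pulling them out of the generated text. The sketch below is a guess at what that post-processing could look like, not ChatRaj's actual code. The function names are hypothetical, and it only checks that each cited number points at a passage that was really in the prompt.

```python
import re

CITATION_PATTERN = re.compile(r"\[Source (\d+)\]")


def cited_sources(answer: str) -> list[int]:
    """Return the source numbers cited in the answer, in order of appearance."""
    return [int(n) for n in CITATION_PATTERN.findall(answer)]


def invalid_citations(answer: str, num_passages: int) -> list[int]:
    """Return cited numbers that do not correspond to any supplied passage."""
    return [n for n in cited_sources(answer) if not 1 <= n <= num_passages]


if __name__ == "__main__":
    answer = (
        "We offer a full refund within 30 days of purchase [Source 1]. "
        "Refunds are processed to your original payment method within "
        "5 business days [Source 2]."
    )
    print(cited_sources(answer))
    print(invalid_citations(answer, num_passages=5))
```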
RAG vs fine-tuning vs prompt engineering
RAG is one of three common techniques for getting an LLM to use specific information. The other two are fine-tuning (continuing the model's training on your data) and prompt engineering (stuffing the information directly into every prompt).
Fine-tuning bakes knowledge into the model's weights. It is the right tool for teaching a style, a tone, a domain-specific format, or a task the base model is not good at. It is the wrong tool for "give the model access to my changing knowledge base," because every update requires another fine-tuning run.
Prompt engineering stuffs everything into the context window. It works when the entire knowledge base fits in a single prompt and when you do not mind paying for that context on every call. It does not scale to knowledge bases larger than the context window, and it gets expensive fast.
RAG is the choice when the knowledge base is large, when it changes frequently, when you want citations, and when you want each query to pay only for the passages it actually needs. Most production AI chatbots that answer from private or fast-changing content in 2026 use RAG for these reasons.
Common RAG failures (and how production systems mitigate them)
Three failure modes show up over and over. Knowing what they look like is most of the battle.
Retrieval irrelevance is when the retriever returns passages that are not actually relevant, and the LLM either makes something up or refuses to answer. Mitigations: hybrid retrieval, query rewriting, and a reranker that re-scores the top-N candidates with a stronger cross-encoder before passing them to the LLM.
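The reranking step has a simple shape: score each (query, passage) pair with a stronger model and keep the best few. In this sketch the scorer is passed in as a function. overlap_scorer is a deliberately crude word-overlap stand-in for a cross-encoder, used only so the example runs without a model, and the candidate passages are invented.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Re-score candidates against the query and keep the top_n."""
    ordered = sorted(candidates,
                     key=lambda passage: score_pair(query, passage),
                     reverse=True)
    return ordered[:top_n]


def overlap_scorer(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(passage.lower().split())) / len(query_words)


if __name__ == "__main__":
    candidates = [
        "shipping takes five days",
        "refund policy lasts 30 days",
        "our refund team",
    ]
    print(rerank("refund policy", candidates, overlap_scorer, top_n=2))
```

In a real system score_pair would wrap a cross-encoder that reads the query and passage together, which is slower than vector search and is why it runs only on the shortlist.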
Stale content is when the knowledge base has not been re-indexed since a document changed. Mitigations: automatic recrawling on a schedule, webhook-triggered reindex when the source CMS publishes, and a freshness signal so newer passages are preferred when two contradict.
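One inexpensive way to decide what to re-index on a scheduled recrawl is to compare a hash of each page's current text with the hash stored at indexing time. This is an assumption about how such a check might be built, not something the mitigations above prescribe. The URLs and function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_needing_reindex(current_pages: dict[str, str],
                          indexed_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose text is new or differs from what was indexed."""
    return sorted(
        url for url, text in current_pages.items()
        if indexed_hashes.get(url) != content_hash(text)
    )


if __name__ == "__main__":
    indexed = {
        "/refunds": content_hash("Refunds within 30 days."),
        "/about": content_hash("About us."),
    }
    crawled = {
        "/refunds": "Refunds within 14 days.",   # changed since indexing
        "/about": "About us.",                   # unchanged
        "/pricing": "New pricing page.",         # never indexed
    }
    print(pages_needing_reindex(crawled, indexed))
```

Pages deleted from the site need separate handling (removing their chunks from the index), which this sketch leaves out.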
Citation drift is when the LLM cites a passage but the cited passage does not actually support the claim. Mitigations: post-generation citation verification (a second model checks whether each cited passage entails the cited claim) and falling back to "I don't know" when verification fails.
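The verify-then-fall-back flow can be sketched as follows. The entailment check is passed in as a function. words_contained is a crude lexical stand-in for the second model (it only checks that every word of the claim appears in the cited passage) and would be far too strict or too lenient for real use. All names here are hypothetical, and the citation format is assumed to be [Source N].

```python
import re
from typing import Callable

CITATION = re.compile(r"\[Source (\d+)\]")


def verify_citations(answer: str, passages: list[str],
                     supports: Callable[[str, str], bool]) -> bool:
    """True only if every cited sentence is supported by each passage it cites."""
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = CITATION.sub("", sentence).strip()
        for number in CITATION.findall(sentence):
            index = int(number) - 1
            if not 0 <= index < len(passages):
                return False  # cites a passage that was never supplied
            if not supports(passages[index], claim):
                return False
    return True


def words_contained(passage: str, claim: str) -> bool:
    """Stand-in for an entailment model: every claim word occurs in the passage."""
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return all(word in passage_words for word in re.findall(r"\w+", claim.lower()))


def answer_or_fallback(answer: str, passages: list[str],
                       supports: Callable[[str, str], bool] = words_contained) -> str:
    return answer if verify_citations(answer, passages, supports) else "I don't know."


if __name__ == "__main__":
    passages = ["We offer a full refund within 30 days of purchase."]
    print(answer_or_fallback(
        "We offer a full refund within 90 days of purchase [Source 1].", passages))
```

Sentences without any citation pass through unchecked in this sketch. A stricter system might reject those as well.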
A RAG system without these mitigations works in demos and fails in production. A RAG system with them is the boring, reliable architecture behind most of the AI assistants you actually trust.
RAG in 2026
The three-step pipeline has not changed since the 2020 Lewis paper, but the components have evolved fast. Hybrid retrieval is now the default; pure-vector retrieval is a starter setup. Serious systems combine BM25, dense vectors, and often a late-interaction model like ColBERTv2 (Santhanam, Khattab, Saad-Falcon, Potts, and Zaharia, 2022), fused via RRF. Rerankers are standard: after retrieval returns 50 or 100 candidates, a cross-encoder reranks them down to the top 5 that go into the prompt.
Multi-modal RAG retrieves images, tables, and charts alongside text. Agentic RAG lets the LLM decide for itself when to retrieve, what to retrieve, and whether to refine after the first batch. Instead of a fixed loop, the model has a "search" tool it can call multiple times. Agentic RAG is more expensive per query but handles open-ended research better than the classical pipeline. ChatRaj's production stack uses hybrid retrieval with a reranker on top, which is enough for almost every customer-facing chatbot use case.
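The agentic loop described above differs from the classical pipeline mainly in control flow: the model chooses between searching again and answering. The sketch below captures that flow with a scripted decide function standing in for the LLM. The names, the ("search", query) / ("answer", text) convention, and the step budget are all assumptions made for illustration.

```python
from typing import Callable

Decision = tuple[str, str]  # ("search", query) or ("answer", text)


def agentic_answer(question: str,
                   search: Callable[[str], list[str]],
                   decide: Callable[[str, list[str]], Decision],
                   max_steps: int = 3) -> str:
    """Let the decider call the search tool up to max_steps times, then answer."""
    evidence: list[str] = []
    for _ in range(max_steps):
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload
        evidence.extend(search(payload))
    # Budget exhausted: accept a final answer if one is offered, else give up.
    action, payload = decide(question, evidence)
    return payload if action == "answer" else "I don't know."


# Toy stand-ins (assumptions): a dictionary "index" and a scripted decider.
TOY_INDEX = {"refund": ["We offer a full refund within 30 days of purchase."]}


def toy_search(query: str) -> list[str]:
    return TOY_INDEX.get(query, [])


def scripted_decide(question: str, evidence: list[str]) -> Decision:
    if not evidence:
        return ("search", "refund")
    return ("answer", evidence[0])


if __name__ == "__main__":
    print(agentic_answer("what is your refund policy", toy_search, scripted_decide))
```

The step budget is what keeps the extra per-query cost bounded, since each search call adds another retrieval round and another model call.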
What RAG is not
A few common misconceptions are worth clearing up.
RAG does not make the LLM smarter. It gives the model access to text it did not previously know about. The model's reasoning ability and general intelligence are unchanged.
RAG is not the same as a search engine. A search engine returns a list of documents and lets the user read them. RAG returns a synthesized answer with citations. A chatbot that just returns search results is not what most users want when they ask a question.
RAG is not a substitute for clean source content. If your knowledge base is contradictory or out of date, RAG will surface those contradictions in answers. The first step in any RAG project is content you would be comfortable having a human read aloud.
RAG is not a replacement for fine-tuning when fine-tuning is the right tool. For teaching a specific voice, a domain-specific output format, or a task the base model is bad at, fine-tuning is the right answer. RAG and fine-tuning are complementary; the most sophisticated systems use both.