
Document chunking

Document chunking is the step in a RAG pipeline that splits long source documents into shorter passages so each one fits an embedding model's context window and so retrieval can surface the exact span that answers a question.

Bottom line
Strategy, chunk size, and overlap quietly determine a large share of final answer quality.

What document chunking actually is

Document chunking is the preprocessing step that turns a long document (a 40-page PDF, a knowledge base export, a transcript) into a list of shorter passages that you can embed and store one by one. Each passage has to fit inside the embedding model's context window, and each one has to be small enough that the retriever can return just the relevant span rather than a wall of text wrapped around it.

The name suggests something mechanical, like slicing bread, and in the simplest form that is exactly what it is. But the choice of where to cut and how big each slice gets is one of the most undervalued tuning knobs in retrieval-augmented generation. Two stacks with the same embedding model, the same vector database, and the same LLM can produce noticeably different answer quality based on chunking alone. It is the quiet quality lever most teams set once during their first prototype and never revisit.

The four chunking strategies you should know

There are four chunking patterns that cover almost every production RAG pipeline you will encounter.

  1. Fixed-size chunking. Cut every N tokens. Simple, fast, predictable on cost, and easy to reason about. The cost is that boundaries fall in the middle of sentences and sometimes in the middle of words, which dilutes the embedding signal and occasionally orphans the actual answer across two chunks.
  2. Recursive character splitting. This is LangChain's general-purpose splitter (RecursiveCharacterTextSplitter) and the practical baseline for most teams. The splitter walks a list of separators from coarsest to finest and falls back to the next level only when a piece is still too large. LangChain's default list is paragraph breaks, then newlines, then spaces, then individual characters; sentence-ending punctuation can be added as a custom separator. The result respects natural document structure most of the time, with predictable size as the safety net.
  3. Semantic chunking. Embed each sentence, then split where the embedding similarity between adjacent sentences drops below a threshold. Boundaries land where the topic actually changes rather than where a token count runs out. Indexing is slower because you embed twice, once to find boundaries and once to index, but the boundary quality often shows up in retrieval recall.
  4. Late chunking. Introduced by Günther et al. at Jina AI in 2024. Instead of chunking the text and then embedding each chunk in isolation, you first run the full document through a long-context embedding model to produce per-token contextualized vectors, then pool the token vectors inside each chunk's span (typically by mean pooling) into one chunk embedding. Each chunk's vector carries information from the surrounding document, which preserves global context that chunk-then-embed pipelines lose.
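
The recursive strategy from the list above can be sketched in plain Python. This is a minimal illustration and not LangChain's implementation: the function name `recursive_split`, the character-based length limit, and the separator list are assumptions made for the example, and overlap is left out to keep the fallback logic visible.

```python
from typing import List

# Separators from coarsest to finest; the empty string means "cut by character".
SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text: str, max_len: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for part in text.split(sep):
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= max_len:
            buffer = candidate  # keep merging small parts at this level
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(part) <= max_len:
            buffer = part
        else:
            # This part alone is too large: fall back to the next, finer separator.
            chunks.extend(recursive_split(part, max_len, finer))
    if buffer:
        chunks.append(buffer)
    return chunks


policy = (
    "Refunds are accepted within 30 days.\n\n"
    "Shipping takes five business days.\n\n"
    "Contact support by email."
)
pieces = recursive_split(policy, max_len=40)
```

With a 40-character limit, each paragraph of the sample text fits on its own, so the splitter never has to descend past the paragraph separator. A string with no separators at all would fall through to the hard character cut.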

A useful mental model: fixed-size and recursive chunking split on surface form (token counts and separators), while semantic and late chunking bring in information about meaning. The first pair is cheap; the second pair pays extra embedding cost at indexing time in exchange for retrieval quality gains that can carry over to every later query.
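
To make the "split on meaning" idea concrete, here is a small sketch of semantic chunking. The `toy_embed` function is a stand-in assumption (a bag-of-words count vector) so the example runs without a model; in a real pipeline you would pass your own sentence-embedding function instead. The names `semantic_chunks` and `threshold` are also illustrative.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    threshold: float,
    embed: Callable[[str], Counter] = toy_embed,
) -> List[str]:
    """Start a new chunk wherever similarity between adjacent sentences drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])    # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Refunds are available within 30 days of purchase.",
    "Refunds after 30 days require manager approval.",
    "Shipping is free on orders over 50 dollars.",
    "Shipping to Canada takes five business days.",
]
topic_chunks = semantic_chunks(sentences, threshold=0.1)
```

On this sample the two refund sentences stay together and the two shipping sentences stay together, because the only pair with no shared vocabulary is the refund-to-shipping transition. The threshold is data-dependent and would need tuning against a real embedding model.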

Why chunking matters for AI chatbots

The chatbot does not see your documents. It sees the chunks your retriever returned. If chunking is wrong, the bot is reasoning over the wrong evidence even when the answer was sitting somewhere in your corpus the whole time.

Two failure modes recur in practice. Too small, and the chunk lacks the context that makes it interpretable: the bot sees "30 days" but not what the 30 days refers to. Too big, and the chunk's embedding becomes a smear of multiple topics; dense retrieval then ranks it below a tightly focused chunk on a less relevant topic, and the answer never surfaces. A good target is the size at which one chunk holds roughly one question's worth of context.

Overlap is the second knob and the one most teams forget. (It is sometimes confused with chunk stride; stride is the distance between the starts of consecutive chunks, that is, chunk size minus overlap.) If a key sentence falls right on a chunk boundary, part of it ends up in chunk A and the rest in chunk B, and neither chunk is a strong match on its own. Adding 10 to 20 percent overlap between chunks means boundary content lives in both neighbors, so a thought shorter than the overlap survives intact in at least one of them.
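
A sliding window shows how overlap keeps boundary content in both neighbors. This is a minimal sketch that assumes the text is already tokenized into a list; the function name `chunk_with_overlap` and the toy token list are illustrative only.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(12)]
windows = chunk_with_overlap(tokens, chunk_size=5, overlap=2)
```

Here the last two tokens of each window reappear as the first two tokens of the next one, so anything that fits inside those two tokens is never cut in half.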

ChatRaj uses recursive character splitting with roughly 500-token chunks and roughly 50-token overlap as its default for general website content. That target sits at the upper end of the 200 to 500 token band commonly used for question answering over knowledge bases, and it works well across the mix of pages, blog posts, and documentation that a typical site indexes.

Picking a chunk size: 200, 500, or 1000 tokens?

There is no universal best chunk size, but there are sensible defaults by use case.

  • 100 to 200 tokens for FAQ style answers, short policies, and structured Q&A content. Each chunk is roughly the size of one answer, which makes retrieval feel almost exact.
  • 200 to 500 tokens for general knowledge base question answering. This is the sweet spot for most chatbot use cases and a band commonly suggested as a starting point in RAG tooling guides. Library defaults vary, so set the size explicitly rather than relying on them.
  • 800 to 1000 tokens for summarization-heavy retrieval, long form policy documents, or any case where the model needs more surrounding context to answer correctly.

The other dimension is overlap. Ten to twenty percent of the chunk size is the well-worn default. For a 500-token chunk that comes to 50 to 100 tokens of overlap (so each new chunk starts 400 to 450 tokens after the previous one), which is enough to keep boundary content alive without bloating the index.
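
The effect of overlap on index size is simple arithmetic. The sketch below assumes fixed-size chunks and a hypothetical 10,000-token document; `overlap_tokens` and `chunk_count` are illustrative helper names.

```python
import math


def overlap_tokens(chunk_size: int, fraction: float) -> int:
    """Overlap expressed as a fraction of the chunk size, rounded to whole tokens."""
    return round(chunk_size * fraction)


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks needed to cover a document with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Each chunk after the first adds (chunk_size - overlap) new tokens.
    return math.ceil((total_tokens - overlap) / (chunk_size - overlap))


doc_tokens = 10_000  # hypothetical document length
no_overlap = chunk_count(doc_tokens, 500, 0)
ten_percent = chunk_count(doc_tokens, 500, overlap_tokens(500, 0.10))
twenty_percent = chunk_count(doc_tokens, 500, overlap_tokens(500, 0.20))
```

For this document, 500-token chunks give 20 chunks with no overlap, 23 with 50 tokens of overlap, and 25 with 100 tokens, so the default band costs roughly 15 to 25 percent more stored chunks.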

Two content types need special handling. PDF tables almost always need a custom splitter, because row and column structure carries the meaning and a generic recursive splitter mangles it. Dialog and transcript content benefits from speaker-aware boundaries rather than raw token counts, because cutting in the middle of a turn destroys the back-and-forth that makes the chunk useful.
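
Speaker-aware boundaries can be as simple as packing whole turns into a chunk until a budget is reached. The sketch below assumes a transcript that is already one turn per line in a "Speaker: text" layout and uses a whitespace word count as a rough stand-in for tokens; both are assumptions for the example, as is the name `chunk_transcript`.

```python
from typing import List


def chunk_transcript(turns: List[str], max_words: int) -> List[str]:
    """Pack whole speaker turns into chunks; a turn is never cut in the middle."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for turn in turns:
        words = len(turn.split())
        if current and count + words > max_words:
            chunks.append("\n".join(current))  # close the chunk at a turn boundary
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks


turns = [
    "Agent: How can I help you today?",
    "Customer: I want to return my order.",
    "Agent: Returns are accepted within 30 days.",
    "Customer: Great, thank you.",
]
dialog_chunks = chunk_transcript(turns, max_words=15)
```

A single turn longer than the budget is kept whole as its own oversized chunk in this sketch; a production splitter would need a fallback for that case.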

Most teams ship version one with whatever default LangChain or LlamaIndex hands them and never tune it. Spending an afternoon on chunking, for example an A/B test of recursive splitting at 500 tokens against semantic chunking at variable size, is one of the highest-leverage investments you can make in retrieval quality before you reach for a reranking layer.
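
One way to score such an A/B test is a hit rate over a small labeled question set: for each question, check whether any retrieved chunk contains the answer span. The sketch below is an assumption about how you might set this up, not a standard benchmark; the questions, chunks, and the `hit_rate` helper are all hypothetical, and the retrieval step itself is represented by precomputed results.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of questions for which some retrieved chunk contains the expected answer span."""
    if not expected:
        return 0.0
    hits = 0
    for question, answer_span in expected.items():
        chunks = retrieved.get(question, [])
        if any(answer_span in chunk for chunk in chunks):
            hits += 1
    return hits / len(expected)


# Hypothetical labeled set: question -> text span a correct chunk must contain.
expected = {
    "How long is the refund window?": "within 30 days",
    "Is shipping free?": "free on orders over 50 dollars",
}

# Hypothetical top-k retrieval output under two chunking configurations.
config_a = {
    "How long is the refund window?": ["Refunds are accepted within 30 days of purchase."],
    # The answer is split across two chunks, so neither one contains the full span.
    "Is shipping free?": ["Shipping is free on", "orders over 50 dollars."],
}
config_b = {
    "How long is the refund window?": ["Refunds are accepted within 30 days of purchase."],
    "Is shipping free?": ["Shipping is free on orders over 50 dollars."],
}

score_a = hit_rate(config_a, expected)
score_b = hit_rate(config_b, expected)
```

Exact substring matching is a deliberately strict check; it flags the split-answer case in the first configuration, which is the boundary failure that chunk size and overlap are meant to prevent.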

FAQ

Common document chunking questions

What chunk size should I use?

It depends on the use case. 200 to 500 tokens is the sweet spot for question answering over knowledge bases, 800 to 1000 for summarization-heavy retrieval, and 100 to 200 for FAQ style answers where each chunk is roughly one answer.
