
How to train an AI chatbot on PDF documents

The honest 2026 walkthrough. Why scanned PDFs break naive uploads, how to chunk for clean retrieval, and a 5-step recipe using ChatRaj as the concrete example.

Bottom line
Training an AI chatbot on PDFs in 2026 is a 5-step recipe. Audit your PDFs (text vs scanned). Extract text with a parser like pdf-parse or, for scans, OCR via Azure Document Intelligence or Tesseract. Clean and chunk at roughly 500 tokens with 50-token overlap. Embed and index with a model like text-embedding-3-small. Then test retrieval with real visitor questions and fix gaps.
11 min read

What "training a chatbot on PDFs" actually means in 2026

Before the recipe, a quick vocabulary fix. When a vendor says you can train an AI chatbot on a PDF in two minutes, no large language model is being retrained. Training a large model from scratch costs millions of dollars in GPU time. What every mainstream chatbot product does in 2026 is a workflow called retrieval-augmented generation, or RAG. The PDF gets ingested into a private index. At question time, the most relevant slices of that PDF are retrieved and passed to a frozen LLM as context, with instructions to answer using only those slices and to cite the source page.

The LLM never sees the whole PDF. It sees three to ten chunks of it, scored as most relevant to the visitor's question. That is why answer quality on a PDF chatbot depends almost entirely on how cleanly the PDF was extracted, chunked, and indexed. Everything else (which LLM, which UI, which pricing tier) is a smaller lever than the extraction quality of the source.
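To make the retrieve-then-answer flow concrete, here is a minimal sketch. The `Chunk` type, the `score` function, and the prompt wording are all hypothetical; the word-overlap score is a toy stand-in for a real retriever, not how any particular product ranks chunks.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int


def score(question: str, chunk: Chunk) -> int:
    """Toy relevance score: distinct question words that also appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.text.lower().split())
    return len(q_words & c_words)


def build_prompt(question: str, chunks: list[Chunk], top_k: int = 3) -> str:
    """Keep only the top_k highest-scoring chunks and wrap them in an instruction prompt."""
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)[:top_k]
    context = "\n".join(f"[page {c.page}] {c.text}" for c in ranked)
    return (
        "Answer using only the excerpts below and cite the page number.\n"
        f"{context}\n"
        f"Question: {question}"
    )
```

The point of the sketch is the shape of the pipeline: the model only ever receives the few chunks that survive ranking, so anything lost at extraction time never reaches it.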

This piece is about getting that extraction step right, because most "how to train on PDFs" tutorials skip straight from upload to magic and leave you stuck the first time you hand the bot a scanned contract.

Text-layer vs scanned PDFs (and why this changes everything)

The single most important property of a PDF, from a chatbot's point of view, is whether it has a text layer.

A text-layer PDF is one where the words inside the document are stored as actual text characters that any parser can read. PDFs generated from Word, Google Docs, LaTeX, web-to-PDF, or any digital authoring tool almost always have a text layer. If you can select text in the PDF with your cursor and copy it, the text layer is there.

A scanned PDF is the opposite. It is a stack of page images, usually from a flatbed scanner, a phone camera, or a fax archive. There is no machine-readable text inside the file. When you try to select text, your cursor draws a selection rectangle but copies nothing useful. Naive parsing libraries open a scanned PDF, find no text layer, and return an empty string. Your chatbot then confidently tells visitors "I do not have information on that topic," because as far as the index is concerned, the PDF is empty.

The fix for scanned PDFs is optical character recognition (OCR). OCR converts page images to text by running each page through a vision model that recognises characters and reassembles them into reading order. OCR is slower, costs money on hosted services, and has accuracy limits on weird layouts. The first step of any PDF training workflow in 2026 is therefore the audit step: is this PDF text-layer, scanned, or a mix?

A "mix" PDF is more common than people expect. A merged investor deck that combines slides exported from Keynote (text layer) with scanned pages of a signed contract (no text layer) is one document with two extraction strategies. Good tools detect this and apply OCR only on the pages that need it.
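Per-page detection can be sketched in a few lines once you have the text your parser returned for each page. This assumes a list of per-page strings is already available; the 20-character cutoff is an arbitrary illustrative threshold, not a standard.

```python
def classify_pages(page_texts: list[str], min_chars: int = 20) -> list[str]:
    """Label each page 'text' or 'scanned' by how much text the parser returned."""
    return [
        "text" if len(t.strip()) >= min_chars else "scanned"
        for t in page_texts
    ]


def classify_pdf(page_texts: list[str], min_chars: int = 20) -> str:
    """Summarise a document as 'text-layer', 'scanned', or 'mixed'."""
    if not page_texts:
        raise ValueError("document has no pages")
    labels = set(classify_pages(page_texts, min_chars))
    if labels == {"text"}:
        return "text-layer"
    if labels == {"scanned"}:
        return "scanned"
    return "mixed"
```

With the per-page labels in hand, only the pages marked "scanned" need to go through OCR.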

Step 1: audit your PDFs (text vs scanned, page count, language)

Spend ten minutes before you upload anything. For each PDF, note three things.

First, the text-layer test. Open the PDF, try to select a paragraph, and paste it into a notepad. If readable text appears, it is a text-layer PDF. If garbled boxes or nothing pastes, it is scanned. If some pages copy and others do not, it is mixed.

Second, the page count and visual layout. A 12-page product brochure with one column of text is trivial. A 480-page reference manual with footnotes, sidebars, and three-column layouts is hard. Multi-column layouts trip up naive extractors that read left-to-right across columns and produce sentence salad.

Third, the language. English is a solved problem for most parsers. Indic scripts, CJK (Chinese, Japanese, Korean), and right-to-left scripts such as Arabic need parsers and OCR engines that explicitly support them. Mixed-script PDFs (English headings with Hindi body text, for example) need language detection per chunk.

Write this down in a spreadsheet with one row per PDF. The audit takes longer than the upload, and skipping it is the most common reason PDF chatbots fail.
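For the language column of the audit, a cheap first pass is to detect the dominant script of a chunk rather than its language. The sketch below uses only the Python standard library; `dominant_script` is a hypothetical helper, and script detection is a rough proxy (it cannot tell Hindi from Marathi, or English from French).

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common Unicode script prefix among letters, e.g. 'LATIN'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. 'DEVANAGARI LETTER VA'.
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

A chunk that comes back as anything other than LATIN is a signal to check that your parser, OCR engine, and embedding model all support that script.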

Step 2: extract text (text-layer parsing OR OCR)

For text-layer PDFs in Node.js, the two leading options in 2026 are pdf-parse and unpdf. pdf-parse is the long-running standard, wraps Mozilla's pdf.js, returns plain text plus metadata, and has roughly 2 million weekly downloads. unpdf is the newer UnJS-maintained alternative built for modern runtimes (Node, edge, browser) and is the right choice if you need to parse PDFs inside a Cloudflare Worker or a Next.js edge route. For Python workflows, pdfminer.six and PyMuPDF (fitz) are the equivalents, with PyMuPDF generally winning on multi-column layouts.

For scanned PDFs you need OCR. Three honest options in 2026:

Tesseract is the open-source baseline. Free, runs locally, and supports 100+ languages once the matching language data files are installed. Tesseract is fine on clean single-column scans of printed text and weak on multi-column or low-resolution input. You will spend more time tuning Tesseract than you expect.

Azure Document Intelligence is the paid commercial option from Microsoft. The basic Read tier costs $1.50 per 1,000 pages and handles general OCR. The Layout and prebuilt models cost $10 per 1,000 pages and add structured table and form extraction. Free tier covers the first 500 pages per month. Accuracy on multi-column and mixed-layout PDFs is significantly better than Tesseract.

Unstructured.io ships a parsing pipeline specifically designed for LLM ingestion. It handles PDFs, DOCX, HTML, and email; detects layout elements (titles, paragraphs, tables, lists); and can route page images through its own OCR or third-party engines. Open-source library plus a paid hosted API.
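To put the Azure Document Intelligence prices quoted above into budget terms, the arithmetic is linear in page count. This sketch uses only the two per-1,000-page rates stated in this guide; it deliberately ignores the free monthly allowance and any volume discounts, so treat it as an upper-bound estimate.

```python
READ_USD_PER_1000 = 1.50     # Read tier rate quoted in this guide
LAYOUT_USD_PER_1000 = 10.00  # Layout / prebuilt rate quoted in this guide


def ocr_cost_usd(pages: int, usd_per_1000: float) -> float:
    """Linear estimate: ignores the free monthly allowance and volume discounts."""
    return round(pages / 1000 * usd_per_1000, 2)
```

For a 12,000-page archive, `ocr_cost_usd(12000, READ_USD_PER_1000)` gives 18.0 and the Layout rate gives 120.0, which is usually small next to the engineering time spent tuning a free engine.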

The decision rule is straightforward. Text-layer PDFs in English: pdf-parse or unpdf, free. Scanned English PDFs at low volume: Tesseract, free, accept some accuracy loss. Scanned PDFs at production volume, or any document with tables you must preserve: Azure Document Intelligence or Unstructured.io.
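The decision rule above can be written as a small routing function. The function name and boolean flags are illustrative; the rule as stated only covers English documents, so the sketch does too.

```python
def pick_extractor(has_text_layer: bool, needs_tables: bool, production_volume: bool) -> str:
    """Route an English PDF to an extraction tool following the rule in this guide."""
    paid = "Azure Document Intelligence or Unstructured.io"
    if needs_tables:
        # Tables you must preserve override everything else.
        return paid
    if has_text_layer:
        return "pdf-parse or unpdf"
    if production_volume:
        return paid
    # Scanned, low volume: free, with some accuracy loss accepted.
    return "Tesseract"
```

Feeding each row of your audit spreadsheet through a function like this gives you the extraction plan for the whole document set before you upload anything.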

Step 3: clean + chunk

After extraction, raw text needs cleaning. Strip repeated page headers and footers ("Acme Corp Confidential", page numbers, the date the PDF was generated). If you skip this, every chunk in your index ends with the same boilerplate, and retrieval scores get polluted because boilerplate matches almost any query weakly. Most parsers do not strip headers and footers automatically; you write a small post-processor that detects strings repeating on every page and removes them.
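A minimal version of that post-processor might look like the following. It assumes you have one string per page, treats any line that shows up on at least 80% of pages as boilerplate, and normalises digits so "Page 3" and "Page 4" count as the same line. Both the threshold and the digit trick are illustrative choices, not a standard.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    """Normalise a line so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[str], min_share: float = 0.8) -> list[str]:
    """Drop lines that appear on at least min_share of pages (likely headers/footers)."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({_key(ln) for ln in page.splitlines() if ln.strip()})
    threshold = min_share * len(pages)
    boilerplate = {k for k, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if _key(ln) not in boilerplate)
        for page in pages
    ]
```

Spot-check the output on a few pages before indexing: a body line that genuinely repeats on most pages (a recurring table heading, say) would be removed too.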

Then chunk. Industry default in 2026 is roughly 500 tokens per chunk with 50 tokens of overlap between adjacent chunks. The 500-token target is small enough to keep chunks topically focused and large enough to preserve enough context that the LLM can answer without stitching fragments. The 50-token overlap means a short sentence cut at the boundary of one chunk usually still appears whole at the start of the next, so retrieval is less likely to lose context at chunk seams.
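Here is a sliding-window chunker that implements the size-plus-overlap idea. It works on any list of tokens; if you pass whitespace-split words instead of real model tokens, the counts are only an approximation of what the embedding model sees.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

On a 1,200-token document with the defaults this yields three chunks of 500, 500, and 300 tokens, and the last 50 tokens of each chunk are the first 50 of the next.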

Tables deserve special handling. A table that gets chunked mid-row becomes useless: half the columns end up in chunk A and half in chunk B, and the LLM cannot reassemble them. Two practical approaches. The first is to linearize the table row by row before chunking ("Product: X, Price: 19, Stock: 240"). The second is to keep markdown table syntax intact and treat each table as a single chunk regardless of token count. Modern parsers like Unstructured and Azure Document Intelligence detect tables explicitly and let you handle them as separate elements.
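The row-by-row linearization is simple enough to sketch directly. This assumes the parser has already given you the header and the rows as lists of cell strings; `linearize_table` is a hypothetical helper name.

```python
def linearize_table(header: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into one self-describing line of 'column: value' pairs."""
    return [
        ", ".join(f"{col}: {cell}" for col, cell in zip(header, row))
        for row in rows
    ]
```

Calling it with the header `["Product", "Price", "Stock"]` and the row `["X", "19", "240"]` produces the line shown in the example above, and each line can then be chunked without losing its column labels.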

Footnotes and sidebars are similar. A naive extractor interleaves a paragraph and its footnote in reading order and produces nonsense. Layout-aware extractors detect footnotes and either drop them or attach them as separate elements.

ChatRaj's Sources page accepts PDF uploads directly. We parse the text layer with pdf-parse, chunk at roughly 500 tokens with overlap, then embed via the hybrid retrieval pipeline. For scanned PDFs we recommend running OCR upstream and uploading the resulting text-layer PDF, because high-quality OCR is a different engineering investment than retrieval and we would rather you pick the right engine for your document mix.

Step 4: embed + index

Each cleaned chunk gets sent to an embedding model, which returns a fixed-length vector that captures the chunk's meaning. The intuition is that two chunks expressing similar ideas produce vectors that point in similar directions, even when they share no words.
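"Pointing in similar directions" is usually measured with cosine similarity. The sketch below computes it in plain Python; real embedding vectors have hundreds or thousands of dimensions, and the two-dimensional vectors in the usage note are made up purely for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 same direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

`cosine_similarity([1.0, 2.0], [2.0, 4.0])` is approximately 1.0 because the vectors differ only in length, while `cosine_similarity([1.0, 0.0], [0.0, 1.0])` is 0.0.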

For English-dominant content in 2026, OpenAI's text-embedding-3-small is the default workhorse: cheap, fast, 1536-dimensional vectors, strong general performance. text-embedding-3-large is the higher-quality option when you can afford the cost. For non-English or multilingual PDFs, choose a multilingual embedding model explicitly: Cohere's multilingual embeddings or BAAI's bge-multilingual variants are common 2026 picks. Using an English-only model on a Hindi or Japanese PDF produces vectors that carry little of the text's meaning, and retrieval quality collapses.

The vectors and their original chunk text get stored in a vector database. Pinecone, Weaviate, Qdrant, and pgvector inside Postgres are the common choices. For chatbot products you do not pick the database; the platform picks for you.

Hybrid retrieval is the additional lever that separates a good PDF chatbot from a mediocre one. A pure semantic retriever is great at conceptual matches and weak at exact-term lookups (model numbers, SKUs, legal clause references). A hybrid retriever runs both vector similarity and BM25 keyword search, then fuses the ranked lists. Reciprocal Rank Fusion is the standard fusion algorithm. On real PDF corpora that mix prose and identifiers (product manuals, legal documents, technical specs), hybrid retrieval consistently outperforms semantic-only.
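Reciprocal Rank Fusion itself is only a few lines: each chunk earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The sketch below assumes each retriever returns a list of chunk IDs, best first; k = 60 is the commonly used default, and the IDs in the usage note are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one, scoring each ID by sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Fusing a vector ranking `["c1", "c2", "c3"]` with a BM25 ranking `["c3", "c1", "c4"]` returns `["c1", "c3", "c2", "c4"]`: chunks that both retrievers agree on rise above chunks only one of them found.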

Step 5: test retrieval with real questions

The last step is the one most operators skip and the one that catches every problem the first four steps did not.

Write 20 questions a real reader would ask of this PDF. Mix categories: 5 factual lookups ("what is the maximum operating temperature"), 5 procedural ("how do I reset the device"), 5 conceptual ("why does the warranty exclude X"), and 5 edge cases that the PDF probably does not cover. Including questions the bot should not be able to answer is as important as the ones it should: you need to verify it says "I do not know" gracefully instead of hallucinating.

Ask all 20 through the visitor-facing widget. Save each answer with its cited page number. Grade on a three-point scale: 2 = correct and well-cited, 1 = partial, 0 = wrong or hallucinated. Score out of 40. Above 32 is production-ready. Between 24 and 32 indicates fixable chunking or content issues. Below 24 means the extraction step failed and you should re-audit the PDF.
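The grading rubric is easy to automate once you have the 20 per-question scores. The verdict strings below are paraphrases of the thresholds above, and a total of exactly 32 is treated as falling in the middle band.

```python
def grade_test_run(scores: list[int]) -> tuple[int, str]:
    """Sum 20 per-question scores (0, 1, or 2) and map the total to a verdict."""
    if len(scores) != 20:
        raise ValueError("the thresholds assume exactly 20 questions")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    total = sum(scores)
    if total > 32:
        verdict = "production-ready"
    elif total >= 24:
        verdict = "fix chunking or content"
    else:
        verdict = "re-audit extraction"
    return total, verdict
```

Storing the score list alongside the date and the retrieval settings makes it straightforward to compare runs later.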

The test set becomes a permanent regression suite. Re-run it any time you update the PDF or change retrieval settings.

Common failure modes and how to fix them

Five problems show up repeatedly on PDF chatbot audits.

Headers and footers polluting every chunk. Strip them at the cleaning step, not at retrieval time. If you cannot strip them, set a higher retrieval threshold so weakly-matched boilerplate chunks fall below the cutoff.

Multi-column layouts read in the wrong order. Naive parsers stream text left-to-right across the page width and merge columns into nonsense. The fix is a layout-aware parser: PyMuPDF, Unstructured.io, or the Azure Document Intelligence Layout model.

Tables broken across chunks. Linearize row-by-row or keep markdown tables as single chunks. Never let a chunker split a table mid-row.

Footnote interleaving. Use a layout-aware parser that detects footnotes as separate elements. Or strip footnotes entirely if they are not load-bearing for your use case.

Embedded images with caption text only. PDFs sometimes contain diagrams whose only readable content is a one-line caption. Your chatbot cannot answer questions about the diagram itself unless you run image-to-text on the image. For most chatbot use cases, captions plus surrounding paragraph text are enough.

What we deliberately did not cover (image-heavy PDFs, math-heavy STEM)

Two cases are out of scope for this guide.

Image-heavy PDFs (architectural drawings, medical imaging, photo-driven design portfolios) need vision-language models, not OCR plus retrieval. The 2026 pattern is to run each page through a multimodal model like GPT-4o or Claude Opus 4.7 at indexing time, save a structured description, and embed the description. This is a different pipeline and a different cost structure.

Math-heavy STEM PDFs (research papers, textbooks, technical specifications full of LaTeX) lose information catastrophically through naive OCR. Equations become character soup. The 2026 solution is to detect math regions with a layout model, OCR them through a specialised math model like Mathpix or Nougat, store the LaTeX source, and embed the LaTeX alongside the surrounding prose.

Both cases warrant a separate guide. For everything else (manuals, handbooks, contracts, FAQs, policy documents, product spec sheets) the five-step recipe above is the right starting point and produces a production-grade PDF chatbot in an afternoon of focused work.

Install guide

5 steps to ingest PDFs cleanly

5 steps. The audit takes about ten minutes; the full recipe is an afternoon of focused work.

  1. Audit each PDF (text-layer vs scanned, page count, language)

    Open each PDF and try to select a paragraph. If text copies cleanly, it is a text-layer PDF and parses for free. If nothing copies, it is scanned and needs OCR. Record page count and language. A spreadsheet with one row per PDF takes ten minutes and prevents the most common failure mode: uploading a scanned PDF, getting an empty index, and blaming the chatbot.

  2. Extract text with pdf-parse, or OCR for scans

    For text-layer PDFs use pdf-parse or unpdf in Node.js (PyMuPDF or pdfminer.six in Python). For scanned PDFs pick an OCR engine: Tesseract for free local use, Azure Document Intelligence at $1.50 per 1,000 pages for Read or $10 per 1,000 pages for Layout, or Unstructured.io for layout-aware extraction tuned for LLM ingestion.

  3. Clean and chunk at ~500 tokens with 50-token overlap

    Strip repeating headers and footers before chunking, or every chunk in your index ends in boilerplate. Chunk at roughly 500 tokens per chunk with 50 tokens of overlap so context does not get cut at boundaries. Treat tables as single chunks (or linearize row-by-row); never let a chunker split a table mid-row.

  4. Embed each chunk and index with hybrid retrieval

    Send each chunk through an embedding model: text-embedding-3-small for English, a multilingual model like Cohere multilingual or bge-multilingual for non-English PDFs. Store the vectors in your vector database alongside the chunk text. Enable hybrid retrieval (vector plus BM25 keyword, fused with Reciprocal Rank Fusion) for PDF corpora that mix prose with identifiers and SKUs.

  5. Test retrieval with 20 real questions and grade the answers

    Write 20 questions a real reader would ask of this PDF, mixed across factual, procedural, conceptual, and edge cases. Ask all 20 through the visitor widget. Grade each on accuracy (2 = correct, 1 = partial, 0 = wrong) and citation quality (right page cited). Score out of 40. Above 32 ships. Below 24 means re-audit the PDF and re-extract.

ChatRaj on PDF training

pdf-parse vs Azure Document Intelligence vs Unstructured.io

Three serious options for turning PDFs into chatbot-ready text. Pick by what your PDFs actually look like, not by what is fashionable.

  • Text-layer PDF support. pdf-parse: yes, primary use case; returns plain text plus metadata via pdf.js. Azure DI: yes, but using OCR pricing for what could be free. Unstructured: yes, layout-aware.
  • OCR for scanned PDFs. pdf-parse: no, returns empty string on image-only pages. Azure DI: yes, $1.50 per 1,000 pages Read tier. Unstructured: yes, via integrated OCR.
  • Table extraction quality. pdf-parse: weak, tables flatten into reading-order text. Azure DI: strong, prebuilt Layout model returns structured tables. Unstructured: strong, table elements detected explicitly.
  • Multi-column layout handling. pdf-parse: weak on three-column layouts, can interleave columns. Azure DI: strong, layout-aware reading order. Unstructured: strong, layout-aware.
  • Language support. pdf-parse: any language present in the text layer. Azure DI: 100+ languages including Indic and CJK. Unstructured: depends on chosen OCR backend.
  • Cost per 1,000 pages. pdf-parse: free, open source. Azure DI: $1.50 (Read) to $10 (Layout); free for first 500 pages per month. Unstructured: free OSS or paid hosted API.
  • Runtime support. pdf-parse: Node.js; unpdf is the modern alternative for edge runtimes. Azure DI: HTTPS API, any runtime. Unstructured: Python library plus REST API.
  • Footnote and sidebar handling. pdf-parse: poor, interleaves with paragraph text. Azure DI: detects as separate paragraphs. Unstructured: classifies as element types.
  • Best use case. pdf-parse: text-layer English PDFs, simple layouts, free workflow. Azure DI: scanned production-volume PDFs, tables matter. Unstructured: LLM ingestion pipelines, mixed file types.
  • ChatRaj default pipeline. Hybrid: pdf-parse for text-layer uploads; OCR run upstream when you have scans. Pro plan unlocks larger upload caps; Free plan covers light PDF training.
FAQ: training a chatbot on PDFs

Common PDF training questions

Run OCR before uploading. Tesseract is the free open-source option and is fine on clean single-column English scans. Azure Document Intelligence costs $1.50 per 1,000 pages for the Read tier and significantly outperforms Tesseract on multi-column and mixed-layout documents (free for the first 500 pages per month). Unstructured.io ships a layout-aware pipeline that detects tables and headings while it OCRs. Save the result as a new text-layer PDF and upload that to your chatbot.

