ChatRaj
LLM internals

Context window

A context window is the hard cap on how many tokens a large language model can process in a single inference call.

Bottom line
A context window is the hard cap on how many tokens a large language model can process in a single inference call. It covers everything: system prompt, retrieved passages, conversation history, the user message, and the generated output. In 2026 the ceiling has reached 1M to 2M tokens, but bigger windows do not automatically mean better answers.
Reviewed by ··5 min read
Jump to section

What a context window actually is

A context window is the maximum number of tokens a large language model can ingest and produce in one inference call. The number you see quoted by vendors (200k, 1M, 2M) is the total budget for that single forward pass. Everything the model can "see" or "say" in that call lives inside it.

Concretely, the window holds the system prompt, every retrieved document chunk a RAG pipeline injects, the full conversation history replayed each turn, the current user message, and the response the model is about to generate. The instant a request crosses the limit, the API returns an error or silently truncates the oldest tokens, depending on the provider.

It is important to be precise about units. The window is measured in tokens, not characters or words, and the token count depends on the model's tokenizer. See tokenization for why "ChatGPT" might be one token and "ChatRaj" might be three. A rule of thumb for English is that 1,000 tokens equals roughly 750 words, but this breaks down badly for code, JSON, non-Latin scripts, and emoji.

What counts toward the context limit

Every API request bundles the following into the window:

  1. System prompt. Persona, guardrails, output format instructions, tool schemas. Often 500 to 5,000 tokens before you have written a single user message.
  2. Tool and function definitions. Each tool's JSON schema is serialized into the prompt. A chatbot with 20 tools can easily spend 8,000 tokens just describing them.
  3. Retrieved passages. In a RAG setup, the top-k chunks pulled from a vector database. Typically 2,000 to 15,000 tokens depending on chunk size and k.
  4. Conversation history. The full back-and-forth so far. This is the part that grows linearly and eats most production windows.
  5. Current user message. Usually the smallest piece.
  6. Reserved output budget. Most providers require you to reserve space for the model's response. If your window is 200,000 and you ask for max_tokens of 4,096, the input cannot exceed roughly 195,904 tokens. Output and input share the same pool.

Cost and latency scale roughly linearly with the input token count. A 500k-token prompt is not just more expensive; it also takes noticeably longer to process before the first token streams back, because the model has to compute attention over every input token. The KV cache is the engineering trick that makes this tolerable on multi-turn conversations, but it does not eliminate the underlying linear cost.

Why context window matters for AI chatbots

Window size dictates three things a chatbot operator cares about: how long a conversation can run before the model loses earlier context, how much knowledge base content you can stuff directly into the prompt, and how big a document a user can paste in for analysis.

The 2026 landscape sits in a comfortable place compared to 2023. Anthropic's Claude Sonnet and Opus models offer 200,000 tokens natively, with 1M-token contexts available for selected enterprise use cases. OpenAI's GPT-4.1 family ships with 1M tokens. Google's Gemini 2.5 family extends to 1M tokens for most tiers and 2M tokens for select configurations. Meta's Llama 3.3 70B runs at 128,000 tokens. Open-source models in the 8k to 32k range still dominate the long tail of self-hosted deployments.

For a customer-support chatbot this is more than enough headroom in almost every scenario. A 200k-token window can hold the equivalent of a 150,000-word knowledge base verbatim. A 1M-token window can hold most of a small company's entire docs site. The question is not whether you can fit everything; it is whether you should.

Bigger isn't always better: the "lost in the middle" problem

The most cited finding on long-context performance is Liu et al.'s 2023 paper "Lost in the Middle: How Language Models Use Long Contexts." The authors evaluated several state-of-the-art models on multi-document question answering and key-value retrieval tasks, placing the relevant information at different positions in a long input. The result was a striking U-shaped curve. Models attended strongly to content at the start of the context (primacy) and at the end (recency), but accuracy dropped sharply when the answer sat in the middle.

This effect has been re-validated repeatedly on newer 1M-token models. Vendors publish "needle in a haystack" benchmarks showing near-perfect single-fact retrieval, but real workloads, where the model must synthesize across multiple buried facts, still suffer middle-position degradation.

The practical consequence is that dumping a 500,000-token corpus into a 1M-token window is rarely the right move, even when the math says it fits. You pay more, you wait longer, and the model is statistically less likely to surface the buried passage it actually needs.

A focused retrieval pipeline beats a maximal prompt for most workloads. Retrieve the relevant 5,000 to 10,000 tokens of source material, place them near the end of the prompt where recency bias favors them, and let the model reason over a tight context. This is the philosophy behind modern RAG systems and the reason they remain dominant even as context windows balloon.

ChatRaj retrieves the most relevant 5 to 15 passages per query rather than stuffing the entire knowledge base into context. The savings are not just financial. They translate directly into faster first-token latency, higher answer accuracy, and easier debugging when something goes wrong.

It helps to keep three things separate. Tokenization decides what a token is; the context window decides how many of those tokens fit. The KV cache is the implementation trick that makes long contexts efficient in practice by storing attention key/value tensors so prior tokens do not need recomputation. The context window is a hard architectural limit. Tokenization is a preprocessing choice. KV caching is an inference optimization. Confusing the three leads to bad capacity planning.

FAQ

Common Context window questions

Google's Gemini 2.5 family ships up to 2,000,000 tokens on select configurations. OpenAI's GPT-4.1 family and Anthropic's Claude (in enterprise tiers) both offer 1,000,000-token windows. Most production deployments still run at 128,000 or 200,000 tokens because cost and latency favor smaller windows.

Was this helpful?

Ship your first chatbot in 60 seconds.

Sign in with Google and you'll be answering visitor questions before your coffee gets cold.

60-second setup · One-line install · Works on any site

Works on any website
SShopify
WWebflow
WPWordPress
SqSquarespace
FFramer
</>Plain HTML