Tokenization

Bottom line
Tokenization is the step where a language model converts raw text into integer IDs drawn from a fixed vocabulary of subword units. Modern systems use Byte-Pair Encoding (BPE) or SentencePiece to build vocabularies of 30,000 to 200,000 tokens. Every API charge and every context limit you read about is counted in these tokens, not in words or characters.

What tokenization actually is

Tokenization is the bridge between human text and the numeric tensors a neural network actually processes. A tokenizer splits a string into pieces called tokens, then maps each piece to an integer ID drawn from a fixed vocabulary that typically holds 30,000 to 200,000 entries. The model never sees your letters. It sees a sequence of IDs, looks each one up in an embedding table, and operates on the resulting vectors.
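
As a rough illustration of that pipeline, the sketch below splits a string against a tiny vocabulary and looks each ID up in an embedding table. The vocabulary, the vectors, and the greedy longest-match rule are all assumptions made for the example; production tokenizers use learned merge rules or probabilistic segmentation, and real embedding tables hold learned values with far more dimensions.

```python
# Toy vocabulary: every entry is hypothetical and chosen only for illustration.
VOCAB = {"token": 0, "ization": 1, " is": 2, " fun": 3, "<unk>": 4}

# One small vector per vocabulary entry (made-up numbers, two dimensions).
EMBEDDINGS = [
    [0.1, 0.3],
    [0.7, 0.2],
    [0.5, 0.9],
    [0.4, 0.6],
    [0.0, 0.0],
]


def encode(text: str) -> list[int]:
    """Greedy longest-match split of text into vocabulary IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # Nothing in the vocabulary matches: emit <unk>, skip one character.
            ids.append(VOCAB["<unk>"])
            pos += 1
    return ids


ids = encode("tokenization is fun")
vectors = [EMBEDDINGS[i] for i in ids]  # what the network actually receives
print(ids)  # [0, 1, 2, 3]
```

Note that " is" and " fun" carry their leading space inside the token, so no separate whitespace token is needed.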

Tokens are usually subwords rather than whole words or single characters. The word "tokenization" might split into "token" and "ization", while a rare proper noun gets broken into three or four fragments. Common English words often map to a single token. Whitespace, punctuation, and even leading spaces are encoded as part of the token, which is why "hello" and " hello" can be different IDs in the vocabulary.

This subword approach is the compromise that makes modern large language models work. Word-level vocabularies cannot represent anything they did not see during training. Character-level vocabularies turn a paragraph into hundreds of tokens and waste model capacity. Subword tokenization gives the model a fixed vocabulary while still being able to encode any string, including typos, code, and languages it has barely seen.
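
The two failure modes can be seen in a few lines. This is a toy comparison under assumed inputs: the training sentence and the sample sentence are invented, and the word-level vocabulary is simply the set of words seen in the training sentence.

```python
# Hypothetical training text and sample; neither comes from a real corpus.
TRAINING_TEXT = "the model reads the text"
WORD_VOCAB = set(TRAINING_TEXT.split())


def word_level(text: str) -> list[str]:
    """Word-level tokenizer: any unseen word collapses to <unk>."""
    return [w if w in WORD_VOCAB else "<unk>" for w in text.split()]


def char_level(text: str) -> list[str]:
    """Character-level tokenizer: lossless, but one token per character."""
    return list(text)


sample = "the model reads tokenized text"
print(word_level(sample))       # the unseen word "tokenized" becomes <unk>
print(len(char_level(sample)))  # 30 tokens for a five-word sentence
```

A subword vocabulary sits between the two: it could encode "tokenized" as a few known pieces instead of losing it or spelling it out letter by letter.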

How BPE and SentencePiece work

Byte-Pair Encoding (BPE) was adapted for neural machine translation by Sennrich, Haddow, and Birch in 2015. The training procedure starts with every character as its own token, counts which adjacent pair appears most often in the training corpus, and merges that pair into a single new token. The merge step repeats thousands of times until the vocabulary reaches the target size. Each merge becomes a rule the tokenizer applies at inference time, building up subwords from characters in a deterministic order.
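
The training loop described above can be sketched in plain Python. This is a minimal, character-level version under simplifying assumptions: the corpus is a hypothetical word list, there is no end-of-word marker, and ties between equally frequent pairs are broken by whichever pair was counted first. Production implementations are far more optimized.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn an ordered list of merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word: str, merges: list[tuple]) -> list[str]:
    """Tokenize a word by replaying the merge rules in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus, given as repeated words to stand in for frequencies.
corpus_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus_words, num_merges=4)
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it encodes cleanly as two learned subwords, which is the property the merge procedure is designed to provide.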

GPT-2 introduced byte-level BPE, where the alphabet is the 256 raw bytes instead of Unicode characters. This makes the encoding lossless and reversible for any input, including emoji, code, and arbitrary binary. OpenAI's tiktoken library packages the byte-level BPE used by ChatGPT and GPT-4 family models. GPT-4o introduced a new encoding called o200k_base with roughly 200,000 tokens, doubling the size of the cl100k_base vocabulary used by GPT-4 and improving compression for non-English languages.
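
The byte-level starting point is easy to see with Python's built-in UTF-8 encoding. The sketch below shows only the base alphabet, before any merges are learned, and the sample string is an arbitrary choice.

```python
text = "héllo 👋"  # arbitrary sample with an accented letter and an emoji

# Every string becomes a sequence of integers in the range 0..255.
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # 7 characters, 11 bytes

# Decoding the same bytes always recovers the original string exactly.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)  # True
```

Because all 256 byte values are in the base vocabulary, no input can ever fall outside it; merges only shorten the sequence.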

SentencePiece, released by Google in 2018, is a self-contained alternative that treats the input as a raw stream of Unicode characters, whitespace included, and learns a vocabulary directly from it, with no language-specific pre-tokenization step required. It can use either BPE or the unigram language model algorithm under the hood. Llama 2 and Gemma ship SentencePiece tokenizers; Llama 3 switched to tiktoken-style BPE with a 128,256-token vocabulary, a roughly fourfold expansion from Llama 2's 32,000 entries that improved multilingual coverage and reduced the number of tokens needed to encode the same text.
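
One reason SentencePiece can skip pre-tokenization is that it escapes spaces to a visible meta symbol, "▁" (U+2581), and treats that symbol like any other character. The sketch below imitates only that convention in plain Python; it does not call the sentencepiece library, and the piece split shown is hypothetical.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses for whitespace


def escape_whitespace(text: str) -> str:
    """Add a dummy leading space, then replace every space with the marker."""
    return WS + text.replace(" ", WS)


def detokenize(pieces: list[str]) -> str:
    """Join pieces, turn markers back into spaces, drop the dummy prefix."""
    joined = "".join(pieces).replace(WS, " ")
    return joined[1:] if joined.startswith(" ") else joined


print(escape_whitespace("Hello world."))  # ▁Hello▁world.

# A hypothetical segmentation of the escaped string into subword pieces.
pieces = [WS + "Hello", WS + "wor", "ld", "."]
print(detokenize(pieces))  # Hello world.
```

Since the spaces live inside the pieces, detokenization is a plain string join with no language-specific spacing rules.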

Why tokenization matters for AI chatbots

If you operate a chatbot, tokenization is the silent line item under every cost and capacity decision you make. The context window limit advertised by a provider is measured in tokens, not characters. The per-million-token price on the pricing page is what shows up on your invoice. Latency scales with token count because each generation step produces one token at a time.

The rough rule of thumb for English is one token per four characters, or about four tokens per three words. A 500-word answer is roughly 650 to 700 tokens. A 10,000-character knowledge base chunk is roughly 2,500 tokens. These are averages and individual passages drift in both directions, but the ratio is stable enough to budget against.
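
For quick budgeting, the rule of thumb can be written as two small helpers. The ratios used here (4 characters per token, 4 tokens per 3 words) are the English averages consistent with the 500-word and 10,000-character figures above; the function names are hypothetical, and the results are estimates, not tokenizer output.

```python
CHARS_PER_TOKEN = 4        # English average, not a guarantee
TOKENS_PER_WORD = 4 / 3    # about four tokens for every three words


def estimate_tokens_from_chars(n_chars: int) -> int:
    """Rough token estimate from a character count."""
    return round(n_chars / CHARS_PER_TOKEN)


def estimate_tokens_from_words(n_words: int) -> int:
    """Rough token estimate from a word count."""
    return round(n_words * TOKENS_PER_WORD)


print(estimate_tokens_from_words(500))     # 667
print(estimate_tokens_from_chars(10_000))  # 2500
```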

Non-English text is where the math breaks. Older tokenizers like GPT-3's encoding compressed Devanagari, Arabic, Thai, and CJK scripts poorly, sometimes using two or three tokens per character. The same paragraph that costs 200 tokens in English might cost 800 in Hindi. GPT-4o's o200k_base and Llama 3's 128k vocabulary significantly closed this gap by allocating more vocabulary slots to non-Latin scripts, but the asymmetry has not disappeared. If you serve a multilingual audience, audit your token usage per language before committing to a model.
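
Part of the asymmetry comes from UTF-8 itself. A byte-level tokenizer starts from bytes, so a script whose characters take two or three bytes each needs more learned merges before it compresses as well as English. The sketch below measures only that raw starting point, using a few arbitrary greeting words; actual token counts depend on the tokenizer's learned vocabulary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


# Arbitrary sample words, one per script.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
    "Chinese": "你好",
}

for language, word in samples.items():
    print(f"{language}: {utf8_bytes_per_char(word):.1f} bytes per character")
```

Before any merges apply, the byte count is the worst-case token count, which is why a vocabulary with few entries for a script can spend two or three tokens on a single character.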

ChatRaj counts tokens at retrieval time so each visitor query stays under model context limits. When a user asks a question, the system tokenizes the query, retrieves the top relevant chunks from your indexed content, and trims the assembled prompt to leave enough headroom for the model's response, so you never hit a truncation error mid-answer.
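
The general trimming pattern can be sketched as follows. This is not ChatRaj's actual implementation: the function names, the character-based token estimate, and the stop-at-first-overflow policy are all assumptions made for illustration. In practice you would pass in a real tokenizer's counting function.

```python
def approx_tokens(text: str) -> int:
    """Stand-in counter: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def select_chunks(system: str, query: str, chunks: list[str],
                  context_limit: int, reply_headroom: int,
                  count=approx_tokens) -> list[str]:
    """Keep the highest-ranked chunks that fit, reserving room for the reply.

    `chunks` is assumed to be sorted with the most relevant chunk first.
    """
    budget = context_limit - reply_headroom - count(system) - count(query)
    kept = []
    for chunk in chunks:
        cost = count(chunk)
        if cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        budget -= cost
    return kept


# Hypothetical numbers: a 50-token window with 20 tokens reserved for the reply.
kept = select_chunks(
    system="a" * 40,                          # about 10 tokens
    query="b" * 20,                           # about 5 tokens
    chunks=["c" * 40, "d" * 40, "e" * 8],     # about 10, 10, and 2 tokens
    context_limit=50,
    reply_headroom=20,
)
print(len(kept))  # 1
```

Stopping at the first overflow keeps the retrieved context in strict relevance order; skipping oversized chunks and continuing would pack the window more tightly at the cost of that ordering.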

Counting tokens, billing, and context limits

The tiktoken library is the standard way to count OpenAI tokens before sending a request. Calling tiktoken.encoding_for_model("gpt-4o") returns the o200k_base encoder, and encoder.encode(text) gives you the exact integer list that the API will see. Hugging Face's tokenizers library does the equivalent job for open-weight models, including Llama 3, Mistral, and Gemma. Anthropic and Google expose their own counting endpoints.
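
A minimal counting helper built on those two calls might look like this. The function name and the fallback are assumptions of the sketch: if tiktoken is not installed, does not recognize the model name, or cannot load its encoding data, the helper falls back to the rough four-characters-per-token estimate instead of failing.

```python
def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken, or estimate if tiktoken is unavailable."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoder = tiktoken.encoding_for_model(model)
        return len(encoder.encode(text))
    except Exception:
        # Fallback (assumption of this sketch): about 4 characters per token.
        return max(1, round(len(text) / 4))


n = count_openai_tokens("Tokenization converts text into integer IDs.")
print(n)
```

In a billing or truncation path you would probably want the failure to surface rather than fall back silently, since the estimate can be far off for code or non-English text.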

Two practical consequences follow for builders. First, tokenization differs across providers, so the same prompt does not cost the same number of tokens on GPT-4o, Claude, and Llama 3. Switching models is not just a price comparison; it is a recount. Second, your input tokens include the system prompt, the retrieved context, the conversation history, and the user's current message. The output tokens are what the model generates. Most providers charge different rates for each. Exceeding the context window usually causes the API to reject the request; if your own code or framework trims the prompt to fit instead, it may drop the oldest content first, which can silently remove your system prompt before you notice.
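
The input and output split translates into a simple cost formula. In the sketch below the token counts and the per-million-token rates are hypothetical placeholders, not any provider's published prices; substitute the figures from your provider's pricing page.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_million: float,
                 output_rate_per_million: float) -> float:
    """Cost of one request when input and output are billed at different rates."""
    return (input_tokens * input_rate_per_million
            + output_tokens * output_rate_per_million) / 1_000_000


# Hypothetical request: everything sent to the model counts as input.
input_tokens = (
    300      # system prompt
    + 2500   # retrieved context
    + 800    # conversation history
    + 50     # user's current message
)
output_tokens = 700  # the generated answer

# Hypothetical rates in USD per million tokens.
cost = request_cost(input_tokens, output_tokens,
                    input_rate_per_million=2.50,
                    output_rate_per_million=10.00)
print(f"${cost:.6f} per request")  # $0.016125 per request
```

Even with output billed at four times the input rate in this example, the input side accounts for more than half the cost, because retrieved context and history dominate the token count.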

Token counting is not glamorous, but it is the difference between a chatbot that stays within budget and one that surprises you on the monthly invoice.

FAQ

Common tokenization questions

To count tokens for OpenAI models, install the tiktoken Python library, call encoding_for_model with the model name, and encode your text; the length of the result is the token count. For Llama and Mistral, use Hugging Face's tokenizers library with the model's published tokenizer.json. Anthropic and Google also expose token-count endpoints.
