ChatRaj
LLM internals

Large language model (LLM)

A large language model is a transformer-based neural network with billions to trillions of parameters, trained on web-scale text to predict the next token in a sequence.

Bottom line
A large language model is a transformer-based neural network with billions to trillions of parameters, trained on web-scale text to predict the next token in a sequence. The category includes GPT, Claude, Gemini, and Llama. Scale (parameters plus training tokens) is what makes an LLM behave qualitatively differently from a smaller language model.
Reviewed by ··5 min read
Jump to section

What an LLM actually is

A large language model is a neural network, built on the transformer architecture, that learns to predict the next token in a sequence. Give it the prefix "The capital of France is" and it returns a probability distribution over every token in its vocabulary, with "Paris" near the top. Sample one token, append it, repeat. That loop is the entire generation process.

The word "large" is doing real work in the name. Modern LLMs have billions to trillions of parameters (the numerical weights the model learns during training). In 2026, the open-weight tier spans roughly 7B to 405B parameters: Llama 3.3 ships as 8B and 70B variants, with Llama 3.1 still available at 405B. Mistral, Qwen, and DeepSeek release models in a similar range. Proprietary frontier models like Claude Opus 4, GPT-4.1, and Gemini 2.5 do not disclose parameter counts, but estimates put them in the hundreds of billions to low trillions, often using sparse mixture-of-experts so that only a fraction of weights activate per token.

Almost every major LLM in 2026 is decoder-only. That means the architecture is a stack of transformer blocks (self-attention plus feed-forward layers) reading left to right, with no separate encoder. GPT, Claude, Llama, and Gemini are all decoder-only or sparse mixture-of-experts variants of that pattern. A handful of research models experiment with state-space architectures (Mamba) or hybrid attention, but the decoder-only transformer is still the default shape.

How an LLM is trained (briefly)

Training happens in three rough stages.

1. Pretraining. The model reads trillions of tokens of raw text scraped from the web, books, code, and licensed corpora. The objective is simple: predict the next token. This is where the bulk of compute is spent and where general knowledge gets baked into the weights. The text is first run through a tokenization step that splits it into subword units the model can ingest.

2. Supervised fine-tuning (SFT). The pretrained model is a decent autocomplete engine but does not yet follow instructions politely. SFT continues training on a curated dataset of instruction-response pairs written by humans, teaching the model the format of a helpful answer. This is one form of fine-tuning.

3. Alignment. Either RLHF (reinforcement learning from human feedback) or constitutional AI, depending on the lab. Humans rank model outputs, a reward model learns the ranking, and the LLM is optimized against that reward. Anthropic's constitutional AI variant replaces some of the human labeling with model-generated critiques against a written constitution.

Compute requirements grow as a power law. The Chinchilla scaling laws (Hoffmann et al., 2022) showed that for compute-optimal training, model parameters and training tokens should scale at roughly equal rates: every doubling of model size should be matched by a doubling of training data. The Chinchilla paper trained a 70B model on 1.4 trillion tokens, a ratio of 20 tokens per parameter, and beat the much larger but undertrained Gopher (280B) and GPT-3 (175B) on most benchmarks. Most modern LLMs are now trained well past Chinchilla-optimal because inference cost matters more than training cost once you have millions of users.

Why LLMs matter for AI chatbots

An LLM is the engine of every modern AI chatbot. When a visitor types "Do you ship to Canada?" into a widget on your site, that question is wrapped in a prompt, fed to an LLM, and the model generates a response one token at a time. The chatbot product is the layer around the model: the widget UI, the context window management, the retrieval system that pulls relevant content from your site, and the routing logic that decides which model to call.

ChatRaj routes each user message through an LLM, then grounds the response in retrieved passages from your content so the answer cites your real pages instead of the model's pretraining memory. This is the retrieval-augmented generation pattern, and it exists because LLMs on their own hallucinate. They are excellent at fluent language but have no built-in mechanism to distinguish a real fact from a plausible-sounding invention. Retrieval gives the model evidence to ground its answer in.

The "large" in LLM matters here because smaller models below roughly 7B parameters tend to lose coherence on multi-turn customer-support conversations, especially when the retrieved context is long. The emergent capabilities that make instruction-following feel competent (chain-of-thought reasoning, format compliance, tool use) generally appear above certain compute thresholds. Below those thresholds you get a model that sounds fine on short prompts and falls apart on real workflows.

LLM vs neural network vs transformer

These three terms get used interchangeably in casual writing, but they nest inside each other.

A neural network is the broadest category: any model made of layers of weights and nonlinearities trained by gradient descent. Image classifiers, recommender systems, and protein-folding models are all neural networks.

A transformer is a specific neural-network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Its defining feature is the self-attention mechanism, which lets each token in a sequence attend to every other token. Transformers can be used for text, images, audio, or anything that can be tokenized.

A large language model is one application of the transformer: a decoder-only transformer trained on text at very large scale. So every modern LLM is a transformer, every transformer is a neural network, but the reverse implications do not hold. A 50-million-parameter transformer trained on a small corpus is a language model, just not a large one. A convolutional image classifier is a neural network but not a transformer and not an LLM.

FAQ

Common Large language model questions

The chatbot is the product layer: the widget, the conversation history, the retrieval, the routing logic. The LLM is the model underneath that generates text. One LLM can power many chatbots.

Was this helpful?

Ship your first chatbot in 60 seconds.

Sign in with Google and you'll be answering visitor questions before your coffee gets cold.

60-second setup · One-line install · Works on any site

Works on any website
SShopify
WWebflow
WPWordPress
SqSquarespace
FFramer
</>Plain HTML