What's the difference between an LLM and a chatbot?

The chatbot is the product layer: the widget, the conversation history, the retrieval, the routing logic. The LLM is the model underneath that generates text. One LLM can power many chatbots.

How many parameters does GPT-4 have?

OpenAI has not publicly confirmed a parameter count for GPT-4 or its successors. Independent estimates vary widely and the precise number is proprietary. Frontier proprietary models broadly sit in the hundreds of billions to low trillions of parameters, often as sparse mixture-of-experts.

Are all LLMs based on transformers?

Almost all major LLMs in 2026 are decoder-only transformers or sparse mixture-of-experts variants of that architecture. A small number of research models use state-space architectures like Mamba or hybrid attention designs, but the transformer is still the dominant shape.

Can I run an LLM locally?

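Yes. Open-weight models from 7B to 70B parameters run on consumer hardware. The small end fits on a single consumer GPU (or even on a fast CPU with quantization), while the 70B end generally needs quantization plus a high-memory or multi-GPU setup. Tools like llama.cpp, Ollama, and vLLM make local inference practical. Larger models in the 200B to 400B range typically need multi-GPU rigs, usually combined with quantized inference.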
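A rough way to check whether a model fits on your hardware is to multiply the parameter count by the bytes stored per weight. The sketch below assumes weights only; the KV cache, activations, and runtime overhead add more on top, so treat the numbers as a lower bound. The function name is illustrative, not part of any tool mentioned above.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 16-bit weights with 4-bit quantized weights for three model sizes.
for size in (7, 70, 405):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size}B: ~{fp16:.0f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

By this estimate a 7B model needs about 14 GB at 16-bit and about 3.5 GB at 4-bit, while a 70B model still needs about 35 GB at 4-bit.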

What does 'next-token prediction' mean?

Given the tokens seen so far, the model outputs a probability distribution over every possible next token in its vocabulary. A sampler picks one token from that distribution, appends it to the sequence, and the model runs again. Generation is just this loop, one token at a time.
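As a minimal sketch of one step of that process: a model emits a raw score (a logit) per vocabulary entry, a softmax turns the scores into probabilities, and a sampler draws one token. The four-word vocabulary and the logit values below are made up for illustration, and temperature scaling is one common sampler setting rather than something every system uses.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits for the prefix "The cat sat on the".
vocab = ["mat", "sofa", "roof", "pizza"]
logits = [4.0, 2.5, 1.0, -2.0]

probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the draw is repeatable
token = rng.choices(vocab, weights=probs, k=1)[0]

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print("sampled:", token)
```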

What is a Large Language Model (LLM)?

What an LLM actually is

A large language model is a neural network, built on the transformer architecture, that learns to predict the next token in a sequence. Give it the prefix "The capital of France is" and it returns a probability distribution over every token in its vocabulary, with "Paris" near the top. Sample one token, append it, repeat. That loop is the entire generation process.
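The loop itself can be sketched in a few lines. The "model" here is a hypothetical lookup table that only looks at the last token, standing in for a real transformer that conditions on the whole prefix; the table entries and probabilities are invented for the example.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token only.
TOY_MODEL = {
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"France": 0.9, "Spain": 0.1},
    "France": {"is": 1.0},
    "Spain": {"is": 1.0},
    "is": {"Paris": 0.8, "Madrid": 0.2},
    "Paris": {"<eos>": 1.0},
    "Madrid": {"<eos>": 1.0},
}


def next_token_distribution(tokens):
    """Stand-in for a forward pass: return a distribution over next tokens."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Sample one token, append it, and repeat until <eos> or the length limit."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens


print(generate(["The", "capital", "of", "France", "is"]))
```

Changing the seed can change the output, which is why the same prompt can produce different completions when sampling is enabled.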

The word "large" is doing real work in the name. Modern LLMs have billions to trillions of parameters (the numerical weights the model learns during training). In 2026, the open-weight tier spans roughly 7B to 405B parameters: Llama 3.1 ships as 8B, 70B, and 405B variants, and Llama 3.3 as a 70B model. Mistral, Qwen, and DeepSeek release models in a similar range. Proprietary frontier models like Claude Opus 4, GPT-4.1, and Gemini 2.5 do not disclose parameter counts, but estimates put them in the hundreds of billions to low trillions, often using sparse mixture-of-experts so that only a fraction of weights activate per token.
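To make "only a fraction of weights activate per token" concrete, here is a minimal sketch of top-k expert routing. It assumes one common gating scheme (pick the k highest router scores, then softmax over just those); real models differ in the details, and the eight router scores below are made up.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    # Softmax over the selected experts only, so their weights sum to 1.
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token across eight experts.
router_logits = [0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.4, 0.9]
routes = top_k_route(router_logits, k=2)
active_fraction = len(routes) / len(router_logits)

print(routes)
print(f"experts used for this token: {active_fraction:.0%}")
```

With eight experts and k=2, each token touches a quarter of the expert weights, which is how a model can have a very large total parameter count while keeping per-token compute much lower.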

Almost every major LLM in 2026 is decoder-only. That means the architecture is a stack of transformer blocks (self-attention plus feed-forward layers) reading left to right, with no separate encoder. GPT, Claude, Llama, and Gemini are all decoder-only or sparse mixture-of-experts variants of that pattern. A handful of research models experiment with state-space architectures (Mamba) or hybrid attention, but the decoder-only transformer is still the default shape.

How an LLM is trained (briefly)

Training happens in three rough stages.

1. Pretraining. Raw text scraped from the web, books, code, and licensed corpora is first run through a tokenization step that splits it into subword units the model can ingest. The model then reads trillions of those tokens with a simple objective: predict the next token. This is where the bulk of compute is spent and where general knowledge gets baked into the weights.

2. Supervised fine-tuning (SFT). The pretrained model is a decent autocomplete engine but does not yet follow instructions reliably. SFT continues training on a curated dataset of instruction-response pairs written by humans, teaching the model the format of a helpful answer. This is one form of fine-tuning.

3. Alignment. Labs use RLHF (reinforcement learning from human feedback), constitutional AI, or a mix of the two. In RLHF, humans rank model outputs, a reward model learns to predict those rankings, and the LLM is optimized against that reward. Anthropic's constitutional AI variant replaces some of the human labeling with critiques generated by the model against a written constitution.
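The reward-model step in stage 3 can be sketched with a pairwise preference loss. This assumes the commonly used Bradley-Terry style formulation, loss = -log(sigmoid(r_chosen - r_rejected)); the text above does not say which loss any particular lab uses, and the reward values are invented.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, avoid overflow in exp(-margin).
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for two answers to the same prompt.
agrees_with_humans = pairwise_preference_loss(2.0, 0.0)     # preferred answer scored higher
disagrees_with_humans = pairwise_preference_loss(0.0, 2.0)  # preferred answer scored lower

print(f"loss when ranking matches the label: {agrees_with_humans:.4f}")
print(f"loss when ranking contradicts it:   {disagrees_with_humans:.4f}")
```

Minimizing this loss pushes the reward model to score the human-preferred answer above the rejected one, which is the ranking signal the LLM is later optimized against.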

Model loss falls roughly as a power law in training compute, so each further improvement demands a multiplicative increase in compute. The Chinchilla scaling laws (Hoffmann et al., 2022) showed that for compute-optimal training, model parameters and training tokens should scale at roughly equal rates: every doubling of model size should be matched by a doubling of training data. The Chinchilla paper trained a 70B model on 1.4 trillion tokens, a ratio of 20 tokens per parameter, and beat the much larger but undertrained Gopher (280B) and GPT-3 (175B) on most benchmarks. Most modern LLMs are now trained well past the Chinchilla-optimal token count because inference cost matters more than training cost once you have millions of users.
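The Chinchilla ratio makes for quick back-of-the-envelope arithmetic. The sketch below uses the 20 tokens per parameter figure from the text, plus the widely used approximation that training costs about 6 FLOPs per parameter per token; that second rule of thumb is an assumption added here, not something stated above.

```python
TOKENS_PER_PARAM = 20  # ratio from the Chinchilla run described above


def chinchilla_tokens(params: float) -> float:
    """Compute-optimal training tokens under the 20-tokens-per-parameter heuristic."""
    return TOKENS_PER_PARAM * params


def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost, C ~ 6 * N * D (an approximation, assumed here)."""
    return 6 * params * tokens


params = 70e9
tokens = chinchilla_tokens(params)
flops = training_flops(params, tokens)

print(f"tokens: {tokens:.1e}")  # 1.4e+12, matching the 1.4 trillion in the text
print(f"FLOPs:  {flops:.2e}")
```

Under these assumptions, doubling the model size doubles the token budget and therefore roughly quadruples the training compute.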

Why LLMs matter for AI chatbots

An LLM is the engine of every modern AI chatbot. When a visitor types "Do you ship to Canada?" into a widget on your site, that question is wrapped in a prompt, fed to an LLM, and the model generates a response one token at a time. The chatbot product is the layer around the model: the widget UI, the context window management, the retrieval system that pulls relevant content from your site, and the routing logic that decides which model to call.

ChatRaj routes each user message through an LLM, then grounds the response in retrieved passages from your content so the answer cites your real pages instead of the model's pretraining memory. This is the retrieval-augmented generation pattern, and it exists because LLMs on their own hallucinate. They are excellent at fluent language but have no built-in mechanism to distinguish a real fact from a plausible-sounding invention. Retrieval gives the model evidence to ground its answer in.
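The general shape of retrieval-augmented generation can be sketched as: score your pages against the question, keep the best match, and wrap it in a prompt. This is not ChatRaj's implementation. The page contents, function names, and prompt wording are hypothetical, and the keyword-overlap scoring is a simple stand-in for the embedding search a production system would more likely use.

```python
import re

# Hypothetical site content, keyed by URL path.
PAGES = {
    "/shipping": "We ship to the United States and Canada. "
                 "Canadian orders arrive in 5 to 8 business days.",
    "/returns": "Items can be returned within 30 days of delivery for a full refund.",
    "/contact": "Reach our support team by email on weekdays.",
}


def words(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, pages, top_k=1):
    """Rank pages by how many question words they share (toy scoring)."""
    q = words(question)
    ranked = sorted(pages.items(),
                    key=lambda item: len(q & words(item[1])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages):
    """Wrap the question and retrieved passages into a single prompt string."""
    context = "\n".join(f"[{url}] {text}" for url, text in passages)
    return (
        "Answer using only the passages below and cite the page you used.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


question = "Do you ship to Canada?"
passages = retrieve(question, PAGES)
prompt = build_prompt(question, passages)
print(prompt)
```

The resulting string is what would be sent to the LLM, so the model answers from the shipping page rather than from whatever it memorized during pretraining.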

The "large" in LLM matters here because smaller models below roughly 7B parameters tend to lose coherence on multi-turn customer support conversations, especially when the retrieved context is long. The emergent capabilities that make instruction-following feel competent (chain-of-thought reasoning, format compliance, tool use) generally appear above certain compute thresholds. Below those thresholds you get a model that sounds fine on short prompts and falls apart on real workflows.

LLM vs neural network vs transformer

These three terms get used interchangeably in casual writing, but they nest inside each other.

A neural network is the broadest category: any model made of layers of weights and nonlinearities trained by gradient descent. Image classifiers, recommender systems, and protein-folding models are all neural networks.

A transformer is a specific neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Its defining feature is the self-attention mechanism, which lets each token in a sequence attend to every other token. Transformers can be used for text, images, audio, or anything that can be tokenized.
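Here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the token vectors are used directly as queries, keys, and values, whereas a real transformer applies learned projections first. The `causal` flag shows the left-to-right masking that decoder-only LLMs use, where each token attends only to itself and earlier positions. The three input vectors are made up.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values, causal=False):
    """Each output row is a weighted average of value rows, weighted by query-key similarity."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal masking, position i may only look at positions 0..i.
        limit = i + 1 if causal else len(keys)
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in range(limit)]
        weights = softmax(scores)
        row = [sum(w * values[j][c] for j, w in enumerate(weights))
               for c in range(len(values[0]))]
        outputs.append(row)
    return outputs


# Three hypothetical 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = self_attention(x, x, x)
masked = self_attention(x, x, x, causal=True)

print("full:  ", full)
print("causal:", masked)
```

In the causal case the first token can only attend to itself, so its output equals its own value vector.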

A large language model is one application of the transformer: a decoder-only transformer trained on text at very large scale. So nearly every modern LLM is a transformer and every transformer is a neural network, but the reverse implications do not hold. A 50-million-parameter transformer trained on a small corpus is a language model, just not a large one. A convolutional image classifier is a neural network but not a transformer and not an LLM.
