What temperature actually is
Temperature is a number you pass to a large language model at inference time that controls how random the next-token choice is. It is not part of the model weights and it does not change what the model knows. It only changes how the model's raw outputs, called logits, are converted into the probability distribution that gets sampled from.
The formula is one line. Given a vector of logits z, with one entry z_i per token in the vocabulary, the model normally applies the softmax function to turn those logits into probabilities. Temperature divides the logits before the softmax:
p_i = exp(z_i / T) / sum_j exp(z_j / T)
That is the entire mechanism. T is a positive real number, usually between 0 and 2. When T is small (say 0.1), dividing by it makes the logits much larger in magnitude, which makes the softmax sharper: the top token's probability climbs toward 1 and everything else collapses toward 0. When T is large (say 1.5), dividing shrinks the logits, which flattens the softmax so even unlikely tokens get a meaningful share of probability mass. T equal to 1 leaves the original distribution unchanged.
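One way to see why dividing by T sharpens or flattens the distribution is to look at the ratio between two tokens' probabilities. The short derivation below follows directly from the formula above and uses the same symbols; the normalizing sum cancels out.

```latex
\frac{p_i}{p_j}
  = \frac{\exp(z_i / T)}{\exp(z_j / T)}
  = \exp\!\left(\frac{z_i - z_j}{T}\right)
```

So a logit gap of 3 makes token i about e^3 ≈ 20 times more likely than token j at T = 1, about e^30 times more likely at T = 0.1, and only e^1.5 ≈ 4.5 times more likely at T = 2.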
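The technique is most often associated with Hinton, Vinyals, and Dean's 2015 knowledge-distillation paper, which used a T-scaled softmax to expose the "dark knowledge" hidden in a teacher model's full probability distribution. The idea itself is older: the same functional form is the Boltzmann distribution from statistical mechanics, which is where the name "temperature" comes from. Modern LLM APIs inherited the same parameter name and the same math.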
How temperature changes the output distribution
A concrete example. Suppose the model is finishing the sentence "The capital of France is" and the top three logits are Paris (5.0), Lyon (2.0), and Marseille (1.0). At T = 1, softmax gives Paris around 91%, Lyon around 5%, Marseille around 2%, with the small remainder spread across the rest of the vocabulary. At T = 0.2, the logits are divided by 0.2 (that is, multiplied by 5), so Paris jumps to more than 99.99% and the other cities are effectively rounded to zero. At T = 2, the logits are halved and the gaps between them shrink: Paris drops to about 67%, Lyon climbs to 15%, Marseille to 9%, and the tail picks up the rest.
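The arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes, for simplicity, that Paris, Lyon, and Marseille are the only three candidates. A real vocabulary has tens of thousands of other tokens whose combined mass pulls each of these shares down a little. The function name `softmax_with_temperature` is illustrative, not taken from any library.

```python
import math

def softmax_with_temperature(logits, T):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Paris", "Lyon", "Marseille"]
logits = [5.0, 2.0, 1.0]  # the example logits from the text

for T in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    row = ", ".join(f"{tok} {p:.1%}" for tok, p in zip(tokens, probs))
    print(f"T = {T}: {row}")
```

With only three candidates, Paris comes out near 94% at T = 1 and near 74% at T = 2, a few points higher than figures that account for the rest of the vocabulary. The ratios between the three cities are the same either way.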
That last case is why high temperatures feel "creative." The model is not generating different ideas; it is just sampling further down its own ranked list of candidates. You get more variety, but also more nonsense and more hallucination, because the long tail of low-probability tokens includes plenty of word choices that are grammatical but wrong.
Most commercial APIs cap the range. OpenAI accepts temperatures from 0 to 2 with a default of 1. Anthropic accepts 0 to 1 with a default of 1 and rejects anything higher. Google's Gemini API documents a 0 to 2 range with a default of 1 for the Gemini 2.5 and 3 model families, though older Gemini endpoints clamped at 1. ChatRaj defaults to a low temperature for RAG answers so the bot stays grounded in retrieved content rather than ad-libbing from the model's pretraining memory.
Temperature is usually tuned alongside top-p (nucleus) sampling, a different mechanism that truncates the candidate list to the smallest set whose cumulative probability reaches p. The two do different jobs: temperature reshapes the whole distribution, while top-p chops off the tail. Most provider guidance is to tune one and leave the other near its default rather than aggressively moving both.
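How the two knobs combine can be sketched in a toy sampler. The order used here (temperature first, then top-p truncation) is a common convention in open-source samplers and is an assumption; a given provider may implement the pipeline differently. `sample_next_token` is a hypothetical helper name.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index: temperature-scaled softmax, then top-p truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # 1. Temperature reshapes the whole distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-p keeps the smallest set of top-ranked tokens whose
    #    cumulative probability reaches top_p, and drops the rest.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 3. Sample among the survivors; random.choices normalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Index 0 = Paris, 1 = Lyon, 2 = Marseille (logits from the earlier example).
print(sample_next_token([5.0, 2.0, 1.0], temperature=2.0, top_p=0.8))
```

In that last call, the least likely of the three candidates can never be chosen: the top two tokens already cover more than 80% of the mass, so the third is cut before sampling.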
Why temperature matters for AI chatbots
For a customer-facing chatbot, temperature is the lever that decides whether your bot sounds reliable or reckless. A grounded retrieval bot that quotes your help-center articles should sit at a low temperature, somewhere in the 0.0 to 0.3 range. The model has retrieved evidence in its context window and the job is to phrase it back accurately. There is no upside to creative paraphrasing, and there is a real downside if the model invents a refund policy that does not exist.
A marketing copy generator or a brainstorming assistant lives at the opposite end. There you want variety on every regeneration, so 0.8 to 1.2 is typical. Code generation sits back at the low end (0.0 to 0.2) because there are usually only a few correct completions and many almost-right ones that will not compile.
Rough working defaults for product builders:
- Code or SQL generation: T = 0.0 to 0.2
- Grounded RAG answers and factual Q&A: T = 0.0 to 0.3
- General conversational assistant: T = 0.5 to 0.8
- Creative writing, brainstorming, taglines: T = 0.8 to 1.2
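In application code, defaults like these often end up as a small lookup table. The sketch below picks one illustrative value from inside each band (the specific picks are assumptions, not recommendations from any provider) and clamps the result to the provider ranges quoted earlier in this article, which should be verified against current documentation. All names are hypothetical.

```python
# One illustrative value per use case, chosen from inside the bands above.
PRESETS = {
    "code": 0.1,
    "rag": 0.25,
    "chat": 0.65,
    "creative": 1.0,
}

# Accepted ranges as quoted earlier in this article (check current docs).
PROVIDER_RANGE = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
    "gemini": (0.0, 2.0),
}

def temperature_for(task, provider, override=None):
    """Return the preset (or an explicit override), clamped to the provider's range."""
    low, high = PROVIDER_RANGE[provider]
    value = PRESETS[task] if override is None else override
    return min(max(value, low), high)

print(temperature_for("rag", "openai"))                      # preset, unchanged
print(temperature_for("creative", "anthropic", override=1.2))  # clamped to the cap
```

Clamping instead of passing the value through is a design choice: it avoids a rejected request when the same setting is reused across providers with different caps.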
Temperature is distinct from prompt engineering. The prompt is the input the model reads; temperature is a sampler knob applied after the model has produced its logits. You can have a perfectly written prompt and still get wild outputs if T is set high, and you can have a sloppy prompt that is rescued from variance by setting T to 0.
Temperature 0 is not deterministic (the gotcha)
The single most common misconception about temperature is that T = 0 guarantees identical output across calls. It does not, and most blog posts skip this.
T = 0 cannot be plugged into the formula directly, since it would divide by zero. Implementations treat it as the limit of T going to 0, where the softmax collapses to an argmax: the highest-logit token wins every time. This is called greedy decoding. In theory that should produce byte-identical output on every call with the same prompt. In practice it does not, because the major providers run inference in batched, mixed-precision kernels on GPUs, and floating-point addition is not associative. Two runs that batch your request alongside different neighbors can sum the same numbers in a different order, which produces different rounding in the final logits, and a tiny rounding shift can flip which token is the argmax. Once one token differs, the rest of the generation diverges.
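The rounding effect is easy to reproduce on a CPU with ordinary Python floats. The sketch below is a toy illustration, not a model of any real GPU kernel: the three partial values, the rival logit, and the token names are all made up to create a near-tie.

```python
# Three partial results that a kernel might add up to form one token's logit.
parts = [0.1, 0.2, 0.3]

# Same numbers, two summation orders.
left_to_right = (parts[0] + parts[1]) + parts[2]
right_to_left = parts[0] + (parts[1] + parts[2])

def greedy_pick(tokens, logits):
    """Return the token with the highest logit (the first one wins a tie)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]

tokens = ["yes", "no"]
rival_logit = 0.6  # hypothetical near-tied competitor for the "yes" token

print(left_to_right == right_to_left)
print(greedy_pick(tokens, [rival_logit, left_to_right]))
print(greedy_pick(tokens, [rival_logit, right_to_left]))
```

On IEEE 754 doubles the two sums differ in the last bit, and that is enough for the two greedy picks to disagree. Real logits are rarely this close, which is why drift at temperature 0 is occasional and not constant.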
OpenAI and Anthropic both document this behavior. OpenAI's answer, in the Chat Completions API, is the seed parameter, which gives best-effort reproducibility: same prompt plus same seed plus same model snapshot will usually give the same output, but this is not contractually guaranteed. The OpenAI API also returns a system_fingerprint you can check to detect when the backend has changed underneath you. Anthropic's Messages API, at the time of writing, does not expose an equivalent seed parameter. If you need reproducibility for evals or regression tests, the realistic move is to set temperature to 0, pass a fixed seed where the provider supports one, log the fingerprint, and accept that occasional drift is part of the deal.
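For evals, it can help to measure the drift instead of assuming it away. The helper below is a provider-agnostic sketch: `generate` is a hypothetical callable you supply, assumed to send the prompt with temperature 0 (and a fixed seed where available) and to return a `(text, fingerprint)` tuple. Nothing here calls a real API.

```python
from collections import Counter

def check_drift(generate, prompt, runs=5):
    """Call generate(prompt) several times and summarize how often outputs differ."""
    outputs = Counter()
    fingerprints = set()
    for _ in range(runs):
        text, fingerprint = generate(prompt)
        outputs[text] += 1
        fingerprints.add(str(fingerprint))  # str() so a missing value still sorts
    return {
        "distinct_outputs": len(outputs),
        "most_common_share": outputs.most_common(1)[0][1] / runs,
        "fingerprints": sorted(fingerprints),
    }

# Stand-in for a real client call, used only to show the shape of the result.
def fake_generate(prompt):
    return ("Paris", "fp_demo")

print(check_drift(fake_generate, "The capital of France is"))
```

More than one distinct output, or more than one fingerprint across runs, is a signal that a regression test should compare against a tolerance or a set of accepted answers and not a single golden string.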