What top-p sampling actually is
At every generation step, a large language model produces a probability distribution over its entire vocabulary. Top-p sampling restricts the candidate pool before sampling. The recipe is short:
- Sort tokens by probability in descending order.
- Walk down the list, adding tokens to the pool until their cumulative probability is at least p.
- Renormalize the kept tokens so they sum to 1.
- Sample one token from that renormalized "nucleus".
If p is 0.9, you sample from the shortest prefix of the sorted vocabulary that covers at least 90 percent of the probability mass. Sometimes that nucleus is two tokens, sometimes two hundred. The pool size is data-driven rather than fixed.
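The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any provider's implementation: the function names `top_p_nucleus` and `sample_top_p` and the toy distribution are assumptions made up for the example.

```python
import random

def top_p_nucleus(probs, p):
    """Return {token: renormalized probability} for the smallest prefix of
    the sorted distribution whose cumulative probability is at least p."""
    # Step 1: sort tokens by probability, highest first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # Step 2: keep adding tokens until the cumulative mass reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Step 3: renormalize the kept tokens so they sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p, rng=None):
    """Step 4: draw one token from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = top_p_nucleus(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

# Toy distribution with made-up numbers (not real model output).
toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(top_p_nucleus(toy, p=0.9))   # "zebra" falls outside the nucleus
print(sample_top_p(toy, p=0.9))    # one of "the", "a", "an"
```

Real decoders work on logits for the full vocabulary and usually do this on tensors, but the logic is the same.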
The technique comes from Holtzman, Buys, Du, Forbes, and Choi in their 2019 paper "The Curious Case of Neural Text Degeneration", which showed that decoding with pure likelihood maximization produces dull, repetitive output even from strong models. Their fix was to truncate the long, unreliable tail of the distribution without locking the pool at a fixed width.
How nucleus sampling differs from top-k
Top-k sampling truncates the candidate pool at a fixed number of tokens. Setting k to 50 means you always sample from the 50 most likely tokens, regardless of how the probability mass is distributed.
That fixed cutoff is the problem. Consider two situations:
- The model is confident. The top token has 0.95 probability and the rest are noise. Top-k with k=50 still passes 49 garbage tokens into the sampler, where they get a small but real chance of being picked.
- The model is uncertain. A hundred tokens each have around 0.01 probability and any of them would be reasonable. Top-k with k=50 chops off half the legitimate candidates, and which half survives depends on tiny, essentially arbitrary differences in probability.
Top-p adapts to both cases. In the confident case, the nucleus contains just the top token (or two), so noise is excluded. In the uncertain case, the nucleus stretches to include everything that meaningfully contributes to the mass. Pure unrestricted sampling, by contrast, keeps the entire tail, which is where the incoherent, off-topic continuations Holtzman et al. documented tend to originate; the repetitive loops they describe come from the opposite extreme, likelihood-maximizing decoding.
That dynamic pool sizing is the whole pitch. Same parameter, different effective k at every step.
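To make the two situations concrete, here is a small sketch that counts how many tokens each method keeps. The distributions are synthetic stand-ins for the confident and uncertain cases described above, and the helper names are hypothetical.

```python
def top_k_pool_size(probs, k):
    """Top-k always keeps k tokens (or the whole vocabulary if it is smaller)."""
    return min(k, len(probs))

def top_p_pool_size(probs, p):
    """Count the tokens in the smallest sorted prefix whose mass reaches p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)

# Confident case: one token at 0.95, the remaining 0.05 spread over 99 tokens.
confident = [0.95] + [0.05 / 99] * 99
# Uncertain case: 100 tokens at 0.01 each.
uncertain = [0.01] * 100

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    print(
        name,
        "| top-k (k=50) pool:", top_k_pool_size(probs, 50),
        "| top-p (p=0.9) pool:", top_p_pool_size(probs, 0.9),
    )
```

With these inputs, top-k keeps 50 tokens both times, while top-p keeps a single token in the confident case and roughly 90 in the uncertain one (the exact count can be off by one because of floating-point rounding at the threshold).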
Why top-p matters for AI chatbots
AI chatbots generate text token by token. The sampler runs once per token, and the cumulative effect of those choices is what users perceive as voice, fluency, and coherence.
A nucleus that is too tight (low p, say 0.5) makes a support bot sound stilted and template-like because the sampler keeps falling back to the single most probable continuation. A nucleus that is too loose (high p, near 1.0) lets the long tail leak through, which is where hallucinated product names, made-up URLs, and weird syntactic detours come from.
A common middle ground is p between 0.9 and 0.95, but shipped defaults vary. OpenAI defaults its chat models to top_p of 1.0 (the full distribution) and leans on temperature for control, while Anthropic and Hugging Face decoding examples commonly land near 0.9. For chatbot answers grounded in retrieved documents, slightly tighter values like 0.85 can reduce drift without making the output robotic.
Top-p plus temperature: don't tune both at the same time
This is the operational rule the OpenAI API reference attaches to its sampling parameters: change top_p or change temperature, but generally not both. The reason is that both knobs reshape the same underlying probability distribution, just in different ways.
- Temperature scales the logits before softmax. Lower temperature sharpens the distribution and concentrates mass on the top tokens. Higher temperature flattens it.
- Top-p truncates the distribution after softmax. It throws away the tail past the cumulative threshold.
Lower the temperature and the nucleus naturally shrinks because the head of the distribution holds more mass. Lower top_p on top of that and you compound the effect in a way that is hard to reason about. You can end up with a near-greedy decoder while thinking you only made a small change.
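The compounding effect is easy to see on a toy example. The sketch below applies temperature to a set of hypothetical logits (a made-up, evenly spaced ramp, not real model output), then measures how many tokens the nucleus contains for a few settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Number of tokens in the smallest sorted prefix with mass >= p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)

# Hypothetical logits for a 20-token vocabulary, falling by 0.5 per rank.
logits = [5.0 - 0.5 * i for i in range(20)]

for temperature in (1.0, 0.7, 0.4):
    for p in (0.95, 0.7):
        size = nucleus_size(softmax(logits, temperature), p)
        print(f"temperature={temperature} top_p={p} -> nucleus of {size} token(s)")
```

On these particular logits, lowering either knob alone still leaves a few candidates, but temperature 0.4 combined with top_p 0.7 leaves a nucleus of one token, which is greedy decoding in all but name.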
The clean workflow is: pick one. If your team is more comfortable with temperature, leave top_p at the provider default (1.0 for OpenAI, around 0.9 for many open-source stacks) and tune temperature in the 0.0 to 1.0 range. If you prefer top_p, leave temperature at 1.0 and move p between roughly 0.7 and 0.95.
ChatRaj keeps top_p at provider defaults and tunes temperature only, per OpenAI's guidance. Each ChatRaj agent exposes a temperature slider in the dashboard, and the underlying sampling pool is left to the provider's recommended setting so the temperature change has a predictable effect.
A few practical notes for chatbot operators:
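- Top-p does not by itself make output deterministic. At p of 0.1 the nucleus collapses to a single token whenever the top token alone holds at least 10 percent of the mass, but on flatter distributions you are still sampling among several tokens. To force a single output, set temperature to 0 (greedy decoding), which always takes the most probable token instead of sampling.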
- Top-p of 1.0 is not the same as "no sampling". It still draws from the full distribution scaled by temperature.
- If you see repetitive loops, the fix is usually a repetition penalty or a frequency penalty, not nudging top_p lower. Tighter nuclei can make loops worse by removing the escape-hatch tokens.
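As a rough illustration of the first note, the sketch below checks how many tokens survive a very low p on two made-up distributions, and contrasts that with greedy decoding, which is a plain argmax. The distributions and helper names are assumptions for the example.

```python
def nucleus_size(probs, p):
    """Number of tokens in the smallest sorted prefix with mass >= p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)

def greedy_pick(probs):
    """Greedy decoding: index of the most probable token, no randomness."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Made-up distributions: one flat, one peaked.
flat = [0.04] * 25
peaked = [0.6, 0.2, 0.1, 0.05, 0.05]

print(nucleus_size(flat, 0.1))    # 3: still a random choice among three tokens
print(nucleus_size(peaked, 0.1))  # 1: the top token alone exceeds 10 percent
print(greedy_pick(peaked))        # 0: always the same token
```

A low p only behaves deterministically when the model happens to be confident; greedy decoding does not depend on the shape of the distribution.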
Used well, top-p is the sampler setting you can set once at the start of a project and forget about. That is roughly the point Holtzman et al. were making in 2019, and it still holds.