What confidence scoring actually is
Confidence scoring is a numeric estimate of how reliable a generated answer is. It is not a single value baked into the model; it is a signal you compute on top of generation, then use to decide what the bot should do next.
Operators care about it for one reason: the bot will sometimes be wrong, and you want to know which answers are most likely to be wrong before a visitor sees them. A confidence score is the lever that lets you treat a "definitely correct" answer, a "probably correct" answer, and a "the bot is guessing" answer differently from one another.
This sounds obvious. The trick is that LLMs do not natively report calibrated confidence. The number you compute has to come from somewhere, and each source measures a slightly different thing.
It also helps to be precise about what confidence scoring is not. It is not the same as a citation, which is direct evidence linking the answer to a source. It is also not the same as a hallucination check; hallucination is the failure mode you are trying to catch, and confidence scoring is one of the signals you use to catch it. Confidence is an estimate. Citations are evidence. Hallucinations are what you get when both are missing.
How to score confidence: logprobs, retrieval, self-reflection
There are three practical approaches. Most production stacks combine at least two.
1. Token logprobs. When an LLM generates a token, it picks from a probability distribution over the entire vocabulary. The log-probability of the chosen token is the model's own report of how sure it was at that step. OpenAI's chat completions API exposes this through the logprobs and top_logprobs parameters; you can return up to 20 candidate tokens at each position along with their log-probabilities. To get a sentence-level score, average the log-probabilities across the generated tokens, or take the minimum if you care about the single weakest token. A low average logprob means the model was wavering between alternatives throughout. The caveat is that logprobs measure linguistic confidence, not factual confidence. A model can be extremely sure about how to phrase a wrong answer.
2. Retrieval score. If your bot is a retrieval-augmented generation system, the retriever already returns a similarity score for each top-k passage (cosine similarity, BM25 score, or a reranker output). If the top retrieved passages all score low, it means the index did not contain a good match for the question, regardless of how confidently the LLM later phrases its answer. This is often the most useful signal for a support bot, because it directly answers the question "do we even have content on this?"
3. Self-reflection or verifier. A second LLM call (or the same model on a second pass) is asked to grade the first answer against a rubric: "On a scale of 0 to 1, how confident are you that this answer is supported by the provided context?" Anthropic's 2022 paper Language Models (Mostly) Know What They Know found that with the right framing, larger models can produce reasonably calibrated self-assessments of factual claims. The trade-off is cost: you are paying for two generations per turn instead of one, and latency roughly doubles.
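The averaging step described under token logprobs can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name logprob_confidence and the sample values are hypothetical, and it assumes you have already extracted one log-probability per generated token from the API response.

```python
import math

def logprob_confidence(token_logprobs):
    """Summarize per-token log-probabilities into answer-level scores.

    token_logprobs: one log-probability per generated token
    (assumed to be already pulled out of the model response).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    min_lp = min(token_logprobs)
    return {
        "mean_logprob": mean_lp,
        # exp(mean logprob) is the geometric mean of the token probabilities
        "mean_prob": math.exp(mean_lp),
        # probability of the single least certain token
        "min_prob": math.exp(min_lp),
    }

# Hypothetical values: one answer generated steadily, one with two shaky tokens.
steady = logprob_confidence([-0.05, -0.10, -0.02, -0.08])
wobbly = logprob_confidence([-0.05, -2.30, -0.02, -1.60])
print(steady["mean_prob"], wobbly["mean_prob"], wobbly["min_prob"])
```

The mean smooths over a single weak token, while the minimum surfaces it. Which one to threshold on is a design choice; neither tells you whether the answer is factually right.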
In practice, retrieval score is fast and cheap, logprobs are free if the model exposes them, and a verifier is the slower but typically more accurate option you reserve for high-stakes routes.
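One way to combine the three signals is a cascade that checks the cheap ones first and only pays for the verifier when it matters. The sketch below is an assumption-laden illustration: combined_confidence, the 0.3 retrieval floor, and the choice to take the minimum of the available signals are all hypothetical, and it assumes every signal has already been normalized to the 0 to 1 range (raw BM25 scores, for example, would need rescaling first).

```python
import math
from typing import Callable, Optional, Sequence

def combined_confidence(
    retrieval_scores: Sequence[float],
    token_logprobs: Optional[Sequence[float]] = None,
    verifier: Optional[Callable[[], float]] = None,
    high_stakes: bool = False,
    retrieval_floor: float = 0.3,  # assumed threshold, tune on your own data
) -> dict:
    # Cheapest signal first: the best similarity among the top-k passages.
    signals = {"retrieval": max(retrieval_scores, default=0.0)}
    if signals["retrieval"] < retrieval_floor:
        # No good match in the index; skip the more expensive checks.
        return {"score": signals["retrieval"], "signals": signals,
                "verifier_called": False}

    # Free if the model exposes logprobs: geometric mean of token probabilities.
    if token_logprobs:
        mean_lp = sum(token_logprobs) / len(token_logprobs)
        signals["logprob"] = math.exp(mean_lp)

    # Slow and costly: only run the second LLM call on high-stakes routes.
    verifier_called = False
    if high_stakes and verifier is not None:
        signals["verifier"] = verifier()
        verifier_called = True

    # Conservative combination (an assumption): trust the weakest signal.
    return {"score": min(signals.values()), "signals": signals,
            "verifier_called": verifier_called}
```

Here verifier is any zero-argument callable that returns a score, so the second model call stays out of the scoring logic and is easy to stub in tests.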
Why confidence scoring matters for AI chatbots
Without a confidence signal, every answer looks the same to the visitor. A correct answer about your refund policy and a confidently fabricated answer about a feature you do not ship are rendered in identical type, with identical authority. The bot has no way to say "I am less sure about this one."
Operators use the score for three concrete things.
The first is soft user-facing flags. Below a threshold, the bot prepends a hedge or adds a small badge: "I am not fully certain about this. Please double check or contact support." Visitors generally respond well to this; calibrated uncertainty reads as competence, not weakness.
The second is routing. Below a stricter threshold, the bot does not answer at all. It escalates to a human, opens a support ticket, or offers a callback. This is the single highest-leverage use of confidence scoring, because it converts a class of wrong answers into a class of well-handled handoffs.
The third is internal flagging for content gaps. Every low-confidence answer gets logged. An operator reviewing the log sees "we got asked about X 40 times this week and the retrieval score was below 0.3 every time" and writes a docs page for X. Confidence scoring turns the bot's weakest answers into a backlog of content to fix.
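The three uses above map onto two thresholds and a log. The following is a rough sketch under stated assumptions: the threshold values 0.7 and 0.4, the action names, and the idea that each logged question already carries a topic label are all hypothetical choices for illustration.

```python
from collections import Counter

HEDGE_THRESHOLD = 0.7     # assumed: below this, show the answer with a hedge
ESCALATE_THRESHOLD = 0.4  # assumed: below this stricter bar, do not answer

def route(score):
    """Map a confidence score to a bot action."""
    if score < ESCALATE_THRESHOLD:
        return "escalate"           # hand off to a human or open a ticket
    if score < HEDGE_THRESHOLD:
        return "answer_with_hedge"  # answer, but flag the uncertainty
    return "answer"

def content_gaps(log, min_count=2):
    """Count topics that repeatedly produced low-confidence answers.

    log: iterable of (topic, score) pairs, where topic is a label you
    assign to each question (hypothetical; use whatever tagging you have).
    """
    counts = Counter(topic for topic, score in log if route(score) != "answer")
    return [(topic, n) for topic, n in counts.most_common() if n >= min_count]

# Hypothetical week of logged questions.
week = [("sso", 0.20), ("sso", 0.25), ("refunds", 0.90), ("billing", 0.50)]
print([route(score) for _, score in week])
print(content_gaps(week))
```

The topics that content_gaps returns are the candidates for new docs pages; min_count filters out one-off questions so the backlog reflects recurring gaps.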
The calibration problem: LLMs are overconfident
A model is well-calibrated when its stated confidence matches its actual accuracy: when it says it is 80% sure, it should be right about 80% of the time across many examples. Most production LLMs are not well-calibrated. They are systematically overconfident, especially on questions outside their training distribution.
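Calibration can be measured if you have a set of past answers labeled correct or incorrect. A common summary is expected calibration error (ECE): bucket the answers by stated confidence, then take the weighted average gap between mean confidence and actual accuracy in each bucket. The sketch below assumes confidences in the 0 to 1 range and equal-width bins; the function name and sample data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Says 80% and is right 4 times out of 5: well calibrated on this sample.
calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Says 90% but is right only 3 times out of 5: overconfident.
overconfident = expected_calibration_error([0.9] * 5, [True, True, True, False, False])
print(calibrated, overconfident)
```

On these toy inputs the first score is close to zero and the second is roughly 0.3, the gap between 90% stated confidence and 60% accuracy. Real evaluations need far more than five examples per bin to be meaningful.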
RLHF (the standard fine-tuning step that makes models helpful and well-mannered) appears to aggravate the problem rather than fix it. A 2025 ICLR paper titled Taming Overconfidence in LLMs: Reward Calibration in RLHF found that reward models used during PPO training have an inherent bias toward high-confidence-sounding responses, regardless of whether those responses are actually correct. The training objective rewards sounding sure. The model learns to sound sure.
The practical consequence is that you should not trust a verbalized confidence statement ("I am 95% sure") as a reliable signal on its own. Compute the score from a more grounded source (retrieval similarity, logprobs averaged over the answer span, or a separate verifier call with a calibration rubric) and use the verbalized statement only as a UX layer for the visitor.
ChatRaj approach: ChatRaj uses retrieval score as the primary confidence signal. If the top retrieved passages from your indexed content have low similarity to the visitor's question, the bot says "I do not know" and offers to route to a human, rather than guessing from the model's parametric memory. Logprobs and a verifier check sit on top of that as secondary signals for higher-stakes routes. The result is a bot that is wrong less often, and visibly uncertain when it is uncertain. See also AI guardrails for how confidence thresholds fit into the broader safety layer.