AI Chatbot Glossary
35 plain-English definitions across retrieval, LLM internals, chatbot architecture, and chunking.
Retrieval & search (10)
- BM25
BM25 ranks documents by keyword relevance using term frequency, inverse document frequency, and length normalization. The default ranking function in Lucene and Elasticsearch.
- TF-IDF
TF-IDF scores how distinctive a word is in a document, weighing how often it appears against how rare it is across the corpus. The 50-year-old foundation of keyword search.
- Cosine similarity
Cosine similarity measures the cosine of the angle between two vectors, scoring semantic closeness regardless of vector magnitude. The default similarity metric for embedding-based search.
- Dense retrieval
Dense retrieval encodes queries and documents as fixed-dimension vectors and finds matches by vector similarity, surfacing semantically related content that keyword search misses.
- Sparse retrieval
Sparse retrieval represents documents as high-dimensional vectors with mostly zero values, indexed in an inverted index for fast keyword lookup. The category that includes BM25, TF-IDF, and SPLADE.
- Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) combines multiple ranked lists into one by summing 1/(k+rank) across systems. The standard fusion algorithm for hybrid search.
- ColBERT
ColBERT is a late-interaction retrieval model that stores per-token embeddings and computes relevance via MaxSim, capturing finer-grained matches than single-vector dense retrieval at the cost of a larger index.
- Hybrid search
Hybrid search runs keyword and semantic retrieval in parallel and fuses their rankings. The default architecture for modern RAG, and it often beats either approach alone on benchmarks such as BEIR.
- Vector database
A vector database indexes high-dimensional embeddings for fast nearest-neighbor search. Powers RAG, semantic search, and recommendation systems. Often Postgres with pgvector is enough.
- Embedding model
An embedding model converts text into a fixed-dimension vector that encodes meaning. The choice between OpenAI, Cohere, Voyage AI, and open-source BGE drives RAG quality.
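As a rough illustration of the BM25, cosine similarity, and Reciprocal Rank Fusion entries above, the sketch below ranks a tiny corpus by keyword score, ranks it again by similarity over toy vectors, and fuses the two rankings. The corpus, the hand-written two-dimensional "embeddings", and the function names are all hypothetical. The values k1=1.5, b=0.75, and k=60 are commonly used defaults rather than values taken from this glossary, and the IDF form shown is one common variant among several.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with one common BM25 variant."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches to score well.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rrf(rankings, k=60):
    """Fuse ranked lists of document ids by summing 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Toy corpus (hypothetical) and a keyword ranking from BM25.
docs = [
    "reset your password from the account page".split(),
    "billing and invoices are under account settings".split(),
    "forgot login credentials recover access".split(),
]
query = "reset password".split()
keyword_scores = bm25_scores(query, docs)
keyword_rank = sorted(range(len(docs)), key=lambda i: keyword_scores[i], reverse=True)

# Hand-written stand-ins for embeddings; a real system would call an embedding model.
query_vec = [0.9, 0.1]
doc_vecs = [[0.7, 0.3], [0.1, 0.9], [0.9, 0.1]]
semantic_scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
semantic_rank = sorted(range(len(docs)), key=lambda i: semantic_scores[i], reverse=True)

# Hybrid search: fuse the keyword and semantic rankings.
hybrid_rank = rrf([keyword_rank, semantic_rank])
```

Because RRF only looks at ranks, it needs no score calibration between the keyword and vector systems, which is one reason it is a popular choice for hybrid search.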
LLM internals (10)
- Large language model (LLM)
A large language model is a transformer-based neural network trained on trillions of tokens to predict the next token. The class includes GPT, Claude, Gemini, and Llama.
- Tokenization
Tokenization splits text into subword units that LLMs process. Byte-Pair Encoding is the dominant algorithm; cost is billed per token and context length is measured in tokens.
- Context window
The context window is the maximum number of tokens an LLM can process per request, including system prompt, retrieved passages, conversation history, and output.
- Temperature
Temperature is a sampling hyperparameter that scales the logits before the softmax. Lower values produce predictable output; higher values produce more diverse, creative output.
- Top-p sampling
Top-p sampling restricts the next-token candidate pool to the smallest set whose cumulative probability exceeds p. Adapts dynamically to the model's confidence at each step.
- Reinforcement learning from human feedback (RLHF)
RLHF aligns language models with human preferences using a reward model trained on pairwise comparisons. The post-pretraining step that made ChatGPT, Claude, and Gemini usable.
- Fine-tuning
Fine-tuning continues training a pretrained model on a smaller domain dataset. LoRA makes it cheap. For most chatbots, retrieval-augmented generation is the better starting point.
- Quantization
Quantization compresses LLM weights from FP16 to INT8 or INT4, shrinking memory and speeding inference at modest quality cost. At INT4 a 70B model needs roughly 35 GB for its weights, within reach of a pair of 24 GB consumer GPUs.
- KV cache
The KV cache stores key and value tensors from past attention steps so the model doesn't recompute them each token. The mechanism behind prompt caching's 50%+ cost savings.
- Streaming response
A streaming response delivers LLM output token-by-token as it's generated, typically over Server-Sent Events, cutting perceived latency by 80%+ versus waiting for the full completion.
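To make the temperature and top-p entries above concrete, here is a minimal sketch of both steps on a made-up list of logits. The function names and the example logits are hypothetical, and real inference stacks do this on tensors rather than Python lists. Temperature must be greater than zero here, since the logits are divided by it.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_next_token(logits, temperature=1.0, p=0.9, rng=random):
    """Apply temperature, restrict to the top-p pool, then sample one token id."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

# Hypothetical logits for a four-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution
token_id = sample_next_token(logits, temperature=0.8, p=0.9, rng=random.Random(0))
```

When the model is confident, a single token may already reach p and the pool collapses to one candidate; when it is unsure, the pool widens, which is the adaptive behaviour the top-p entry describes.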
Application & chatbot architecture (10)
- Retrieval-augmented generation (RAG)
Retrieval-augmented generation grounds an LLM's response in retrieved passages, cutting hallucinations and enabling citations. The architecture behind most production chatbots.
- Prompt engineering
Prompt engineering shapes LLM behaviour through instruction design, examples, and structured output formats. A discipline that has survived every new model release.
- System prompt
The system prompt is the LLM's standing-orders message: its identity, scope, tone, and refusal rules. Lives in a higher-priority role than user messages.
- Function calling
Function calling lets an LLM request the execution of a developer-defined tool by emitting structured JSON. The mechanism behind chatbot actions, integrations, and agents.
- Agentic AI
Agentic AI extends LLMs with autonomous planning, tool use, and multi-step execution. The frontier between chatbots and autonomous task workers.
- AI guardrails
AI guardrails are runtime checks around LLM input and output that enforce safety, scope, format, and brand rules. Complement model-level alignment but don't replace it.
- Hallucination
A hallucination is an LLM output that is fluent but factually wrong or fabricated. The root failure mode that retrieval-augmented generation, citation grounding, and refusal patterns mitigate.
- Citation grounding
Citation grounding forces an LLM to attribute every claim to a retrieved source. The mechanism behind 'I can verify this answer' chatbots and the antidote to fabricated references.
- Prompt injection
Prompt injection tricks an LLM into following attacker-controlled instructions instead of the developer's. OWASP ranks it the #1 LLM security risk; current defenses are imperfect.
- Confidence scoring
Confidence scoring estimates how sure an LLM is about an answer. Useful for flagging low-confidence responses for human review or showing visitors a 'we're not certain' badge.
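The function-calling entry above can be sketched as a small dispatcher: the model emits structured JSON naming a tool, and the application validates the request before running anything. The JSON shape (`name`, `arguments`), the `get_order_status` tool, and `dispatch_tool_call` are hypothetical; each real provider API defines its own schema.

```python
import json

def get_order_status(order_id: str) -> dict:
    """Hypothetical developer-defined tool; a real one would query a backend."""
    return {"order_id": order_id, "status": "shipped"}

# Registry of tools the model is allowed to request, with their required arguments.
TOOLS = {
    "get_order_status": {"fn": get_order_status, "required": ["order_id"]},
}

def dispatch_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call, validate it, and run the matching function."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "tool call was not valid JSON"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return {"error": f"unknown tool: {name!r}"}

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}

    tool = TOOLS[name]
    missing = [a for a in tool["required"] if a not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}

    # Pass only the declared arguments so unexpected keys are ignored.
    return {"result": tool["fn"](**{a: args[a] for a in tool["required"]})}

# Example model output (assumed format) and its dispatch.
model_output = '{"name": "get_order_status", "arguments": {"order_id": "A123"}}'
outcome = dispatch_tool_call(model_output)
```

Returning an error dictionary instead of raising lets the application feed the failure back to the model as a tool result, and the allow-list registry doubles as a simple guardrail against requests for tools that were never defined.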
Architecture & chunking (5)
- Document chunking
Document chunking splits long documents into retrievable passages before embedding. Chunk size, boundaries, and overlap quietly determine 30%+ of RAG quality.
- Chunk stride
Chunk stride is the gap between the end of one chunk and the start of the next. A negative stride (overlap) lets retrieval surface answers that span chunk boundaries. The forgotten 10% knob in RAG.
- Reranking
Reranking re-scores retrieved candidates with a stronger model before sending them to the LLM. Lifts NDCG by 10-30% on most RAG benchmarks. Adds latency and cost.
- Cross-encoder
A cross-encoder concatenates query and document and runs them through a transformer to produce a single relevance score. Highest accuracy, too slow to retrieve over a full corpus.
- Sentence-Transformers
Sentence-Transformers is the open-source library that made embedding-based retrieval mainstream. Also refers to the SBERT class of bi-encoder models it ships.
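The chunking and chunk stride entries above can be illustrated with a small sliding-window splitter. This is a minimal sketch that assumes the text is already tokenized; `chunk_tokens`, `chunk_size`, and `overlap` are hypothetical names. An overlap of n tokens corresponds to a stride of -n in this glossary's terms, so each chunk starts `chunk_size - overlap` tokens after the previous one.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing fragment
        # that is fully contained in the previous chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Ten toy tokens, chunks of four, one token shared between neighbours.
tokens = list(range(10))
overlapping = chunk_tokens(tokens, chunk_size=4, overlap=1)
disjoint = chunk_tokens(tokens, chunk_size=4, overlap=0)
```

With overlap, a sentence that straddles a boundary appears whole in at least one chunk, at the price of indexing some tokens twice.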