Embedding model

Bottom line
An embedding model is a neural network that maps a piece of text to a fixed-length vector so that semantically similar passages land near each other. In a RAG chatbot, the embedding model is the single biggest lever on retrieval quality, ahead of the vector database or the LLM itself.

What an embedding model actually is

An embedding model is a small neural network, almost always a transformer encoder, that takes a chunk of text and emits a fixed-dimension vector of floating-point numbers. A typical output looks like a list of 1,024 or 1,536 values. The numbers themselves are not human-readable, but they have a useful property: passages that mean similar things end up close together in vector space, while passages on different topics drift apart. "Cancel my subscription" and "How do I close my account" land near each other; "How do I close my account" and "How tall is the Eiffel Tower" do not.

That property is what makes dense retrieval possible. A user question and a candidate passage get embedded with the same model, and a distance metric like cosine similarity ranks passages by how close they sit to the question. Store the passage vectors in a vector database and you have a semantic search index that handles paraphrase, synonyms, and rough intent matching without any hand-tuned keywords.
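The ranking step can be sketched in a few lines. This is a minimal illustration, not a production index: the three-dimensional vectors are invented stand-ins for real embeddings, and the helper names `cosine_similarity` and `rank_passages` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return (passage_id, score) pairs, most similar first."""
    scored = [
        (pid, cosine_similarity(query_vec, vec))
        for pid, vec in passage_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors invented for illustration; a real model emits
# hundreds or thousands of dimensions per passage.
passages = {
    "close_account": [0.9, 0.1, 0.0],
    "eiffel_tower": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for an embedded "Cancel my subscription"

ranking = rank_passages(query, passages)
```

A vector database does the same comparison, but with an approximate nearest-neighbour index so it does not have to score every stored passage on every query.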

The embedding model is the thing that produces vectors. The vector database is the thing that stores and searches them. Mixing those up is the most common source of confusion when teams start building RAG.

How an embedding model is trained (briefly)

Most modern embedding models are trained with contrastive learning. The recipe was popularised for retrieval by Dense Passage Retrieval (DPR) in 2020 and has been refined steadily since. You collect pairs of texts that should be close (a question and a passage that answers it, two paraphrases, a query and a clicked result) and pairs that should be far (the same question paired with an unrelated passage). The model is trained to push positives together and negatives apart, usually with an InfoNCE loss.
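To make the loss concrete, here is a rough sketch of InfoNCE with in-batch negatives in plain Python. It assumes unit-length vectors, treats every other passage in the batch as a negative, and uses a temperature of 0.05 purely as an illustrative default. Real training code would run this on a GPU framework with gradients; the function name `info_nce_loss` is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    passage_vecs[i] is the positive for query_vecs[i]; every other
    passage in the batch acts as a negative. Vectors are assumed to be
    unit-length, so the dot product equals cosine similarity.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])   # positives match
shuffled = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
```

The loss is near zero when each query sits closest to its own positive and grows when another passage in the batch scores higher, which is the "push together, pull apart" behaviour described above.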

The interesting wrinkles are hard-negative mining (deliberately picking negatives that look superficially similar, so the model learns finer distinctions) and instruction tuning (prefixing inputs with a short task description so one model can handle search, classification, and clustering). Open families like E5 and BGE popularised instruction-tuned encoders; the BGE-M3 model added multi-vector and multilingual training on top.
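One simple way to mine hard negatives, sketched below, is to score every passage against the query with the current model and keep the top-scoring ones that are not labelled positive. The vectors and passage IDs are invented for illustration, and `mine_hard_negatives` is a hypothetical helper rather than any library's API.

```python
def mine_hard_negatives(query_vec, passage_vecs, positive_ids, num_negatives=2):
    """Pick the highest-scoring passages that are NOT labelled positive."""
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), pid)
        for pid, vec in passage_vecs.items()
        if pid not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:num_negatives]]

# Toy 2-dimensional vectors, invented for illustration.
query = [1.0, 0.0]  # stands in for an embedded "How do I cancel my plan?"
passages = {
    "cancel_plan": [0.95, 0.1],   # labelled positive
    "pause_plan": [0.9, 0.3],     # looks similar, answers a different question
    "upgrade_plan": [0.6, 0.7],
    "eiffel_tower": [0.0, 1.0],   # easy negative
}

hard_negatives = mine_hard_negatives(query, passages, positive_ids={"cancel_plan"})
```

One caveat with this approach: the top-scoring "negatives" sometimes turn out to be unlabelled correct answers, so mined negatives are often filtered or spot-checked before training.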

Most production models are initialised from a pretrained transformer (a small BERT or DeBERTa variant) and then fine-tuned on tens of millions of pairs. A useful mental model: the base transformer learns language, the contrastive stage teaches it what "similar" means.

Why embedding-model choice matters for AI chatbots

Swap the embedding model in a working RAG pipeline and retrieval recall at the top-k you care about can move five to fifteen percentage points on the same corpus. Swap the LLM and the answer phrasing changes, but if retrieval missed the right passage there is nothing for the LLM to ground on. That is why embedding choice gets disproportionate attention in serious RAG builds.

A few specifics matter:

  • Domain match. General-web models do fine on marketing copy and product docs. Specialised corpora (legal contracts, medical notes, source code) usually benefit from a domain-tuned model. Voyage AI ships voyage-code, voyage-finance, and voyage-law variants for exactly this reason.
  • Language coverage. If your knowledge base mixes German, Japanese, and English, use a multilingual model: BGE-M3, Cohere's multilingual Embed models, and OpenAI's text-embedding-3 models handle this natively. Monolingual models retrieve poorly across languages.
  • Dimensions and storage cost. Each dimension is a 4-byte float at full precision. At 3,072 dimensions and a million chunks you are storing roughly 12 GB just for vectors before any index overhead. Smaller dims (768, 1,024) cut storage and query cost; Matryoshka representations let you store full-dim and truncate at query time with minimal quality loss.
  • Self-host vs API. OpenAI, Cohere, and Voyage are managed APIs: easy, metered, and your data leaves the building. BGE-M3, Nomic Embed, and Mixedbread mxbai-embed are open weights you can run on your own GPU for cost control or data residency.
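The storage arithmetic and the Matryoshka trick are both easy to sketch. The snippet below assumes float32 vectors and decimal gigabytes, and the helper names are hypothetical. Truncation like this only preserves quality for models that were trained with Matryoshka-style objectives.

```python
import math

BYTES_PER_FLOAT32 = 4

def vector_storage_gb(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = vector_storage_gb(1_000_000, 3072)   # a million chunks at 3,072 dims
small = vector_storage_gb(1_000_000, 1024)  # the same chunks at 1,024 dims

# Toy 4-dimensional vector, invented for illustration.
short_vec = truncate_and_normalize([0.6, 0.0, 0.8, 0.0], 2)
```

Renormalising after truncation keeps cosine and dot-product scores comparable across vectors that were cut to the same length.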

ChatRaj defaults to a managed embedding API for general content; specialised corpora can swap to domain-tuned models without rewriting the retrieval pipeline.

Picking an embedding model in 2026

The leaderboard of record is MTEB, the Massive Text Embedding Benchmark. Always check the live version before locking in a model; the top of the board moves every few months. As of mid-2026 the broad landscape looks like this:

  • OpenAI text-embedding-3-small (1,536 dims, around $0.02 per million tokens) is the default "good enough" managed option for general English and major languages. text-embedding-3-large (3,072 dims, around $0.13 per million tokens) is the higher-quality sibling, with Matryoshka support so you can truncate dimensions.
  • Cohere Embed v3 (1,024 dims) and v4 (1,536 dims by default, with shorter Matryoshka sizes available) are the other major managed options, with strong multilingual variants and explicit support for short queries vs long documents via input types.
  • Voyage AI voyage-3 and the voyage-3-large variant lead several MTEB tracks and ship domain-specialised checkpoints for code, finance, and legal.
  • Open source. BGE-M3 (multi-vector, 100+ languages), Nomic Embed, Mixedbread mxbai-embed, and the Qwen3-Embedding family are competitive open-weight options. Top open-source models now sit at or above commercial APIs on MTEB English. Self-hosting needs a GPU and basic ops discipline. At the API prices above, a few hundred million tokens a month costs only tens of dollars, so self-hosting tends to pay off on cost only at much higher sustained volumes, or when data residency rules out a managed API.
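A quick way to sanity-check the managed-vs-self-hosted economics is to compare your monthly API bill with what a GPU would cost you. The sketch below uses the per-token prices quoted in the list above; the GPU figure is a made-up placeholder, not a quoted price, so substitute your own.

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """API spend in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def break_even_tokens(gpu_usd_per_month, usd_per_million_tokens):
    """Monthly token volume at which API spend equals the GPU bill."""
    return gpu_usd_per_month / usd_per_million_tokens * 1_000_000

# Assumption for illustration only; not a real or quoted GPU price.
HYPOTHETICAL_GPU_USD_PER_MONTH = 500.0

cost_small = monthly_api_cost(300_000_000, 0.02)  # text-embedding-3-small price
cost_large = monthly_api_cost(300_000_000, 0.13)  # text-embedding-3-large price
break_even_large = break_even_tokens(HYPOTHETICAL_GPU_USD_PER_MONTH, 0.13)
```

Where the break-even lands depends heavily on the GPU bill you assume and on throughput, ops time, and re-embedding frequency, none of which this toy calculation models.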

The decision tree is shorter than it looks. If you want one knob to turn, start with a managed API matched to your language mix, run a real retrieval eval on a few hundred labelled queries from your own product, and only switch to self-hosted or a domain-tuned model when the eval says you need to. If retrieval is still weak after that, look at fine-tuning the embedder on your own positive pairs before blaming the chunker or the LLM.
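The retrieval eval itself does not need much machinery. A minimal sketch, assuming you already have labelled relevant passages per query and the ranked results each candidate model returned (the IDs below are invented, and the function names are hypothetical):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, labels, k):
    """Average recall@k over every labelled query."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)

# Toy labels and ranked results, invented for illustration.
labels = {"q1": ["p1"], "q2": ["p4", "p7"]}
results_model_a = {"q1": ["p1", "p2", "p3"], "q2": ["p5", "p4", "p9"]}
results_model_b = {"q1": ["p2", "p1", "p3"], "q2": ["p4", "p7", "p5"]}

recall_a = mean_recall_at_k(results_model_a, labels, k=3)
recall_b = mean_recall_at_k(results_model_b, labels, k=3)
```

Run the same labelled set against each candidate model and compare the averages at the k your chatbot actually passes to the LLM.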

FAQ

Common embedding model questions

The best embedding model depends on your domain and language coverage. Check the live MTEB leaderboard before locking in a choice. Common top picks in mid-2026 include OpenAI text-embedding-3-large, Voyage AI's voyage-3 family (especially the domain-specialised variants), Cohere Embed v4, and open-weight options like BGE-M3 and Qwen3-Embedding.
