What's the best embedding model in 2026?

It depends on domain and language coverage. Check the live MTEB leaderboard before locking in a choice. Common SOTA picks in mid-2026 include OpenAI text-embedding-3-large, Voyage AI's voyage-3 family (especially the variants specialised by domain), Cohere Embed v4, and open-weight options like BGE-M3 and Qwen3-Embedding.

How many dimensions should the embeddings have?

1,024 or 1,536 is usually the sweet spot between quality and storage cost. Models trained with Matryoshka representations let you store the full vector and truncate at query time, so you can experiment with 512 or 256 dims without re-embedding the corpus.
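
A minimal sketch of what query-time truncation involves, assuming the model was trained with Matryoshka representations so that the leading components carry most of the signal. The function name and the toy vector are illustrative and not part of any library. Re-normalising after truncation keeps cosine and dot-product scores on a comparable scale.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    if not 1 <= dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

# Toy 4-dim vector standing in for a stored full-size embedding.
stored = [3.0, 4.0, 12.0, 0.0]
query_time = truncate_and_renormalize(stored, 2)
print(query_time)
```

Because the full vector stays in storage, trying a different size only means calling the function with another `dims` value; whether the shorter vectors are good enough is something a retrieval eval on your own data has to confirm.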

Can I fine-tune an embedding model?

Yes. Contrastive fine-tuning on a few thousand positive pairs specific to your domain (queries plus the passages that answer them) can meaningfully lift retrieval recall on niche corpora. Hard-negative mining matters more than raw pair count.

Should I self-host or use an API?

Use a managed API (OpenAI, Cohere, Voyage) when you want convenience and top-tier quality at small-to-medium scale. Self-host BGE-M3, Nomic Embed, or a Qwen3-Embedding checkpoint when you need cost control at high volume, data residency, or full control over model versioning.

What about multilingual content?

BGE-M3, Cohere Embed v3 and v4 multilingual variants, and OpenAI's text-embedding-3 models all handle 100+ languages natively in a shared vector space, so a German query can retrieve English passages. Monolingual models will retrieve poorly across languages and should be avoided for mixed corpora.

What is an Embedding Model? (Choosing One for RAG)

What an embedding model actually is

An embedding model is a small neural network, almost always a transformer encoder, that takes a chunk of text and emits a fixed-dimension vector of floating-point numbers. A typical output is a list of 1,024 or 1,536 values. The numbers themselves are not human-readable, but they have a useful property: passages that mean similar things end up close together in vector space, while passages on different topics drift apart. "Cancel my subscription" and "How do I close my account" land near each other; "How do I close my account" and "How tall is the Eiffel Tower" do not.

That property is what makes dense retrieval possible. A user question and a candidate passage get embedded with the same model, and a similarity measure such as cosine similarity ranks passages by how close they sit to the question. Store the passage vectors in a vector database and you have a semantic search index that handles paraphrase, synonyms, and rough intent matching without any hand-tuned keywords.
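
The ranking step can be sketched in a few lines. This is an illustration only: the three-dimensional vectors are hand-written stand-ins for real model output, and `rank_passages` is a hypothetical helper, not a vector database API. A real index would use approximate nearest-neighbour search rather than scoring every passage.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_id, score) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), pid)
              for pid, vec in passage_vecs.items()]
    scored.sort(reverse=True)
    return [(pid, score) for score, pid in scored[:top_k]]

# Hand-written toy vectors; a real pipeline would get these from the embedding model.
query = [0.9, 0.1, 0.0]            # "Cancel my subscription"
passages = {
    "close_account": [0.8, 0.3, 0.1],
    "billing_faq": [0.5, 0.5, 0.2],
    "eiffel_tower": [0.0, 0.1, 0.9],
}
ranking = rank_passages(query, passages)
for pid, score in ranking:
    print(f"{pid}: {score:.3f}")
```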

The embedding model is the thing that produces vectors. The vector database is the thing that stores and searches them. Mixing those up is the most common source of confusion when teams start building RAG.

How an embedding model is trained (briefly)

Most modern embedding models are trained with contrastive learning. The recipe started with Dense Passage Retrieval (DPR) in 2020 and has been refined steadily since. You collect pairs of texts that should be close (a question and a passage that answers it, two paraphrases, a query and a clicked result) and pairs that should be far (the same question paired with an unrelated passage). The model is trained to push positives together and negatives apart, usually with an InfoNCE loss.
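
For a single query, the InfoNCE loss can be written out directly. The sketch below works on precomputed similarity scores and is not taken from any particular training codebase; the temperature default of 0.05 is an assumed value chosen for illustration.

```python
import math

def info_nce_loss(sim_positive, sim_negatives, temperature=0.05):
    """-log( exp(s_pos / t) / sum over all candidates of exp(s / t) )."""
    logits = [sim_positive / temperature]
    logits += [s / temperature for s in sim_negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]

# The positive clearly beats the negatives: loss is close to zero.
print(info_nce_loss(0.9, [0.1, 0.2]))
# A negative scores higher than the positive: loss is large.
print(info_nce_loss(0.3, [0.4, 0.5]))
```

The loss falls as the positive's similarity rises relative to the negatives, which is the "push positives together and negatives apart" behaviour described above. In practice the negatives usually come from the other examples in the same training batch.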

The interesting wrinkles are hard-negative mining (deliberately picking negatives that look superficially similar, so the model learns finer distinctions) and instruction tuning (prefixing inputs with a short task description so one model can handle search, classification, and clustering). Open families like E5 and BGE popularised instruction-tuned encoders; the BGE-M3 model added multi-vector and multilingual training on top.
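
One simple way to mine hard negatives is to rank the corpus with an existing embedder and keep the highest-scoring passages that are not labelled as positives. The sketch below assumes that approach; it is not claimed to be the procedure used by E5, BGE, or any other model family, and the passage ids and vectors are toy values.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negatives(query_vec, passage_vecs, positive_ids, n=2):
    """Return the n non-positive passages most similar to the query."""
    candidates = [(cosine(query_vec, vec), pid)
                  for pid, vec in passage_vecs.items()
                  if pid not in positive_ids]
    candidates.sort(reverse=True)
    return [pid for _, pid in candidates[:n]]

# Toy data: one labelled positive and three unlabelled passages.
query = [1.0, 0.0]
corpus = {
    "refund_policy": [0.9, 0.1],   # labelled positive
    "refund_form": [0.8, 0.2],     # looks similar but does not answer the query
    "shipping_times": [0.4, 0.6],
    "office_hours": [0.0, 1.0],
}
hard_negatives = mine_hard_negatives(query, corpus, {"refund_policy"})
print(hard_negatives)
```

A known risk with this approach is false negatives: a top-ranked passage may be an unlabelled correct answer, so mined negatives are often spot-checked or filtered before training.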

Most production models are initialised from a pretrained transformer (a small BERT or DeBERTa variant) and then fine-tuned on tens of millions of pairs. A useful mental model: the base transformer learns language, the contrastive stage teaches it what "similar" means.

Why embedding model choice matters for AI chatbots

Swap the embedding model in a working RAG pipeline and retrieval recall at the top-k you care about can move five to fifteen points on the same corpus. Swap the LLM and the answer phrasing changes, but if retrieval missed the right passage there is nothing for the LLM to ground on. That is why embedding choice gets disproportionate attention in serious RAG builds.
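
Recall at k is straightforward to measure once you have labelled queries. A minimal sketch, assuming each query comes with a ranked list of retrieved ids and a set of ids known to be relevant (the function names and the toy data are illustrative):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Toy results for two labelled queries.
runs = [
    (["a", "b", "c"], {"a", "d"}),  # finds one of two relevant passages in the top 2
    (["x", "y"], {"y"}),            # finds the only relevant passage
]
print(mean_recall_at_k(runs, k=2))
```

Running the same labelled queries through two candidate embedders and comparing the averages is what turns "model A feels better" into a number.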

A few specifics matter:

Domain match. General web models do fine on marketing copy and product docs. Specialised corpora (legal contracts, medical notes, source code) usually benefit from a model tuned to that domain. Voyage AI ships voyage-code, voyage-finance, and voyage-law variants for exactly this reason.
Language coverage. If your knowledge base has German, Japanese, and English mixed together, BGE-M3, the multilingual Cohere Embed variants, and OpenAI's text-embedding-3 models handle it natively. Monolingual models will retrieve poorly across languages.
Dimensions and storage cost. Each dimension is a 4-byte float at full precision. At 3,072 dimensions and a million chunks you are storing roughly 12 GB just for vectors before any index overhead. Smaller dims (768, 1,024) cut storage and query cost; Matryoshka representations let you store full-dim and truncate at query time with minimal quality loss.
Self-host vs API. OpenAI, Cohere, and Voyage are managed APIs: easy, metered, and your data leaves the building. BGE-M3, Nomic Embed, and Mixedbread mxbai-embed are open weights you can run on your own GPU for cost control or data residency.
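
The storage figure above is plain arithmetic and can be checked for other dimension counts. This sketch counts raw vector bytes only, at 4 bytes per value, and ignores index overhead and metadata; the function name is illustrative.

```python
def vector_storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of the vectors alone, before any index overhead."""
    return num_vectors * dims * bytes_per_value

chunks = 1_000_000
for dims in (3072, 1024, 768):
    gigabytes = vector_storage_bytes(chunks, dims) / 1e9
    print(f"{dims} dims x {chunks:,} chunks = {gigabytes:.2f} GB")
```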

ChatRaj defaults to a managed embedding API for general content; specialised corpora can swap to models tuned for the domain without rewriting the retrieval pipeline.

Picking an embedding model in 2026

The leaderboard of record is MTEB, the Massive Text Embedding Benchmark. Always check the live version before locking in a model; the top of the board moves every few months. As of mid-2026 the broad landscape looks like this:

OpenAI text-embedding-3-small (1,536 dims, around $0.02 per million tokens) is the default "good enough" managed option for general English and major languages. text-embedding-3-large (3,072 dims, around $0.13 per million tokens) is the higher quality sibling, with Matryoshka support so you can truncate dimensions.
Cohere Embed v3 (1,024 dims) and Embed v4 (configurable output dimensions), both with strong multilingual support, are the other big managed option, with explicit support for short queries vs long documents via input types.
Voyage AI's voyage-3 and voyage-3-large lead several MTEB tracks, and Voyage also ships checkpoints specialised by domain for code, finance, and legal.
Open source. BGE-M3 (multi-vector, 100+ languages), Nomic Embed, Mixedbread mxbai-embed, and the Qwen3-Embedding family are competitive open-weight options. Top open-weight models now sit at or above commercial APIs on MTEB English. Self-hosting needs a GPU and basic ops discipline, but the unit economics flip in your favour above a few hundred million tokens a month.

The decision tree is shorter than it looks. If you want one knob to turn, start with a managed API matched to your language mix, run a real retrieval eval on a few hundred labelled queries from your own product, and only switch to a self-hosted or domain-tuned model when the eval says you need to. If retrieval is still weak after that, look at fine-tuning the embedder on your own positive pairs before blaming the chunker or the LLM.

