Glossary

Embeddings APIs

Embeddings APIs are hosted endpoints that take a string and return a vector $\mathbf{v} \in \mathbb{R}^d$ such that semantically similar inputs have small cosine distance. They are the front door to modern retrieval and form the substrate of RAG, agent memory, recommendation, deduplication, classification, and anomaly detection.
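For concreteness, here is a minimal sketch of the embed-then-compare flow using the OpenAI Python SDK; other providers expose near-identical endpoints, and the example texts are purely illustrative:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a = embed("How do I reset my password?")
b = embed("Steps to recover account access")

# OpenAI embeddings are returned unit-length, so the dot product is the cosine similarity
print(f"cosine similarity: {a @ b:.3f}")
```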

Production landscape (2025)

| Provider | Model | Dim | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 (Matryoshka, truncatable) | Cheap, strong default |
| OpenAI | text-embedding-3-large | 3072 | Led MTEB among closed models through much of 2024 |
| Cohere | embed-v3 | 1024 | Multilingual, strong rerank pairing |
| Voyage AI | voyage-3 / voyage-code-3 | 1024–2048 | Domain-specialised, top of MTEB |
| Google | text-embedding-004 | 768 | Strong on the Gemini stack |
| Anthropic | n/a (recommends Voyage) | n/a | No native embeddings API; official docs point to Voyage AI |
| BGE (BAAI) | bge-m3, bge-large-en-v1.5 | 1024 | Leading open models, multilingual |
| E5 (Microsoft) | e5-mistral-7b-instruct | 4096 | LLM-based open model |
| Jina | jina-embeddings-v3 | 1024 | 8k context, multilingual |
| Nomic | nomic-embed-text-v1.5 | 768 | Fully open weights, data, and code |

Matryoshka embeddings

Modern models (OpenAI's text-embedding-3 family, Nomic, Voyage) are trained with Matryoshka representation learning: the first $k$ dimensions of a $d$-dimensional vector are themselves a valid embedding of dimension $k$. This lets developers truncate at index time:

```python
from openai import OpenAI

client = OpenAI()
full = client.embeddings.create(model="text-embedding-3-large", input=text).data[0].embedding
fast = full[:512]   # still semantically meaningful; re-normalise before cosine use
```

Truncating from 3072 to 512 dimensions buys roughly 80% storage savings at a cost of about 2% recall.
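The truncation can also be requested server-side: the text-embedding-3 models accept a `dimensions` parameter, and the returned vector comes back already re-normalised. A sketch, same assumptions as above:

```python
fast = client.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=512,  # server-side Matryoshka truncation, re-normalised for you
).data[0].embedding
```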

Training objective

Embedding models are typically bi-encoders trained with a contrastive (InfoNCE-style) loss:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+)/\tau)}{\exp(\text{sim}(q, d^+)/\tau) + \sum_{d^-} \exp(\text{sim}(q, d^-)/\tau)}$$

where $(q, d^+)$ is a positive query/document pair and $d^-$ are negatives. The temperature $\tau$ controls sharpness.

Hard-negative mining and large in-batch negatives are the dominant tricks for SOTA performance.
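As a concrete sketch, here is the in-batch-negatives version of this loss in NumPy, where every other document in the batch serves as a negative for a given query (a toy illustration, not any particular model's training code):

```python
import numpy as np

def info_nce_loss(Q, D, tau=0.05):
    """In-batch InfoNCE: Q[i] and D[i] are a positive pair; every D[j] (j != i)
    acts as a negative for Q[i]. Q, D: (batch, dim) arrays."""
    # Cosine similarity = dot product of L2-normalised vectors
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Q @ D.T / tau                      # (batch, batch) similarity matrix
    # Row-wise log-softmax; the diagonal holds the positive pairs
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
Q, D = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(info_nce_loss(Q, D))  # high for random vectors; training drives it down
```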

MTEB

The Massive Text Embedding Benchmark (MTEB; Muennighoff et al., 2022) is the canonical leaderboard: 56+ tasks covering retrieval, classification, clustering, semantic textual similarity (STS), summarisation, and re-ranking. The leaderboard is hosted on Hugging Face and updated continuously as new models are submitted.
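Running MTEB tasks against your own model takes a few lines with the `mteb` package; a sketch (task and model names illustrative, and API details vary slightly across mteb versions):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing .encode(list_of_texts) works; SentenceTransformer qualifies
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
```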

Choice guidance (2025)

  • Default closed: OpenAI text-embedding-3-small for cost, text-embedding-3-large for quality.
  • Open-source winner: bge-m3 for multilingual + multifunctional; e5-mistral-7b-instruct for max quality.
  • Code: voyage-code-3 or jina-code-v2.
  • Multilingual: bge-m3 or Cohere embed-multilingual-v3.

Production gotchas

  1. Distribution shift: model upgrades change the vector space, so you must re-embed and re-index the entire corpus.
  2. Domain mismatch: generic embeddings underperform on legal, biomedical, or code text; consider domain-specialised models.
  3. Asymmetric retrieval: some models need different prefixes for queries vs documents (query: vs passage: for E5); see the sketch after this list.
  4. Quantisation: int8 or binary quantisation cuts storage 4–32× with small recall loss; supported by all major vector DBs.
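Gotcha 3 bites silently: without the prefixes, E5 still returns vectors, just worse ones. A minimal sketch with sentence-transformers (model name per the intfloat/e5 model cards):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 was trained asymmetrically: queries and passages need different prefixes
q_emb = model.encode(["query: how to renew a passport"], normalize_embeddings=True)
p_emb = model.encode(["passage: Renewal requires form DS-82 and a recent photo."],
                     normalize_embeddings=True)

print(q_emb @ p_emb.T)  # cosine similarity, since both sides are normalised
```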

Related terms: Vector Database, Retrieval-Augmented Generation, Re-Ranking, Memory and Context Management
