Glossary

Embeddings APIs

Embeddings APIs are hosted endpoints that take a string and return a vector $\mathbf{v} \in \mathbb{R}^d$ such that semantically similar inputs have small cosine distance. They are the front door to modern retrieval and form the substrate of RAG, agent memory, recommendation, deduplication, classification, and anomaly detection.
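For concreteness, here is a minimal sketch of the embed-then-compare flow using the OpenAI Python SDK; other providers expose near-identical endpoints, and the example texts are purely illustrative:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a = embed("How do I reset my password?")
b = embed("Steps to recover account access")

# OpenAI embeddings are returned unit-length, so the dot product is the cosine similarity
print(f"cosine similarity: {a @ b:.3f}")
```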

Production landscape (2025)

| Provider | Model | Dim | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 (Matryoshka, truncatable) | Cheap, strong default |
| OpenAI | text-embedding-3-large | 3072 | Led MTEB among closed models through much of 2024 |
| Cohere | embed-v3 | 1024 | Multilingual, strong rerank pairing |
| Voyage AI | voyage-3 / voyage-code-3 | 1024–2048 | Domain-specialised, top of MTEB |
| Google | text-embedding-004 | 768 | Strong on the Gemini stack |
| Anthropic | n/a (recommends Voyage) | n/a | No native embeddings API; official docs point to Voyage AI |
| BGE (BAAI) | bge-m3, bge-large-en-v1.5 | 1024 | Leading open models, multilingual |
| E5 (Microsoft) | e5-mistral-7b-instruct | 4096 | LLM-based open model |
| Jina | jina-embeddings-v3 | 1024 | 8k context, multilingual |
| Nomic | nomic-embed-text-v1.5 | 768 | Fully open weights, data, and code |

Matryoshka embeddings

Modern models (OpenAI's text-embedding-3 family, Nomic, Voyage) are trained with Matryoshka representation learning: the first $k$ dimensions of a $d$-dimensional vector are themselves a valid embedding of dimension $k$. This lets developers truncate at index time:

```python
from openai import OpenAI

client = OpenAI()
full = client.embeddings.create(model="text-embedding-3-large", input=text).data[0].embedding
fast = full[:512]   # still semantically meaningful; re-normalise before cosine use
```

Truncating from 3072 to 512 dimensions buys roughly 80% storage savings at a cost of about 2% recall.
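The truncation can also be requested server-side: the text-embedding-3 models accept a `dimensions` parameter, and the returned vector comes back already re-normalised. A sketch, same assumptions as above:

```python
fast = client.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=512,  # server-side Matryoshka truncation, re-normalised for you
).data[0].embedding
```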

Training objective

Embedding models are typically bi-encoders trained with a contrastive (InfoNCE-style) loss:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+)/\tau)}{\exp(\text{sim}(q, d^+)/\tau) + \sum_{d^-} \exp(\text{sim}(q, d^-)/\tau)}$$

where $(q, d^+)$ is a positive query/document pair and $d^-$ are negatives. The temperature $\tau$ controls sharpness.

Hard-negative mining and large in-batch negatives are the dominant tricks for SOTA performance.
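As a concrete sketch, here is the in-batch-negatives version of this loss in NumPy, where every other document in the batch serves as a negative for a given query (a toy illustration, not any particular model's training code):

```python
import numpy as np

def info_nce_loss(Q, D, tau=0.05):
    """In-batch InfoNCE: Q[i] and D[i] are a positive pair; every D[j] (j != i)
    acts as a negative for Q[i]. Q, D: (batch, dim) arrays."""
    # Cosine similarity = dot product of L2-normalised vectors
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Q @ D.T / tau                      # (batch, batch) similarity matrix
    # Row-wise log-softmax; the diagonal holds the positive pairs
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
Q, D = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(info_nce_loss(Q, D))  # high for random vectors; training drives it down
```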

MTEB

The Massive Text Embedding Benchmark (MTEB; Muennighoff et al., 2022) is the canonical leaderboard: 56+ tasks covering retrieval, classification, clustering, semantic textual similarity (STS), summarisation, and re-ranking. The leaderboard is hosted on Hugging Face and updated continuously as new models are submitted.
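Running MTEB tasks against your own model takes a few lines with the `mteb` package; a sketch (task and model names illustrative, and API details vary slightly across mteb versions):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing .encode(list_of_texts) works; SentenceTransformer qualifies
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
```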

Choice guidance (2025)

  • Default closed: OpenAI text-embedding-3-small for cost, text-embedding-3-large for quality.
  • Open-source winner: bge-m3 for multilingual + multifunctional; e5-mistral-7b-instruct for max quality.
  • Code: voyage-code-3 or jina-code-v2.
  • Multilingual: bge-m3 or Cohere embed-multilingual-v3.

Production gotchas

  1. Distribution shift: model upgrades change the vector space, so you must re-embed and re-index the entire corpus.
  2. Domain mismatch: generic embeddings underperform on legal, biomedical, or code text; consider domain-specialised models.
  3. Asymmetric retrieval: some models need different prefixes for queries vs documents (query: vs passage: for E5); see the sketch after this list.
  4. Quantisation: int8 or binary quantisation cuts storage 4–32× with small recall loss; supported by all major vector DBs.
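Gotcha 3 bites silently: without the prefixes, E5 still returns vectors, just worse ones. A minimal sketch with sentence-transformers (model name per the intfloat/e5 model cards):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 was trained asymmetrically: queries and passages need different prefixes
q_emb = model.encode(["query: how to renew a passport"], normalize_embeddings=True)
p_emb = model.encode(["passage: Renewal requires form DS-82 and a recent photo."],
                     normalize_embeddings=True)

print(q_emb @ p_emb.T)  # cosine similarity, since both sides are normalised
```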

Related terms: Vector Database, Retrieval-Augmented Generation, Re-Ranking, Memory and Context Management
