IVF-PQ is the workhorse approximate-nearest-neighbour index for billion-scale vector retrieval, combining two complementary techniques: an inverted file (IVF) for coarse partitioning and product quantisation (PQ) for compressed distance approximation. It is exposed in FAISS as IndexIVFPQ.
Coarse quantisation (IVF). Run k-means on a sample of the database with $n_{\text{list}}$ centroids $\{\mathbf{c}_1, \ldots, \mathbf{c}_{n_{\text{list}}}\}$. Each database vector $\mathbf{x}$ is assigned to its nearest centroid, whose index we write $q(\mathbf{x})$, and the residual $\mathbf{r} = \mathbf{x} - \mathbf{c}_{q(\mathbf{x})}$ is stored rather than $\mathbf{x}$ itself. At query time only the $n_{\text{probe}}$ centroids closest to the query $\mathbf{q}$ are examined, pruning the search space by a factor of roughly $n_{\text{list}} / n_{\text{probe}}$.
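A minimal numpy sketch of this step, assuming the centroids have already been trained by k-means (the helper name `assign_and_residual` is illustrative, not a FAISS API):

```python
import numpy as np

def assign_and_residual(X, centroids):
    """Brute-force coarse assignment. X: (N, d) database vectors,
    centroids: (n_list, d) pre-trained k-means centroids."""
    # Squared distance from every vector to every centroid: (N, n_list).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)            # q(x) for each vector
    residuals = X - centroids[assign]     # r = x - c_{q(x)}
    return assign, residuals
```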
Product quantisation (PQ). To compress the residuals, split each $d$-dimensional residual into $m$ subvectors of dimension $d/m$:
$$\mathbf{r} = (\mathbf{r}^{(1)}, \mathbf{r}^{(2)}, \ldots, \mathbf{r}^{(m)}), \quad \mathbf{r}^{(j)} \in \mathbb{R}^{d/m}.$$
For each subspace $j$ run k-means with $K$ centroids (typically $K = 256$ so each code fits in one byte), giving codebooks $\mathcal{C}^{(j)} = \{\mathbf{c}^{(j)}_1, \ldots, \mathbf{c}^{(j)}_K\}$. Encode $\mathbf{r}$ as $m$ codebook indices
$$(i_1, i_2, \ldots, i_m), \quad i_j = \arg\min_k \|\mathbf{r}^{(j)} - \mathbf{c}^{(j)}_k\|^2.$$
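A matching encoder sketch, with the same caveat that `pq_encode` and the `codebooks` layout are illustrative assumptions rather than a library API:

```python
import numpy as np

def pq_encode(residuals, codebooks):
    """residuals: (N, d); codebooks: (m, K, d/m) with K <= 256.
    Returns one byte per subspace: codes of shape (N, m)."""
    N, d = residuals.shape
    m, K, dsub = codebooks.shape
    sub = residuals.reshape(N, m, dsub)   # split into m subvectors
    codes = np.empty((N, m), dtype=np.uint8)
    for j in range(m):
        # Nearest codebook entry in subspace j for every vector.
        d2 = ((sub[:, j, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(axis=1)
    return codes
```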
A vector originally taking $4d$ bytes (float32) is compressed to $m$ bytes; with $d = 768$ and $m = 64$ this is a $48\times$ memory reduction.
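Spelling out the arithmetic behind that figure:
$$\frac{4d}{m} \;=\; \frac{4 \times 768}{64} \;=\; \frac{3072 \text{ bytes}}{64 \text{ bytes}} \;=\; 48.$$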
Asymmetric distance computation. Given the (uncompressed) query $\mathbf{q}$ and the (compressed) database vector $\mathbf{x}$, the squared Euclidean distance to the reconstruction $\hat{\mathbf{x}}$ decomposes over subspaces (in the IVF setting, $\mathbf{q}$ here denotes the query residual $\mathbf{q} - \mathbf{c}_i$ for the probed list $i$):
$$\|\mathbf{q} - \hat{\mathbf{x}}\|^2 \;=\; \sum_{j=1}^{m} \|\mathbf{q}^{(j)} - \mathbf{c}^{(j)}_{i_j}\|^2.$$
This sum is computed efficiently by precomputing, once per query, an $m \times K$ lookup table $L[j, k] = \|\mathbf{q}^{(j)} - \mathbf{c}^{(j)}_k\|^2$ and adding $m$ entries per database vector ($O(m)$ rather than $O(d)$ per comparison).
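A numpy sketch of the table-based scan (`adc_distances` is a hypothetical helper; it assumes `q_res` is the query already residualised against the probed list's centroid):

```python
import numpy as np

def adc_distances(q_res, codes, codebooks):
    """q_res: (d,) query residual; codes: (N, m) uint8 PQ codes;
    codebooks: (m, K, d/m). Returns (N,) approximate squared distances."""
    m, K, dsub = codebooks.shape
    qsub = q_res.reshape(m, dsub)
    # L[j, k] = ||q^(j) - c^(j)_k||^2, built once per query.
    L = ((qsub[:, None, :] - codebooks) ** 2).sum(axis=-1)
    # Each database vector then costs m table lookups and adds: O(m).
    return L[np.arange(m), codes].sum(axis=1)
```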
Putting it together.
Index time. Train coarse quantiser; assign each $\mathbf{x}$ to a centroid; train PQ codebooks on residuals; encode and store $(q(\mathbf{x}), \text{PQ code})$ in inverted lists.
Query time. Find the $n_{\text{probe}}$ closest coarse centroids to $\mathbf{q}$. For each, compute the corresponding query residual $\mathbf{q} - \mathbf{c}_i$, build the PQ lookup table, scan the inverted list summing PQ-approximated distances, and maintain a size-$k$ max-heap keyed on distance, evicting the root whenever a closer vector is found.
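The same pipeline through the actual FAISS API; the toy data and parameter values below are illustrative placeholders, not tuned recommendations:

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 1024, 64, 8              # nbits=8 -> K = 256
xb = np.random.rand(100_000, d).astype("float32")  # toy database
xq = np.random.rand(5, d).astype("float32")        # toy queries

quantizer = faiss.IndexFlatL2(d)                   # coarse quantiser
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)   # k-means for IVF centroids + PQ codebooks on residuals
index.add(xb)     # assign, encode, and store codes in the inverted lists

index.nprobe = 32              # inverted lists scanned per query
D, I = index.search(xq, 10)    # top-10 approximate neighbours
```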
Trade-offs.
- $n_{\text{list}}$ sets coarse-search granularity; a common heuristic is $n_{\text{list}} \approx \sqrt{N}$ for $N$ database vectors.
- $n_{\text{probe}}$ trades recall against latency: probing more lists recovers vectors missed by the coarse quantiser, at the cost of scanning more codes.
- $m$ trades memory against accuracy: more subquantisers mean longer codes but finer distance approximations.
- An optional learned rotation applied before quantisation (OPQ) decorrelates the subspaces and improves PQ accuracy; see the factory-string sketch below.
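In FAISS the OPQ rotation can be prepended via an index-factory string; the concrete numbers here are again illustrative placeholders:

```python
import faiss

d = 768
# OPQ with 64 subspaces, then IVF with 1024 lists, then a 64-byte
# PQ code (8 bits per subquantiser by default).
index = faiss.index_factory(d, "OPQ64,IVF1024,PQ64")
```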
IVF-PQ is the preferred index above $\sim 10^8$ vectors, where graph methods like HNSW become memory-prohibitive.
Discussed in:
- Chapter 14: Generative Models, Retrieval