IVF-PQ: Inverted File with Product Quantization

12 min

Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.



IVF-PQ is a powerful technique that combines clustering-based indexing (IVF) with vector compression (PQ) to enable searching billions of vectors on a single machine with limited memory.

Interactive IVF-PQ Visualization

Explore how IVF-PQ partitions space into clusters and compresses vectors using product quantization:

[Interactive visualization: inverted file clustering with a configurable number of clusters, plus product quantization of the vectors within each cluster]

The Billion-Scale Challenge

Storing 1 billion 768-dimensional float32 vectors requires:

$$10^9 \times 768 \times 4 \text{ bytes} = 3.072 \text{ TB}$$

IVF-PQ can reduce this to just 30-60 GB while maintaining 90%+ recall!
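
As a sanity check, the arithmetic is easy to reproduce. The sketch below assumes M=32 one-byte PQ codes per vector and ignores centroid, ID, and inverted-list overhead:

```python
# Back-of-the-envelope memory estimate (illustrative assumptions:
# 1B vectors, 768 dimensions, float32 originals, 32-byte PQ codes)
n_vectors = 1_000_000_000
dim = 768

original_bytes = n_vectors * dim * 4   # float32 = 4 bytes per component
pq_bytes = n_vectors * 32              # M=32 subvectors x 1 byte each

print(f"Original:    {original_bytes / 1e12:.3f} TB")   # 3.072 TB
print(f"IVF-PQ:      {pq_bytes / 1e9:.0f} GB")           # 32 GB plus index overhead
print(f"Compression: {original_bytes / pq_bytes:.0f}x")  # 96x
```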

How IVF-PQ Works

Step 1: Inverted File Index (IVF)

Partition the vector space into clusters using k-means:

```python
def build_ivf_index(vectors, n_clusters):
    # Train k-means on a sample
    centroids = kmeans(vectors[:100000], n_clusters)

    # Assign vectors to nearest centroid
    inverted_lists = [[] for _ in range(n_clusters)]
    for i, vector in enumerate(vectors):
        cluster_id = nearest_centroid(vector, centroids)
        inverted_lists[cluster_id].append(i)

    return centroids, inverted_lists
```

Step 2: Product Quantization (PQ)

Compress vectors by splitting into subvectors and quantizing:

$$x \in \mathbb{R}^d \rightarrow [q_1, q_2, \ldots, q_m] \in \{0, 1, \ldots, 255\}^m$$

```python
def product_quantize(vector, codebooks):
    m = len(codebooks)            # Number of subvectors
    d_sub = len(vector) // m
    codes = []
    for i in range(m):
        subvector = vector[i*d_sub:(i+1)*d_sub]
        # Find nearest codeword
        code = nearest_codeword(subvector, codebooks[i])
        codes.append(code)
    return codes  # m bytes instead of d*4 bytes
```

Step 3: Search Process

  1. Find the nprobe candidate clusters whose centroids are closest to the query
  2. Scan the compressed (PQ-coded) vectors in the selected clusters
  3. Compute approximate distances using precomputed lookup tables
  4. Return the top-k results (see the sketch below)
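
Putting the three steps together, here is a minimal NumPy sketch of the query path. It assumes the `centroids`, `inverted_lists`, and `codebooks` built in Steps 1-2, plus a `codes` array holding the m-byte PQ code of each database vector; production libraries such as Faiss vectorize the inner scan.

```python
import numpy as np

def ivf_pq_search(query, centroids, inverted_lists, codebooks, codes, k=10, nprobe=4):
    # 1. Pick the nprobe clusters whose centroids are closest to the query
    cluster_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(cluster_dists)[:nprobe]

    # 2. Pre-compute lookup tables: squared distance from each query
    #    subvector to every codeword of the matching codebook
    m = len(codebooks)
    d_sub = len(query) // m
    tables = [
        np.sum((codebooks[j] - query[j * d_sub:(j + 1) * d_sub]) ** 2, axis=1)
        for j in range(m)
    ]

    # 3. Scan compressed vectors in the probed clusters via table lookups
    results = []
    for c in probed:
        for idx in inverted_lists[c]:
            dist = sum(tables[j][codes[idx][j]] for j in range(m))
            results.append((dist, idx))

    # 4. Return the top-k results
    results.sort()
    return results[:k]
```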

Mathematical Foundation

Clustering Objective

K-means minimizes within-cluster variance:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \, \|x_i - \mu_j\|^2$$

Where:

  • $r_{ij}$ is 1 if vector $i$ belongs to cluster $j$
  • $\mu_j$ is the centroid of cluster $j$
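
For intuition, one Lloyd iteration alternates between fixing the assignments $r_{ij}$ (assignment step) and the centroids $\mu_j$ (update step), and each step can only decrease $J$. A small NumPy sketch of a single iteration (illustrative; IVF training runs several such iterations on a sample):

```python
import numpy as np

def lloyd_iteration(vectors, centroids):
    # Assignment step: r_ij = 1 for the nearest centroid, 0 otherwise
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    # Update step: each centroid moves to the mean of its assigned vectors
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = vectors[assignments == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)

    return new_centroids, assignments
```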

Quantization Error

Product quantization minimizes reconstruction error:

$$E = \mathbb{E}\!\left[\|x - \hat{x}\|^2\right] = \sum_{j=1}^{m} \mathbb{E}\!\left[\|x^{(j)} - c^{(j)}_{q_j}\|^2\right]$$

Where:

  • $x^{(j)}$ is the $j$-th subvector
  • $c^{(j)}_{q_j}$ is the assigned codeword
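
This expectation can be estimated empirically by encoding a sample, reconstructing it from the chosen codewords, and averaging the squared error. A sketch reusing the `product_quantize` helper from above (it assumes each `codebooks[j]` is a (256, d/m) array):

```python
import numpy as np

def reconstruction_error(vectors, codebooks):
    m = len(codebooks)
    errors = []
    for x in vectors:
        codes = product_quantize(x, codebooks)
        # Reconstruct x_hat by concatenating the selected codewords
        x_hat = np.concatenate([codebooks[j][codes[j]] for j in range(m)])
        errors.append(np.sum((x - x_hat) ** 2))
    # Empirical estimate of E[||x - x_hat||^2]
    return float(np.mean(errors))
```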

Asymmetric Distance Computation

For efficiency, compute distances without decompressing:

$$d(q, x)^2 \approx \sum_{j=1}^{m} \|q^{(j)} - c^{(j)}_{\mathrm{code}_j}\|^2$$

Pre-compute distance tables for each subvector of the query.

Key Parameters

IVF Parameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| nlist | √N to N/100 | Number of clusters |
| nprobe | 1-128 | Clusters searched per query |
| train_size | 30 × nlist | Training set size |
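
The rules of thumb in the table translate into a tiny helper (heuristics only, not hard requirements; `n_vectors` is the database size):

```python
import math

def suggest_ivf_params(n_vectors):
    # nlist between sqrt(N) and N/100; sqrt(N) is a common starting point
    nlist = int(math.sqrt(n_vectors))
    # At least ~30 training vectors per centroid
    train_size = 30 * nlist
    return {"nlist": nlist, "train_size": train_size}

print(suggest_ivf_params(10_000_000))  # {'nlist': 3162, 'train_size': 94860}
```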

PQ Parameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| M | 8-64 | Number of subvectors |
| nbits | 8 | Bits per subquantizer |
| codebook_size | 2^nbits | Codewords per subspace |

Performance Analysis

Memory Savings

Original size vs compressed size:

| Vectors | Dimension | Original | IVF-PQ (M=32) | Compression |
|---------|-----------|----------|---------------|-------------|
| 1M | 128 | 512 MB | 32 MB | 16× |
| 10M | 512 | 20 GB | 320 MB | 64× |
| 100M | 768 | 300 GB | 3.2 GB | 94× |
| 1B | 768 | 3 TB | 32 GB | 96× |

Recall vs Speed Tradeoff

| nprobe | Recall | QPS | Latency |
|--------|--------|--------|---------|
| 1 | 70% | 50,000 | 0.02 ms |
| 4 | 85% | 20,000 | 0.05 ms |
| 16 | 93% | 5,000 | 0.2 ms |
| 64 | 97% | 1,250 | 0.8 ms |
| 256 | 99% | 300 | 3.3 ms |

Implementation with Faiss

Basic IVF-PQ Index

```python
import faiss
import numpy as np

def create_ivf_pq_index(vectors, nlist=1024, m=32):
    d = vectors.shape[1]

    # Create quantizer (coarse quantizer)
    quantizer = faiss.IndexFlatL2(d)

    # Create IVF-PQ index
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

    # Train on subset
    train_size = max(nlist * 30, 100000)
    train_data = vectors[np.random.choice(len(vectors), train_size)]
    index.train(train_data)

    # Add vectors
    index.add(vectors)
    return index


def search_ivf_pq(index, queries, k=10, nprobe=16):
    # Set search parameters
    index.nprobe = nprobe

    # Search
    distances, indices = index.search(queries, k)
    return indices, distances


# Example usage
index = create_ivf_pq_index(database_vectors)
results, scores = search_ivf_pq(index, query_vectors)
```

Advanced Configuration

```python
# IVF-PQ with preprocessing
index = faiss.index_factory(
    d, "OPQ32,IVF1024,PQ32x8", faiss.METRIC_L2
)

# Components:
# - OPQ32:   Optimized Product Quantization rotation
# - IVF1024: 1024 clusters
# - PQ32x8:  32 subvectors, 8 bits each
```

Optimization Strategies

1. Optimized Product Quantization (OPQ)

Apply rotation to minimize quantization error:

```python
def train_opq(vectors, m):
    # Learn rotation matrix R
    R = learn_rotation_matrix(vectors, m)

    # Rotate vectors
    rotated_vectors = vectors @ R

    # Train PQ on rotated space
    pq = train_product_quantizer(rotated_vectors, m)
    return R, pq
```

2. Multi-Index Hashing

Use multiple hash tables for better recall:

```python
class MultiIndexIVF:
    def __init__(self, d, n_tables=4):
        self.tables = []
        for _ in range(n_tables):
            # Different random projections
            self.tables.append(create_ivf_index(d))

    def search(self, query, k):
        candidates = set()
        for table in self.tables:
            results = table.search(query, k * 2)
            candidates.update(results)

        # Rerank candidates
        return rerank(query, candidates, k)
```

3. GPU Acceleration

```python
# Move index to GPU
gpu_res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_res, 0, cpu_index)

# Batch search on GPU
batch_size = 1000
results = gpu_index.search(queries[:batch_size], k)
```

Comparison with Other Methods

IVF-PQ vs HNSW

| Aspect | IVF-PQ | HNSW |
|--------|--------|------|
| Memory | Very Low (compressed) | High (original vectors) |
| Build Time | Fast | Slow |
| Search Speed | Fast | Very Fast |
| Recall | Good (90-95%) | Excellent (95-99%) |
| Scalability | Billions | Millions |
| Updates | Batch preferred | Incremental |

When to Use Each

Use IVF-PQ when:

  • Dataset has billions of vectors
  • Memory is severely limited
  • Moderate recall (90%) is acceptable
  • Batch updates are feasible

Use HNSW when:

  • Need highest recall (>95%)
  • Dataset fits in memory
  • Real-time updates required
  • Latency is critical

Advanced Topics

1. Hierarchical IVF

Two-level clustering for very large datasets:

```python
class HierarchicalIVF:
    def __init__(self, d, n_coarse=100, n_fine=100):
        self.n_coarse = n_coarse
        self.n_fine = n_fine
        self.coarse_centroids = None
        self.fine_centroids = {}  # Per coarse cluster

    def build(self, vectors):
        # Level 1: Coarse clustering
        self.coarse_centroids = kmeans(vectors, self.n_coarse)

        # Level 2: Fine clustering within each coarse cluster
        for i in range(self.n_coarse):
            cluster_vectors = get_cluster_vectors(vectors, i)
            self.fine_centroids[i] = kmeans(cluster_vectors, self.n_fine)
```

2. Learned Quantization

Using neural networks for better compression:

```python
import torch
import torch.nn as nn

class LearnedPQ(nn.Module):
    def __init__(self, d, m, k):
        super().__init__()
        self.d, self.m = d, m
        self.encoders = nn.ModuleList([
            nn.Linear(d // m, k) for _ in range(m)
        ])
        self.decoders = nn.ModuleList([
            nn.Linear(k, d // m) for _ in range(m)
        ])

    def forward(self, x):
        d_sub = self.d // self.m

        # Encode subvectors
        codes = []
        for i, encoder in enumerate(self.encoders):
            subvec = x[:, i * d_sub:(i + 1) * d_sub]
            codes.append(encoder(subvec))

        # Decode for reconstruction
        reconstructed = [
            decoder(code) for code, decoder in zip(codes, self.decoders)
        ]
        return torch.cat(reconstructed, dim=1)
```

3. Filtering and Metadata

Combining vector search with attribute filtering:

```python
class FilteredIVFPQ:
    def search_with_filter(self, query, filter_func, k):
        # Increase nprobe to compensate for filtering
        self.index.nprobe = self.nprobe * 4

        # Over-retrieve (Faiss returns distances first, then indices)
        distances, candidates = self.index.search(query, k * 10)

        # Apply filter until k results are kept
        filtered_results = []
        for idx, dist in zip(candidates[0], distances[0]):
            if filter_func(self.metadata[idx]):
                filtered_results.append((idx, dist))
                if len(filtered_results) >= k:
                    break

        return filtered_results
```

Best Practices

1. Training Data Selection

  • Use a representative sample of the corpus (not an arbitrary slice)
  • Include edge cases and outliers
  • Use at least 30×nlist vectors for training (see the sketch below)
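
One simple way to satisfy these guidelines is an evenly strided sample over the whole corpus rather than its first rows, so every region of the data contributes to training. A sketch (assumes `vectors` is the full database as a NumPy array):

```python
import numpy as np

def select_training_set(vectors, nlist, min_per_centroid=30):
    n_train = max(min_per_centroid * nlist, 100_000)
    n_train = min(n_train, len(vectors))

    # Evenly strided indices cover the whole dataset, not just its head
    idx = np.linspace(0, len(vectors) - 1, n_train).astype(np.int64)
    return vectors[idx]
```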

2. Parameter Tuning

```python
def tune_ivf_pq(vectors, queries, ground_truth):
    best_params = {}
    best_score = 0

    for nlist in [256, 512, 1024, 2048]:
        for m in [8, 16, 32, 64]:
            for nprobe in [1, 4, 16, 64]:
                index = create_index(vectors, nlist, m)
                index.nprobe = nprobe

                recall = compute_recall(index, queries, ground_truth)
                qps = measure_qps(index, queries)

                # Balance recall and speed
                score = recall * np.log(qps)
                if score > best_score:
                    best_score = score
                    best_params = {'nlist': nlist, 'm': m, 'nprobe': nprobe}

    return best_params
```

3. Monitoring and Debugging

  • Track quantization error during training
  • Monitor cluster imbalance
  • Measure actual compression ratio
  • Validate recall on test set
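
Some of these checks can be scripted against a trained index. The sketch below assumes a Faiss `IndexIVFPQ` built as in the earlier examples; attribute names follow the Faiss Python bindings, so verify them against your installed version:

```python
import numpy as np

def index_health_report(index, d):
    # Cluster imbalance: sizes of the inverted lists
    list_sizes = np.array([index.invlists.list_size(i) for i in range(index.nlist)])
    print(f"cluster sizes (min/median/max): "
          f"{list_sizes.min()} / {int(np.median(list_sizes))} / {list_sizes.max()}")
    print(f"empty clusters: {(list_sizes == 0).sum()}")

    # Actual compression ratio: PQ code bytes vs. float32 originals
    ratio = (d * 4) / index.pq.code_size
    print(f"compression: {ratio:.0f}x ({index.pq.code_size} bytes per vector)")
```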

Real-World Applications

1. Image Search (Pinterest)

  • 3 billion images
  • 768d embeddings from CNN
  • IVF4096,PQ32 configuration
  • 95% recall at 100ms latency

2. Semantic Search (Spotify)

  • 100M+ songs
  • Audio embeddings
  • IVF2048,PQ64 with OPQ
  • Real-time recommendations

3. Document Retrieval (Elastic)

  • Billions of documents
  • BERT embeddings (768d)
  • Hierarchical IVF-PQ
  • Sub-second search
