IVF-PQ: Inverted File with Product Quantization
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with a minimal memory footprint.
IVF-PQ is a powerful technique that combines clustering-based indexing (IVF) with vector compression (PQ) to enable searching billions of vectors on a single machine with limited memory.
Interactive IVF-PQ Visualization
Explore how IVF-PQ partitions space into clusters and compresses vectors using product quantization:
The Billion-Scale Challenge
Storing 1 billion 768-dimensional float32 vectors requires roughly 3 TB of raw memory (10⁹ × 768 dimensions × 4 bytes), far more than fits in RAM on a single machine.
IVF-PQ can reduce this to just 30-60 GB while maintaining 90%+ recall!
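A quick back-of-the-envelope check of those numbers (a minimal sketch; the 32-byte code size assumes M=32 subquantizers at 8 bits each, matching the parameter tables below):

```python
n_vectors = 1_000_000_000
d = 768
float_bytes = 4

raw_bytes = n_vectors * d * float_bytes        # ~3.07 TB of raw float32 data
pq_code_bytes = 32                             # M=32 subvectors x 8 bits = 32 bytes per vector
compressed_bytes = n_vectors * pq_code_bytes   # ~32 GB of PQ codes (plus index overhead)

print(f"raw: {raw_bytes / 1e12:.2f} TB, compressed: {compressed_bytes / 1e9:.1f} GB")
```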
How IVF-PQ Works
Step 1: Inverted File Index (IVF)
Partition the vector space into clusters using k-means:
```python
import faiss
import numpy as np

def build_ivf_index(vectors, n_clusters):
    d = vectors.shape[1]
    # Train k-means on a sample of the data
    kmeans = faiss.Kmeans(d, n_clusters)
    kmeans.train(vectors[:100000])
    centroids = kmeans.centroids
    # Assign each vector to its nearest centroid
    _, assignments = kmeans.index.search(vectors, 1)
    inverted_lists = [[] for _ in range(n_clusters)]
    for i, cluster_id in enumerate(assignments.ravel()):
        inverted_lists[cluster_id].append(i)
    return centroids, inverted_lists
```
Step 2: Product Quantization (PQ)
Compress vectors by splitting into subvectors and quantizing:
```python
import numpy as np

def product_quantize(vector, codebooks):
    m = len(codebooks)            # number of subvectors
    d_sub = len(vector) // m      # dimension of each subvector
    codes = []
    for i in range(m):
        subvector = vector[i * d_sub:(i + 1) * d_sub]
        # Find the nearest codeword in this subspace's codebook
        dists = np.linalg.norm(codebooks[i] - subvector, axis=1)
        codes.append(int(np.argmin(dists)))
    return codes  # m small integers instead of d * 4 bytes of floats
```
Step 3: Search Process
- Find candidate clusters using the nprobe parameter
- Scan the compressed vectors in the selected clusters
- Compute approximate distances using pre-computed lookup tables
- Return the top-k results (a compact sketch of this flow follows below)
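A minimal sketch of the whole query path, assuming the `centroids`, `inverted_lists`, and per-vector PQ `codes` produced by the build steps above, a `codebooks` array from PQ training, and the `build_distance_table` / `adc_distance` helpers sketched in the asymmetric-distance section below:

```python
import heapq
import numpy as np

def ivf_pq_search(query, centroids, inverted_lists, codes, codebooks, nprobe=16, k=10):
    # 1. Find the nprobe clusters whose centroids are closest to the query
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probe_ids = np.argsort(centroid_dists)[:nprobe]

    # 2. Pre-compute the query's distance table once (see the ADC section below)
    table = build_distance_table(query, codebooks)

    # 3. Scan the compressed vectors in the selected clusters using table lookups
    heap = []
    for cluster_id in probe_ids:
        for vec_id in inverted_lists[cluster_id]:
            heapq.heappush(heap, (adc_distance(table, codes[vec_id]), vec_id))

    # 4. Return the top-k (distance, id) pairs
    return heapq.nsmallest(k, heap)
```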
Mathematical Foundation
Clustering Objective
K-means minimizes within-cluster variance:

$$J = \sum_{i=1}^{N} \sum_{j=1}^{k} r_{ij} \, \lVert x_i - \mu_j \rVert^2$$

Where:
- $r_{ij}$ is 1 if vector $i$ belongs to cluster $j$, and 0 otherwise
- $\mu_j$ is the centroid of cluster $j$
Quantization Error
Product quantization minimizes the reconstruction error summed over subvectors:

$$E = \sum_{x} \sum_{j=1}^{M} \lVert x^{(j)} - c_{q_j(x)}^{(j)} \rVert^2$$

Where:
- $x^{(j)}$ is the $j$-th subvector of $x$
- $c_{q_j(x)}^{(j)}$ is the codeword assigned to that subvector by subquantizer $j$
Asymmetric Distance Computation
For efficiency, distances are computed against the compressed codes without decompressing them:

$$\hat{d}(q, x)^2 = \sum_{j=1}^{M} \lVert q^{(j)} - c_{q_j(x)}^{(j)} \rVert^2$$

Each term depends only on a query subvector and one of the $2^{nbits}$ codewords, so a distance table of shape M × 2^nbits can be pre-computed once per query; scoring a database vector then reduces to M table lookups and additions.
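A minimal NumPy sketch of the pre-computation and lookup, assuming `codebooks` has shape (M, 256, d_sub), i.e. one 256-entry codebook per subspace as produced by PQ training:

```python
import numpy as np

def build_distance_table(query, codebooks):
    # codebooks: (M, 256, d_sub); split the query into M subvectors
    M, ksub, d_sub = codebooks.shape
    q_sub = query.reshape(M, 1, d_sub)
    # Squared distance from every query subvector to every codeword -> shape (M, 256)
    return ((codebooks - q_sub) ** 2).sum(axis=2)

def adc_distance(table, code):
    # Approximate squared distance to one database vector = M table lookups
    return table[np.arange(len(code)), code].sum()
```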
Key Parameters
IVF Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| nlist | √N to N/100 | Number of clusters; more clusters give a finer partition but longer training |
| nprobe | 1-128 | Clusters searched per query; higher values improve recall at the cost of speed |
| train_size | 30×nlist | Number of vectors used to train the coarse quantizer |
PQ Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| M | 8-64 | Number of subvectors; more subvectors give better accuracy but larger codes |
| nbits | 8 | Bits per subquantizer |
| codebook_size | 2^nbits | Codewords per subspace (256 at nbits=8) |
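The code size per vector follows directly from these two parameters; a quick check, assuming the common nbits=8 setting:

```python
def pq_code_bytes(M, nbits=8):
    # Each subvector is encoded with nbits, so a full code takes M * nbits bits
    return M * nbits // 8

print(pq_code_bytes(M=32))  # 32 bytes per vector vs. 768 * 4 = 3072 bytes uncompressed
```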
Performance Analysis
Memory Savings
Original size vs compressed size:
| Vectors | Dimension | Original | IVF-PQ (M=32) | Compression |
|---|---|---|---|---|
| 1M | 128 | 512 MB | 32 MB | 16× |
| 10M | 512 | 20 GB | 320 MB | 64× |
| 100M | 768 | 300 GB | 3.2 GB | 96× |
| 1B | 768 | 3 TB | 32 GB | 96× |
Recall vs Speed Tradeoff
| nprobe | Recall | QPS | Latency |
|---|---|---|---|
| 1 | 70% | 50,000 | 0.02 ms |
| 4 | 85% | 20,000 | 0.05 ms |
| 16 | 93% | 5,000 | 0.2 ms |
| 64 | 97% | 1,250 | 0.8 ms |
| 256 | 99% | 300 | 3.3 ms |
Implementation with Faiss
Basic IVF-PQ Index
```python
import faiss
import numpy as np

def create_ivf_pq_index(vectors, nlist=1024, m=32):
    d = vectors.shape[1]
    # Create quantizer (coarse quantizer)
    quantizer = faiss.IndexFlatL2(d)
    # Create IVF-PQ index
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
    # Train on subset
    train_size = max(nlist * 30, 100000)
    train_data = vectors[np.random.choice(len(vectors), train_size)]
    index.train(train_data)
    # Add vectors
    index.add(vectors)
    return index
```
Optimized Search
```python
def search_ivf_pq(index, queries, k=10, nprobe=16):
    # Set search parameters
    index.nprobe = nprobe
    # Search
    distances, indices = index.search(queries, k)
    return indices, distances

# Example usage
index = create_ivf_pq_index(database_vectors)
results, scores = search_ivf_pq(index, query_vectors)
```
Advanced Configuration
```python
# IVF-PQ with preprocessing
index = faiss.index_factory(d, "OPQ32,IVF1024,PQ32x8", faiss.METRIC_L2)

# Components:
# - OPQ32:   Optimized Product Quantization rotation (32 subspaces)
# - IVF1024: 1024 clusters
# - PQ32x8:  32 subvectors, 8 bits each
```
Optimization Strategies
1. Optimized Product Quantization (OPQ)
Apply rotation to minimize quantization error:
```python
import faiss

def train_opq(vectors, m, nbits=8):
    d = vectors.shape[1]
    # Learn a rotation matrix that minimizes PQ quantization error
    opq = faiss.OPQMatrix(d, m)
    opq.train(vectors)
    # Rotate vectors into the learned basis
    rotated_vectors = opq.apply_py(vectors)
    # Train the product quantizer in the rotated space
    pq = faiss.ProductQuantizer(d, m, nbits)
    pq.train(rotated_vectors)
    return opq, pq
```
2. Multi-Index Hashing
Use multiple hash tables for better recall:
```python
class MultiIndexIVF:
    def __init__(self, d, n_tables=4):
        self.tables = []
        for _ in range(n_tables):
            # Different random projections
            self.tables.append(create_ivf_index(d))

    def search(self, query, k):
        candidates = set()
        for table in self.tables:
            results = table.search(query, k * 2)
            candidates.update(results)
        # Rerank candidates
        return rerank(query, candidates, k)
```
3. GPU Acceleration
```python
# Move index to GPU
gpu_res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_res, 0, cpu_index)

# Batch search on GPU
batch_size = 1000
distances, indices = gpu_index.search(queries[:batch_size], k)
```
Comparison with Other Methods
IVF-PQ vs HNSW
| Aspect | IVF-PQ | HNSW |
|---|---|---|
| Memory | Very low (compressed codes) | High (original vectors) |
| Build time | Fast | Slow |
| Search speed | Fast | Very fast |
| Recall | Good (90-95%) | Excellent (95-99%) |
| Scalability | Billions of vectors | Millions of vectors |
| Updates | Batch preferred | Incremental |
When to Use Each
Use IVF-PQ when:
- Dataset has billions of vectors
- Memory is severely limited
- Moderate recall (90%) is acceptable
- Batch updates are feasible
Use HNSW when:
- Need highest recall (>95%)
- Dataset fits in memory
- Real-time updates required
- Latency is critical
Advanced Topics
1. Hierarchical IVF
Two-level clustering for very large datasets:
```python
class HierarchicalIVF:
    def __init__(self, d, n_coarse=100, n_fine=100):
        self.n_coarse = n_coarse
        self.n_fine = n_fine
        self.coarse_centroids = None
        self.fine_centroids = {}  # Per coarse cluster

    def build(self, vectors):
        # Level 1: Coarse clustering
        self.coarse_centroids = kmeans(vectors, self.n_coarse)
        # Level 2: Fine clustering within each coarse cluster
        for i in range(self.n_coarse):
            cluster_vectors = get_cluster_vectors(vectors, i)
            self.fine_centroids[i] = kmeans(cluster_vectors, self.n_fine)
```
2. Learned Quantization
Using neural networks for better compression:
```python
import torch
import torch.nn as nn

class LearnedPQ(nn.Module):
    def __init__(self, d, m, k):
        super().__init__()
        self.m = m
        self.d_sub = d // m
        self.encoders = nn.ModuleList([
            nn.Linear(self.d_sub, k) for _ in range(m)
        ])
        self.decoders = nn.ModuleList([
            nn.Linear(k, self.d_sub) for _ in range(m)
        ])

    def forward(self, x):
        # Encode subvectors
        codes = []
        for i, encoder in enumerate(self.encoders):
            subvec = x[:, i * self.d_sub:(i + 1) * self.d_sub]
            codes.append(encoder(subvec))
        # Decode for reconstruction
        reconstructed = [decoder(code) for code, decoder in zip(codes, self.decoders)]
        return torch.cat(reconstructed, dim=1)
```
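A minimal training loop for such a model might look like the sketch below (the `train_loader` of vector batches and the Adam hyperparameters are illustrative assumptions, not prescribed values):

```python
import torch

model = LearnedPQ(d=768, m=32, k=256)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(10):
    for batch in train_loader:                 # batch: (batch_size, 768) float tensor
        reconstructed = model(batch)
        loss = loss_fn(reconstructed, batch)   # minimize reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```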
3. Filtering and Metadata
Combining vector search with attribute filtering:
```python
class FilteredIVFPQ:
    def search_with_filter(self, query, filter_func, k):
        # Increase nprobe to compensate for filtering
        self.index.nprobe = self.nprobe * 4
        # Over-retrieve (faiss returns distances first, then indices)
        distances, candidates = self.index.search(query, k * 10)
        # Apply filter
        filtered_results = []
        for idx, dist in zip(candidates[0], distances[0]):
            if filter_func(self.metadata[idx]):
                filtered_results.append((idx, dist))
                if len(filtered_results) >= k:
                    break
        return filtered_results
```
Best Practices
1. Training Data Selection
- Use a representative sample (not an arbitrary slice of the data)
- Include edge cases and outliers
- Use at least 30×nlist vectors for training
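One simple way to get a spread-out sample is to stride evenly across the whole dataset rather than take a contiguous block; a small sketch, sized with the 30×nlist guideline above:

```python
import numpy as np

def select_train_vectors(vectors, nlist, per_centroid=30):
    train_size = min(len(vectors), nlist * per_centroid)
    # Evenly strided indices cover the whole dataset instead of one contiguous chunk
    idx = np.linspace(0, len(vectors) - 1, train_size).astype(int)
    return vectors[idx]
```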
2. Parameter Tuning
```python
import numpy as np

def tune_ivf_pq(vectors, queries, ground_truth):
    best_params = {}
    best_score = 0
    for nlist in [256, 512, 1024, 2048]:
        for m in [8, 16, 32, 64]:
            for nprobe in [1, 4, 16, 64]:
                index = create_index(vectors, nlist, m)
                index.nprobe = nprobe
                recall = compute_recall(index, queries, ground_truth)
                qps = measure_qps(index, queries)
                score = recall * np.log(qps)  # Balance recall and speed
                if score > best_score:
                    best_score = score
                    best_params = {'nlist': nlist, 'm': m, 'nprobe': nprobe}
    return best_params
```
3. Monitoring and Debugging
- Track quantization error during training
- Monitor cluster imbalance
- Measure actual compression ratio
- Validate recall on test set
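A small diagnostic sketch along these lines, assuming a trained faiss IndexIVFPQ as built above (the imbalance ratio and compression figure are illustrative metrics, not faiss APIs):

```python
import numpy as np

def report_index_stats(index, d, m=32):
    # Cluster imbalance: how uneven the inverted lists are
    sizes = np.array([index.invlists.list_size(i) for i in range(index.nlist)])
    print(f"largest / average list size: {sizes.max() / max(sizes.mean(), 1):.1f}")
    print(f"empty clusters: {(sizes == 0).sum()} of {index.nlist}")
    # Actual compression ratio: raw float32 bytes vs. stored PQ code bytes
    print(f"compression ratio: {(d * 4) / m:.0f}x")
```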
Real-World Applications
1. Image Search (Pinterest)
- 3 billion images
- 768d embeddings from CNN
- IVF4096,PQ32 configuration
- 95% recall at 100ms latency
2. Semantic Search (Spotify)
- 100M+ songs
- Audio embeddings
- IVF2048,PQ64 with OPQ
- Real-time recommendations
3. Document Retrieval (Elastic)
- Billions of documents
- BERT embeddings (768d)
- Hierarchical IVF-PQ
- Sub-second search
Related Concepts
- HNSW Search - Graph-based alternative
- LSH Search - Hash-based approach
- Vector Quantization - Compression techniques
- Dense Embeddings - Vector representations
References
- Jégou, Douze, and Schmid, "Product Quantization for Nearest Neighbor Search," IEEE TPAMI, 2011.
- Faiss: A Library for Efficient Similarity Search (facebookresearch/faiss).
- Johnson, Douze, and Jégou, "Billion-Scale Similarity Search with GPUs," 2017.
- Babenko and Lempitsky, "The Inverted Multi-Index," CVPR 2012.