Vector Quantization Techniques
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.
Vector quantization is the art of compressing high-dimensional vectors while preserving their essential properties, enabling billion-scale similarity search on commodity hardware.
Interactive Quantization Explorer
Visualize how different quantization techniques compress vectors and affect search quality.
[Interactive visualization: scalar and product quantization methods]
Why Quantization Matters
The Memory Challenge
Storing embeddings at scale:
Scale | Vectors | Dimension | Float32 Size | Quantized (PQ32) | Savings |
---|---|---|---|---|---|
Small | 1M | 768 | 3 GB | 32 MB | 96× |
Medium | 100M | 768 | 300 GB | 3.2 GB | 96× |
Large | 1B | 768 | 3 TB | 32 GB | 96× |
Huge | 10B | 768 | 30 TB | 320 GB | 96× |
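A quick back-of-the-envelope check of the table above (a sketch; it assumes decimal units, 32-byte PQ codes, and ignores the small fixed cost of the codebooks):

```python
def raw_gb(num_vectors, dim=768, bytes_per_value=4):
    """Memory for float32 vectors, in (decimal) GB."""
    return num_vectors * dim * bytes_per_value / 1e9

def pq_gb(num_vectors, code_bytes=32):
    """Memory for PQ32 codes (32 bytes per vector), in GB."""
    return num_vectors * code_bytes / 1e9

for name, n in [("Small", 1e6), ("Medium", 1e8), ("Large", 1e9), ("Huge", 1e10)]:
    raw, pq = raw_gb(n), pq_gb(n)
    print(f"{name}: {raw:,.1f} GB raw vs {pq:,.2f} GB PQ32 ({raw / pq:.0f}x)")
```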
Quantization Methods
1. Scalar Quantization (SQ)
Map each dimension to a smaller representation:
```python
import numpy as np

def scalar_quantize(vector, bits=8):
    """Quantize each dimension independently."""
    # Find min/max for normalization
    vmin, vmax = vector.min(), vector.max()
    # Normalize to [0, 1]
    normalized = (vector - vmin) / (vmax - vmin)
    # Quantize to n bits
    levels = 2 ** bits
    quantized = np.round(normalized * (levels - 1)).astype(np.uint8)
    return quantized, (vmin, vmax)

def scalar_dequantize(quantized, params, bits=8):
    """Reconstruct approximate float values from quantized codes."""
    vmin, vmax = params
    levels = 2 ** bits
    # Denormalize back to the original range
    normalized = quantized.astype(np.float32) / (levels - 1)
    reconstructed = normalized * (vmax - vmin) + vmin
    return reconstructed
```
Characteristics:
- Simple and fast
- 4× compression (float32 → uint8)
- Uniform quantization error
- No training required
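A minimal round-trip through the functions above (a sketch; the random test vector is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
vec = rng.normal(size=768).astype(np.float32)           # a single 768-dim embedding

codes, params = scalar_quantize(vec, bits=8)
restored = scalar_dequantize(codes, params, bits=8)

print("compression  :", vec.nbytes / codes.nbytes, "x")  # 4.0x (float32 -> uint8)
print("max abs error:", np.abs(vec - restored).max())    # at most half a quantization step
```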
2. Product Quantization (PQ)
Divide the vector into m subvectors and quantize each independently:
```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, d, m, k=256):
        """
        d: vector dimension
        m: number of subvectors
        k: codebook size (typically 256 for 8-bit codes)
        """
        self.d = d
        self.m = m
        self.k = k
        self.d_sub = d // m
        self.codebooks = []

    def train(self, vectors):
        """Learn one codebook per subspace using k-means."""
        self.codebooks = []
        for i in range(self.m):
            # Extract the i-th subvectors
            start, end = i * self.d_sub, (i + 1) * self.d_sub
            subvectors = vectors[:, start:end]
            # Run k-means in this subspace
            kmeans = KMeans(n_clusters=self.k)
            kmeans.fit(subvectors)
            # Store the codebook (k x d_sub)
            self.codebooks.append(kmeans.cluster_centers_)

    def encode(self, vector):
        """Encode a vector as m one-byte codes."""
        codes = []
        for i in range(self.m):
            start, end = i * self.d_sub, (i + 1) * self.d_sub
            subvector = vector[start:end]
            # Find the nearest codeword in this subspace
            distances = np.linalg.norm(self.codebooks[i] - subvector, axis=1)
            codes.append(np.argmin(distances))
        return np.array(codes, dtype=np.uint8)

    def decode(self, codes):
        """Reconstruct an approximate vector from its codes."""
        return np.concatenate([self.codebooks[i][code] for i, code in enumerate(codes)])
```
Compression ratio: each vector shrinks from $4d$ bytes (float32) to $m$ bytes (one 8-bit code per subquantizer), a factor of $4d/m$. For $d = 768$ and $m = 32$ that is 3,072 bytes → 32 bytes, i.e. 96×, plus a small fixed cost for storing the codebooks.
3. Optimized Product Quantization (OPQ)
Apply rotation before PQ to minimize error:
```python
import numpy as np

def random_orthogonal_matrix(d):
    """Random rotation via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(np.random.randn(d, d))
    return q

def solve_procrustes(X, X_hat):
    """Orthogonal matrix R minimizing ||X R - X_hat||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(X.T @ X_hat)
    return u @ vt

class OptimizedPQ:
    def __init__(self, d, m):
        self.d = d
        self.pq = ProductQuantizer(d, m)
        self.rotation = None

    def train(self, vectors, num_iters=10):
        """Learn the rotation and the codebooks jointly by alternating updates."""
        # Initialize with a random rotation
        self.rotation = random_orthogonal_matrix(self.d)
        for _ in range(num_iters):
            # Step 1: rotate the training vectors
            rotated = vectors @ self.rotation
            # Step 2: train PQ in the rotated space
            self.pq.train(rotated)
            # Step 3: update the rotation to align the data with its reconstructions
            reconstructed = np.stack([self.pq.decode(self.pq.encode(v)) for v in rotated])
            self.rotation = solve_procrustes(vectors, reconstructed)

    def encode(self, vector):
        return self.pq.encode(vector @ self.rotation)
```
4. Residual Quantization (RQ)
Iteratively quantize residuals:
```python
import numpy as np
from sklearn.cluster import KMeans

class ResidualQuantizer:
    def __init__(self, num_stages=4, k=256):
        self.num_stages = num_stages
        self.k = k
        self.stages = []          # one k-means quantizer per stage

    def train(self, vectors):
        """Each stage is trained on the residuals left by the previous stages."""
        residuals = vectors.copy()
        for _ in range(self.num_stages):
            # Train this stage's quantizer on the current residuals
            quantizer = KMeans(n_clusters=self.k).fit(residuals)
            self.stages.append(quantizer)
            # Subtract the reconstruction to get the new residuals
            residuals = residuals - quantizer.cluster_centers_[quantizer.labels_]

    def encode(self, vector):
        codes = []
        residual = vector
        for quantizer in self.stages:
            code = int(quantizer.predict(residual.reshape(1, -1))[0])
            codes.append(code)
            # Update the residual with this stage's reconstruction
            residual = residual - quantizer.cluster_centers_[code]
        return codes

    def decode(self, codes):
        """Sum the selected codewords from every stage."""
        return np.sum([q.cluster_centers_[c] for q, c in zip(self.stages, codes)], axis=0)
```
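A short sketch of the key property of RQ, using the k-means-based class above: each additional stage refines the reconstruction (the data, k, and sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 64)).astype(np.float32)

rq = ResidualQuantizer(num_stages=4, k=64)
rq.train(data)

x = data[0]
codes = rq.encode(x)
recon = np.zeros_like(x)
for stage, (quantizer, code) in enumerate(zip(rq.stages, codes), start=1):
    recon = recon + quantizer.cluster_centers_[code]
    print(f"after stage {stage}: reconstruction error = {np.linalg.norm(x - recon):.3f}")
```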
5. Binary Quantization
Extreme compression to 1 bit per dimension:
```python
import numpy as np

def binary_quantize(vectors):
    """Convert vectors to packed binary codes (1 bit per dimension)."""
    # Center the data so that the sign carries information
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Binarize: 1 if above the per-dimension mean, else 0
    binary = (centered > 0).astype(np.uint8)
    # Pack 8 bits per byte
    packed = np.packbits(binary, axis=1)
    return packed, mean

def hamming_distance(binary1, binary2):
    """Hamming distance between two packed binary codes."""
    xor = np.bitwise_xor(binary1, binary2)
    return np.unpackbits(xor).sum()
```
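A quick usage sketch (random data; names and sizes are illustrative): a 768-dimensional float32 vector becomes 96 packed bytes, and candidates are ranked by Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 768)).astype(np.float32)

packed, mean = binary_quantize(vectors)
print(packed.shape)                                    # (1000, 96): 32x smaller than float32

# Rank the database by Hamming distance to the first vector's code
query_code = packed[0]
dists = [hamming_distance(query_code, code) for code in packed]
print("nearest by Hamming distance:", np.argsort(dists)[:5])
```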
Asymmetric Distance Computation (ADC)
Compute distances without decompressing:
```python
import numpy as np

def compute_adc_table(query, codebooks):
    """Precompute per-subspace distance tables for a query."""
    m = len(codebooks)
    d_sub = len(query) // m
    tables = []
    for i in range(m):
        subquery = query[i * d_sub:(i + 1) * d_sub]
        # Squared distance from the query subvector to every codeword,
        # so that summing over subspaces gives the full squared L2 distance
        distances = np.sum((codebooks[i] - subquery) ** 2, axis=1)
        tables.append(distances)
    return tables

def adc_distance(codes, tables):
    """Approximate squared distance via m table lookups."""
    distance = 0.0
    for i, code in enumerate(codes):
        distance += tables[i][code]
    return distance
```
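Putting PQ and ADC together (a sketch built on the ProductQuantizer class above; the dataset is random and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
database = rng.normal(size=(5000, 128)).astype(np.float32)
query = rng.normal(size=128).astype(np.float32)

pq = ProductQuantizer(d=128, m=8, k=256)      # 8 bytes per vector
pq.train(database)
codes = np.stack([pq.encode(v) for v in database])

# One table computation per query, then only m lookups per database vector
tables = compute_adc_table(query, pq.codebooks)
dists = np.array([adc_distance(c, tables) for c in codes])
print("approximate top-5:", np.argsort(dists)[:5])
```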
Advanced Techniques
1. Additive Quantization (AQ)
Generalization of PQ without orthogonality:
$$x \approx \sum_{m=1}^{M} C_m b_m$$

where the $C_m$ are full-dimensional codebooks and the $b_m$ are one-hot selection vectors that pick one codeword from each codebook. Unlike PQ, the codewords are not restricted to disjoint subspaces, which lowers quantization error at the cost of a harder encoding (assignment) problem.
2. Composite Quantization (CQ)
Constrain codewords to be nearly orthogonal:
```python
class CompositeQuantizer:
    def train(self, vectors):
        # Minimize quantization error
        # subject to: C_i^T C_j ≈ 0 for i ≠ j (near-orthogonality across codebooks)
        objective = quantization_error + reg_weight * orthogonality_penalty
        # Schematic: in practice the optimization alternates between
        # codeword assignment and codebook updates
        optimize(objective)
```
3. Learning-based Quantization
Use neural networks for quantization:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralQuantizer(nn.Module):
    def __init__(self, d, num_codes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 256),
            nn.ReLU(),
            nn.Linear(256, num_codes)
        )
        self.codebook = nn.Parameter(torch.randn(num_codes, d))

    def forward(self, x):
        # Soft assignment over the codebook
        logits = self.encoder(x)
        weights = F.softmax(logits, dim=-1)
        soft = weights @ self.codebook

        # Hard assignment: one codeword per input
        hard = self.codebook[logits.argmax(dim=-1)]

        # Straight-through estimator: the forward pass uses the hard codeword,
        # while gradients flow through the soft assignment to the encoder and codebook
        return soft + (hard - soft).detach()
```
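A toy training loop for the module above (a sketch; the reconstruction loss, optimizer settings, and sizes are illustrative assumptions, not a prescribed recipe):

```python
import torch
import torch.nn.functional as F

model = NeuralQuantizer(d=128, num_codes=256)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randn(4096, 128)                    # toy embeddings
for step in range(200):
    batch = data[torch.randint(0, len(data), (64,))]
    quantized = model(batch)
    loss = F.mse_loss(quantized, batch)          # reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```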
Performance Analysis
Quantization Error
Expected reconstruction error: for uniform $b$-bit scalar quantization with step size $\Delta = (v_{\max} - v_{\min})/(2^b - 1)$, each dimension contributes an error variance of about $\Delta^2/12$, so

$$E\left[\lVert x - \hat{x} \rVert^2\right] \approx \frac{d\,\Delta^2}{12}.$$

For product quantization, the expected error is instead the sum of the k-means distortions of the $m$ subquantizers, which is why increasing $m$ or $k$ reduces it.
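A quick empirical check of the Δ²/12 rule against the scalar quantizer from earlier (a sketch assuming uniformly distributed values):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(-1.0, 1.0, size=100_000).astype(np.float32)

codes, (vmin, vmax) = scalar_quantize(values, bits=8)
restored = scalar_dequantize(codes, (vmin, vmax), bits=8)

step = (vmax - vmin) / (2 ** 8 - 1)              # quantization step size Δ
print("empirical MSE :", np.mean((values - restored) ** 2))
print("Δ²/12 estimate:", step ** 2 / 12)
```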
Search Quality Impact
Method | Compression | Recall@10 Loss | Speed Gain |
---|---|---|---|
No compression | 1× | 0% | 1× |
SQ8 | 4× | 1-2% | 2× |
PQ32 | 96× | 5-10% | 10× |
OPQ32 | 96× | 3-7% | 10× |
Binary | 32× | 15-25% | 50× |
Memory vs Accuracy Tradeoff
```python
def estimate_recall(compression_ratio):
    """Empirical formula for recall degradation"""
    if compression_ratio <= 4:
        return 0.99   # Scalar quantization
    elif compression_ratio <= 32:
        return 0.95   # Moderate PQ
    elif compression_ratio <= 100:
        return 0.90   # Aggressive PQ
    else:
        return 0.80   # Extreme compression
```
Implementation Best Practices
1. Choosing Parameters
```python
def choose_pq_parameters(d, memory_budget_mb, num_vectors):
    """Select PQ parameters that fit a memory budget."""
    bytes_per_vector = memory_budget_mb * 1e6 / num_vectors
    # Each subquantizer emits 1 byte (8-bit code)
    m = max(1, int(bytes_per_vector))
    # Ensure the dimension is divisible by m
    while d % m != 0 and m > 1:
        m -= 1
    # Validate the resulting subvector dimension
    d_sub = d // m
    if d_sub > 32:
        print(f"Warning: subvector dimension {d_sub} is large")
    return m
```
2. Training Strategy
```python
import numpy as np

def train_quantizer(vectors, d, m, sample_size=1_000_000):
    """Train a product quantizer on a representative sample."""
    n = len(vectors)
    if n <= sample_size:
        # Small corpus: use all data
        training_set = vectors
    else:
        # Subsample (uniformly here; stratified sampling also works)
        indices = np.random.choice(n, sample_size, replace=False)
        training_set = vectors[indices]

    # Train with multiple random initializations and keep the best
    best_quantizer, best_error = None, float('inf')
    for _ in range(3):
        quantizer = ProductQuantizer(d, m)
        quantizer.train(training_set)
        # Mean squared reconstruction error on the training sample
        reconstructed = np.stack([quantizer.decode(quantizer.encode(v)) for v in training_set])
        error = float(np.mean((training_set - reconstructed) ** 2))
        if error < best_error:
            best_error, best_quantizer = error, quantizer
    return best_quantizer
```
3. Hybrid Approaches
```python
class HybridQuantizer:
    """Combine multiple quantization methods: coarse SQ plus fine PQ on the residual."""
    def __init__(self, d):
        # ScalarQuantizer is assumed to wrap the scalar quantization shown earlier
        self.coarse_quantizer = ScalarQuantizer(bits=4)
        self.fine_quantizer = ProductQuantizer(d, m=16)

    def encode(self, vector):
        # Coarse quantization
        coarse = self.coarse_quantizer.encode(vector)
        # Compute the residual left by the coarse stage
        reconstructed = self.coarse_quantizer.decode(coarse)
        residual = vector - reconstructed
        # Fine quantization of the residual
        fine = self.fine_quantizer.encode(residual)
        return coarse, fine
```
Real-World Applications
1. Facebook's Faiss
```python
# Billion-scale image search
index = faiss.index_factory(
    2048,                      # ResNet features
    "OPQ64,IVF262144,PQ64",
    faiss.METRIC_INNER_PRODUCT
)
```
2. Google's ScaNN
```python
# YouTube recommendations
searcher = scann.scann_ops_pybind.builder(embeddings, 10, "dot_product") \
    .tree(num_leaves=2000, num_leaves_to_search=100) \
    .score_ah(2, anisotropic_quantization_threshold=0.2) \
    .build()
```
3. Microsoft's SPTAG
```python
# Bing search
index = SPTAG.AnnIndex('BKT', 'Float', dim)
index.SetBuildParam("NumberOfThreads", "16")
index.SetBuildParam("DistCalcMethod", "L2")
index.Build(vectors)
```
Evaluation Metrics
1. Reconstruction Error

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \mathrm{decode}(\mathrm{encode}(x_i)) \rVert^2$$

Lower is better, but a small reconstruction error does not by itself guarantee that neighbor rankings are preserved.
2. Search Quality
```python
import numpy as np

def compute_recall(gt, pred):
    """Average fraction of ground-truth neighbors recovered per query."""
    return np.mean([len(set(g) & set(p)) / len(g) for g, p in zip(gt, pred)])

def evaluate_quantization(original_index, quantized_index, queries, k=10):
    """Compare search quality before and after quantization."""
    # Ground truth from the uncompressed index (Faiss returns distances, then indices)
    _, gt_indices = original_index.search(queries, k)
    # Results from the quantized index
    _, q_indices = quantized_index.search(queries, k)
    # Compute recall of the quantized results against the ground truth
    recall = compute_recall(gt_indices, q_indices)
    return {
        'recall@10': recall,
        'compression_ratio': get_compression_ratio(),
        'search_speedup': measure_speedup()
    }
```
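A concrete way to exercise the recall comparison with Faiss (a sketch; the index types, sizes, and random data are illustrative):

```python
import numpy as np
import faiss

d = 128
data = np.random.randn(10_000, d).astype('float32')
queries = np.random.randn(100, d).astype('float32')

flat = faiss.IndexFlatL2(d)          # exact baseline
flat.add(data)

pq_index = faiss.IndexPQ(d, 16, 8)   # 16 subquantizers, 8 bits each
pq_index.train(data)
pq_index.add(data)

_, gt = flat.search(queries, 10)
_, approx = pq_index.search(queries, 10)
print("recall@10:", compute_recall(gt, approx))
```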
Future Directions
1. Learned Compression
- End-to-end training
- Task-specific optimization
- Adaptive bit allocation
2. Hardware Acceleration
- AVX-512 for distance computation
- GPU quantization kernels
- TPU/NPU support
3. Multi-resolution Quantization
- Progressive refinement
- Adaptive precision
- Query-dependent detail
Related Concepts
- IVF-PQ - Combining with clustering
- Dense Embeddings - What we're compressing
- ANN Comparison - Impact on search
- Index Structures - Storage organization
References
- Jégou et al. "Product Quantization for Nearest Neighbor Search"
- Ge et al. "Optimized Product Quantization"
- Martinez et al. "Revisiting Additive Quantization"
- Guo et al. "Accelerated Nearest Neighbor Search with Composite Quantization"