Vector Quantization Techniques


Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.



Vector quantization is the art of compressing high-dimensional vectors while preserving their essential properties, enabling billion-scale similarity search on commodity hardware.

Interactive Quantization Explorer

Visualize how different quantization techniques compress vectors and affect search quality. The interactive explorer demonstrates scalar quantization: at 8 bits it shows 256 quantization levels, a 4.0:1 compression ratio over float32, and a maximum quantization error of 0.0039.

Why Quantization Matters

The Memory Challenge

Storing embeddings at scale:

| Scale  | Vectors | Dimension | Float32 Size | Quantized (PQ32) | Savings |
|--------|---------|-----------|--------------|------------------|---------|
| Small  | 1M      | 768       | 3 GB         | 32 MB            | 96×     |
| Medium | 100M    | 768       | 300 GB       | 3.2 GB           | 94×     |
| Large  | 1B      | 768       | 3 TB         | 32 GB            | 96×     |
| Huge   | 10B     | 768       | 30 TB        | 320 GB           | 96×     |
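
The arithmetic behind the table is straightforward; a minimal sketch, assuming 32-byte PQ codes and ignoring codebook overhead (which is negligible at this scale):

def memory_footprint(num_vectors, dim, bytes_per_code=32):
    """Compare float32 storage with fixed-size PQ codes (codebook overhead ignored)."""
    float32_bytes = num_vectors * dim * 4      # 4 bytes per float32 component
    pq_bytes = num_vectors * bytes_per_code    # 1 byte per sub-quantizer code
    return float32_bytes, pq_bytes, float32_bytes / pq_bytes

raw, compressed, ratio = memory_footprint(1_000_000_000, 768)
print(f"{raw / 1e12:.1f} TB -> {compressed / 1e9:.0f} GB ({ratio:.0f}x)")  # 3.1 TB -> 32 GB (96x)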

Quantization Methods

1. Scalar Quantization (SQ)

Map each dimension to a smaller representation:

import numpy as np

def scalar_quantize(vector, bits=8):
    """Quantize each dimension independently."""
    # Find min/max for normalization
    vmin, vmax = vector.min(), vector.max()

    # Normalize to [0, 1]
    normalized = (vector - vmin) / (vmax - vmin)

    # Quantize to n bits
    levels = 2 ** bits
    quantized = np.round(normalized * (levels - 1)).astype(np.uint8)

    return quantized, (vmin, vmax)

def scalar_dequantize(quantized, params, bits=8):
    """Reconstruct from quantized values."""
    vmin, vmax = params
    levels = 2 ** bits

    # Denormalize
    normalized = quantized.astype(np.float32) / (levels - 1)
    reconstructed = normalized * (vmax - vmin) + vmin

    return reconstructed
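
A quick round trip with these functions (the 128-dimensional random vector is just an illustrative input):

import numpy as np

vector = np.random.randn(128).astype(np.float32)
codes, params = scalar_quantize(vector, bits=8)
restored = scalar_dequantize(codes, params, bits=8)

print(vector.nbytes, codes.nbytes)        # 512 bytes -> 128 bytes (4x smaller)
print(np.abs(vector - restored).max())    # stays within one quantization step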

Characteristics:

  • Simple and fast
  • 4× compression (float32 → uint8)
  • Uniform quantization error
  • No training required

2. Product Quantization (PQ)

Divide vector into subvectors and quantize independently:

x ∈ ℝ^d → [c_1, c_2, ..., c_m] ∈ {0, ..., k−1}^m
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, d, m, k=256):
        """
        d: dimension
        m: number of subvectors
        k: codebook size (typically 256 for 8-bit codes)
        """
        self.d = d
        self.m = m
        self.k = k
        self.d_sub = d // m
        self.codebooks = []

    def train(self, vectors):
        """Learn one codebook per subspace using k-means."""
        for i in range(self.m):
            # Extract the subvectors for this subspace
            start = i * self.d_sub
            end = (i + 1) * self.d_sub
            subvectors = vectors[:, start:end]

            # Run k-means
            kmeans = KMeans(n_clusters=self.k)
            kmeans.fit(subvectors)

            # Store the codebook
            self.codebooks.append(kmeans.cluster_centers_)

    def encode(self, vector):
        """Encode a vector as m codebook indices (PQ codes)."""
        codes = []
        for i in range(self.m):
            start = i * self.d_sub
            end = (i + 1) * self.d_sub
            subvector = vector[start:end]

            # Find the nearest codeword
            distances = np.linalg.norm(self.codebooks[i] - subvector, axis=1)
            codes.append(np.argmin(distances))
        return np.array(codes, dtype=np.uint8)

    def decode(self, codes):
        """Reconstruct a vector from its codes."""
        return np.concatenate([self.codebooks[i][code] for i, code in enumerate(codes)])

Compression ratio:

Ratio = (d × 32) / (m × 8) = 4d / m
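
For example, with d = 768 and m = 32, the ratio is (768 × 32) / (32 × 8) = 96×, matching the PQ32 column in the table above.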

3. Optimized Product Quantization (OPQ)

Apply rotation before PQ to minimize error:

class OptimizedPQ:
    def __init__(self, d, m):
        self.d = d
        self.pq = ProductQuantizer(d, m)
        self.rotation = None

    def train(self, vectors):
        """Learn the rotation and the PQ codebooks jointly."""
        # Initialize with a random orthogonal rotation
        self.rotation = random_orthogonal_matrix(self.d)

        for iteration in range(10):
            # Step 1: Rotate the vectors
            rotated = vectors @ self.rotation

            # Step 2: Train PQ on the rotated data
            self.pq.train(rotated)

            # Step 3: Update the rotation (orthogonal Procrustes problem)
            codes = [self.pq.encode(v) for v in rotated]
            reconstructed = [self.pq.decode(c) for c in codes]
            self.rotation = solve_procrustes(vectors, reconstructed)

    def encode(self, vector):
        rotated = vector @ self.rotation
        return self.pq.encode(rotated)
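
The snippet leaves random_orthogonal_matrix and solve_procrustes undefined; a minimal NumPy sketch of what they could look like (the rotation update is an orthogonal Procrustes problem with a closed-form SVD solution):

import numpy as np

def random_orthogonal_matrix(d):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(np.random.randn(d, d))
    return q

def solve_procrustes(X, X_hat):
    """Rotation R minimizing ||X @ R - X_hat||_F over orthogonal matrices."""
    u, _, vt = np.linalg.svd(X.T @ np.asarray(X_hat))
    return u @ vt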

4. Residual Quantization (RQ)

Iteratively quantize residuals:

class ResidualQuantizer:
    def __init__(self, d, num_stages=4):
        self.stages = []
        self.num_stages = num_stages

    def train(self, vectors):
        residuals = vectors.copy()
        for stage in range(self.num_stages):
            # Train a quantizer on the current residuals
            quantizer = train_quantizer(residuals)
            self.stages.append(quantizer)

            # Compute the new residuals
            quantized = quantizer.encode_decode(residuals)
            residuals = residuals - quantized

    def encode(self, vector):
        codes = []
        residual = vector
        for quantizer in self.stages:
            code = quantizer.encode(residual)
            codes.append(code)

            # Update the residual
            reconstructed = quantizer.decode(code)
            residual = residual - reconstructed
        return codes

5. Binary Quantization

Extreme compression to 1 bit per dimension:

def binary_quantize(vectors):
    """Convert vectors to packed binary codes."""
    # Center the data
    mean = vectors.mean(axis=0)
    centered = vectors - mean

    # Binarize: one bit per dimension
    binary = (centered > 0).astype(np.uint8)

    # Pack bits into bytes
    packed = np.packbits(binary, axis=1)

    return packed, mean

def hamming_distance(binary1, binary2):
    """Compute the Hamming distance between two binary codes."""
    xor = np.bitwise_xor(binary1, binary2)
    return np.unpackbits(xor).sum()
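
For example, 768-dimensional float32 vectors (3,072 bytes each) pack into 96-byte binary codes, a 32× reduction:

import numpy as np

X = np.random.randn(1000, 768).astype(np.float32)
codes, mean = binary_quantize(X)

print(X[0].nbytes, codes[0].nbytes)          # 3072 bytes -> 96 bytes (32x)
print(hamming_distance(codes[0], codes[1]))  # distance in bits, between 0 and 768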

Asymmetric Distance Computation (ADC)

Compute distances without decompressing:

def compute_adc_table(query, codebooks):
    """Precompute a distance lookup table for the query."""
    m = len(codebooks)
    d_sub = len(query) // m
    tables = []
    for i in range(m):
        subquery = query[i * d_sub:(i + 1) * d_sub]

        # Squared distance from the sub-query to every codeword
        distances = np.sum((codebooks[i] - subquery) ** 2, axis=1)
        tables.append(distances)
    return tables

def adc_distance(codes, tables):
    """Approximate squared distance via m table lookups."""
    distance = 0.0
    for i, code in enumerate(codes):
        distance += tables[i][code]
    return distance
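
Putting the pieces together, a sketch of ranking an entire PQ-encoded database against one query, where codes_db is an (N, m) array of codes and codebooks comes from a trained ProductQuantizer (both names are assumptions for this example):

import numpy as np

def adc_search(query, codes_db, codebooks, topk=10):
    """Rank PQ codes by approximate squared distance using table lookups."""
    tables = compute_adc_table(query, codebooks)   # m lookup tables, one per subspace

    # Sum the per-subspace lookups for every database vector at once
    dists = np.zeros(len(codes_db), dtype=np.float32)
    for i, table in enumerate(tables):
        dists += table[codes_db[:, i]]

    return np.argsort(dists)[:topk]

# e.g. neighbors = adc_search(query, codes_db, pq.codebooks)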

Advanced Techniques

1. Additive Quantization (AQ)

Generalization of PQ without orthogonality:

x ≈ Σ_{m=1}^{M} C_m[:, b_m]

where C_m are the codebooks and b_m are the selected codewords.
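
A minimal sketch of the reconstruction side, assuming codebooks is an (M, K, d) array and codes holds one index per codebook; the hard part of AQ, choosing the codes themselves, typically requires beam search and is omitted here:

import numpy as np

def aq_decode(codes, codebooks):
    """Reconstruct x as the sum of one codeword from each of the M codebooks."""
    # codebooks: (M, K, d), codes: length-M integer sequence
    return np.sum([codebooks[m][codes[m]] for m in range(len(codes))], axis=0)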

2. Composite Quantization (CQ)

Constrain codewords to be nearly orthogonal:

class CompositeQuantizer:
    def train(self, vectors):
        # Minimize quantization error
        # subject to: C_i^T C_j ≈ 0 for i != j
        objective = quantization_error + lam * orthogonality_penalty
        optimize(objective)

3. Learning-based Quantization

Use neural networks for quantization:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralQuantizer(nn.Module):
    def __init__(self, d, num_codes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 256),
            nn.ReLU(),
            nn.Linear(256, num_codes)
        )
        self.codebook = nn.Parameter(torch.randn(num_codes, d))

    def forward(self, x):
        # Soft assignment over the codebook
        logits = self.encoder(x)
        weights = F.softmax(logits, dim=-1)

        # Weighted combination of codewords
        quantized = weights @ self.codebook

        # Straight-through estimator for gradients
        quantized = x + (quantized - x).detach()
        return quantized

Performance Analysis

Quantization Error

Expected reconstruction error:

E[‖x − x̂‖²] ≈ (d / m) · k^(−2/d_sub) / 12
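
The constant depends on the data distribution, so in practice it is easier to measure reconstruction error directly; a quick empirical check using the ProductQuantizer class from above on synthetic Gaussian data (the sizes are arbitrary):

import numpy as np

d, m = 64, 8
X = np.random.randn(5000, d).astype(np.float32)

pq = ProductQuantizer(d=d, m=m, k=256)
pq.train(X)

X_hat = np.stack([pq.decode(pq.encode(x)) for x in X[:1000]])
mse = np.mean(np.sum((X[:1000] - X_hat) ** 2, axis=1))
print(f"empirical E[||x - x_hat||^2] ≈ {mse:.3f}")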

Search Quality Impact

| Method         | Compression | Recall@10 Loss | Speed Gain |
|----------------|-------------|----------------|------------|
| No compression | 1×          | 0%             | 1×         |
| SQ8            | 4×          | 1-2%           | —          |
| PQ32           | 96×         | 5-10%          | 10×        |
| OPQ32          | 96×         | 3-7%           | 10×        |
| Binary         | 32×         | 15-25%         | 50×        |

Memory vs Accuracy Tradeoff

def estimate_recall(compression_ratio):
    """Empirical rule of thumb for recall degradation."""
    if compression_ratio <= 4:
        return 0.99   # Scalar quantization
    elif compression_ratio <= 32:
        return 0.95   # Moderate PQ
    elif compression_ratio <= 100:
        return 0.90   # Aggressive PQ
    else:
        return 0.80   # Extreme compression

Implementation Best Practices

1. Choosing Parameters

def choose_pq_parameters(d, memory_budget_mb, num_vectors):
    """Select PQ parameters that fit a memory budget."""
    bytes_per_vector = memory_budget_mb * 1e6 / num_vectors

    # Each sub-quantizer code uses 1 byte
    m = int(bytes_per_vector)

    # Ensure d is divisible by m
    while d % m != 0 and m > 1:
        m -= 1

    # Validate
    d_sub = d // m
    if d_sub > 32:
        print(f"Warning: subvector dimension {d_sub} is large")

    return m
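
For example, a 64 MB budget for one million 768-dimensional vectors allows 64 bytes per vector, so the helper settles on m = 64 (d_sub = 12):

m = choose_pq_parameters(d=768, memory_budget_mb=64, num_vectors=1_000_000)
print(m)   # 64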

2. Training Strategy

def train_quantizer(vectors, m, sample_size=1_000_000):
    """Train a product quantizer on a representative sample."""
    d = vectors.shape[1]
    n = len(vectors)

    if n <= sample_size:
        # Use all data
        training_set = vectors
    else:
        # Stratified sampling
        indices = stratified_sample(vectors, sample_size)
        training_set = vectors[indices]

    # Train with multiple random initializations and keep the best
    best_quantizer = None
    best_error = float('inf')

    for seed in range(3):
        np.random.seed(seed)
        quantizer = ProductQuantizer(d, m)
        quantizer.train(training_set)

        error = compute_quantization_error(quantizer, training_set)
        if error < best_error:
            best_error = error
            best_quantizer = quantizer

    return best_quantizer

3. Hybrid Approaches

class HybridQuantizer:
    """Combine multiple quantization methods."""

    def __init__(self, d):
        self.coarse_quantizer = ScalarQuantizer(bits=4)
        self.fine_quantizer = ProductQuantizer(d, m=16)

    def encode(self, vector):
        # Coarse quantization
        coarse = self.coarse_quantizer.encode(vector)

        # Compute the residual
        reconstructed = self.coarse_quantizer.decode(coarse)
        residual = vector - reconstructed

        # Fine quantization of the residual
        fine = self.fine_quantizer.encode(residual)

        return coarse, fine

Real-World Applications

1. Facebook's Faiss

import faiss

# Billion-scale image search
index = faiss.index_factory(
    2048,                        # ResNet feature dimension
    "OPQ64,IVF262144,PQ64",
    faiss.METRIC_INNER_PRODUCT
)

2. Google's ScaNN

import scann

# YouTube recommendations
searcher = scann.scann_ops_pybind.builder(embeddings, 10, "dot_product") \
    .tree(num_leaves=2000, num_leaves_to_search=100) \
    .score_ah(2, anisotropic_quantization_threshold=0.2) \
    .build()

3. Microsoft's SPTAG

import SPTAG

# Bing search
index = SPTAG.AnnIndex('BKT', 'Float', dim)
index.SetBuildParam("NumberOfThreads", "16")
index.SetBuildParam("DistCalcMethod", "L2")
index.Build(vectors)

Evaluation Metrics

1. Reconstruction Error

MSE = (1/N) Σ_{i=1}^{N} ‖x_i − x̂_i‖²
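
In NumPy, with X the original vectors and X_hat their reconstructions (both shaped (N, d); the names are chosen for this example), this is a one-liner:

import numpy as np

def reconstruction_mse(X, X_hat):
    """Mean squared reconstruction error over N vectors."""
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))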

2. Search Quality

def evaluate_quantization(original_index, quantized_index, queries, k=10):
    """Compare search quality before and after quantization."""
    # Ground truth from the original index
    gt_indices, _ = original_index.search(queries, k)

    # Results from the quantized index
    q_indices, _ = quantized_index.search(queries, k)

    # Compute recall
    recall = compute_recall(gt_indices, q_indices)

    return {
        'recall@10': recall,
        'compression_ratio': get_compression_ratio(),
        'search_speedup': measure_speedup()
    }
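
compute_recall is left undefined above; a minimal version measuring the average overlap between the ground-truth and quantized top-k lists:

import numpy as np

def compute_recall(gt_indices, q_indices):
    """Fraction of true top-k neighbors that survive quantization, averaged over queries."""
    hits = [len(set(gt) & set(q)) / len(gt) for gt, q in zip(gt_indices, q_indices)]
    return float(np.mean(hits))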

Future Directions

1. Learned Compression

  • End-to-end training
  • Task-specific optimization
  • Adaptive bit allocation

2. Hardware Acceleration

  • AVX-512 for distance computation
  • GPU quantization kernels
  • TPU/NPU support

3. Multi-resolution Quantization

  • Progressive refinement
  • Adaptive precision
  • Query-dependent detail

References

  • Jégou et al. "Product Quantization for Nearest Neighbor Search"
  • Ge et al. "Optimized Product Quantization"
  • Martinez et al. "Revisiting Additive Quantization"
  • Guo et al. "Accelerated Nearest Neighbor Search with Composite Quantization"

If you found this explanation helpful, consider sharing it with others.
