Quantization Effects Simulator

Explore memory-accuracy trade-offs in embedding quantization from float32 to binary representations.

Quantization reduces the precision of embedding values to save memory and accelerate computation, with controllable trade-offs in accuracy.

Interactive Quantization Simulator

[Interactive widget: select a precision level (FP32 down to binary) and watch a small embedding matrix being quantized live, with per-value error plus aggregate RMSE, max error, cosine similarity, and outlier counts. Further panels visualize the quantization range (zero point, min, max), a layer-wise mixed-precision memory budget (attention, FFN, embeddings, output layers), and hardware-specific speedups, which are summarized in the comparison table below.]

Calibration & Optimization

The simulator's calibration control ranges from aggressive (95th percentile) to conservative (100th percentile). At the default 99.9% setting, the most extreme 0.1% of values are clipped before the quantization range is computed, which improves range utilization but may reduce accuracy on edge cases.

Comprehensive Method Comparison

| Method | Bits | Memory | Accuracy | CPU speedup | GPU speedup | Mobile speedup | Best For |
|---|---|---|---|---|---|---|---|
| FP32 (Baseline) | 32 | 100% | 100% | 1.0× | 1.0× | 0.3× | Training, research |
| FP16 (Half) | 16 | 50% | 99.7% | 1.2× | 2.5× | 1.5× | GPU inference, fine-tuning |
| BF16 (Brain Float) | 16 | 50% | 99.5% | 1.3× | 2.8× | 1.6× | TPU training, mixed precision |
| INT8 | 8 | 25% | 98.5% | 3.2× | 4.5× | 4.0× | Production, cloud serving |
| INT4 | 4 | 13% | 96.2% | 5.5× | 8.0× | 7.0× | Mobile apps, edge devices |
| INT2 (Ternary) | 2 | 6% | 92% | 10.0× | 15.0× | 12.0× | Ultra-low power, IoT |
| Binary | 1 | 3% | 87% | 20.0× | 32.0× | 25.0× | Extreme compression, similarity |

Speedups are relative to FP32 inference on the same hardware class; memory and accuracy are relative to the FP32 baseline.
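
As a rough sanity check on the Memory column, the footprint of an embedding table scales linearly with bit width. The helper below is a hypothetical illustration (its name is not from any library) and ignores the small overhead of scales and zero-points:

```python
def embedding_memory_mb(num_vectors, dim, bits):
    """Approximate storage for a num_vectors × dim embedding table at a given bit width."""
    return num_vectors * dim * bits / 8 / 1e6

# 1M vectors of dimension 384
for bits in (32, 16, 8, 4, 1):
    print(f"{bits:>2}-bit: {embedding_memory_mb(1_000_000, 384, bits):7.1f} MB")
```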

Implementation Recommendations

Getting Started

  • Start with INT8 for balanced trade-off
  • Use symmetric quantization initially
  • Profile on target hardware
  • Calibrate with representative data

Common Pitfalls

  • Not handling outliers properly
  • Ignoring hardware capabilities
  • Over-aggressive quantization
  • Poor calibration dataset

Advanced Techniques

  • Mixed precision per layer
  • Quantization-aware training
  • Learned quantization params
  • Knowledge distillation (see the sketch below)
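
Knowledge distillation is often used to recover accuracy lost to aggressive quantization: the full-precision model acts as a teacher for the quantized student. A minimal sketch of a standard distillation loss, assuming you already have teacher and student logits (parameter names here are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft teacher targets with the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```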

Understanding Quantization

Quantization maps continuous values to discrete levels:

Q(x) = \text{round}\left(\frac{x - \text{min}}{\text{scale}}\right) \times \text{scale} + \text{min}

Where:

  • \text{scale} = \dfrac{\text{max} - \text{min}}{2^{\text{bits}} - 1}
  • Lower bits = fewer discrete levels
  • Higher compression = more information loss
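
A minimal NumPy sketch of this round-trip (an illustration of the formula, not a library API):

```python
import numpy as np

def quantize_dequantize(x, bits=8):
    """Map values onto 2**bits evenly spaced levels and back, per the formula above."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (2**bits - 1)
    q = np.round((x - x_min) / scale)   # integer level in [0, 2**bits - 1]
    return q * scale + x_min            # dequantized approximation of x

x = np.array([-0.91, -0.43, 0.0, 0.41, 0.87])
print(quantize_dequantize(x, bits=4))   # close to x, with small rounding error
```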

Quantization Methods

1. Float16 (Half Precision)

16 bits: 1 sign + 5 exponent + 10 mantissa

Original:   0.123456789 (float32)
Quantized:  0.1235      (float16)
Memory:     50% reduction
Accuracy:   ~99.5% preserved
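
This is easy to verify directly, e.g. with NumPy:

```python
import numpy as np

x = np.float32(0.123456789)
half = x.astype(np.float16)

print(half)                   # ~0.1235: only ~3 significant decimal digits survive
print(x.nbytes, half.nbytes)  # 4 2  (bytes per value, i.e. a 50% reduction)
```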

2. Int8 Quantization

8 bits: Maps to [-128, 127]

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    # Affine quantization
    q = np.round(x / scale + zero_point)
    q = np.clip(q, -128, 127).astype(np.int8)
    return q

def dequantize_int8(q, scale, zero_point):
    return scale * (q - zero_point)
```

3. Int4 Quantization

4 bits: Maps to [-8, 7]

  • 93.75% memory reduction
  • Good for inference on edge devices
  • Requires careful calibration (see the sketch below)
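
A minimal symmetric INT4 sketch (illustrative only; real kernels pack two 4-bit values per byte):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric 4-bit quantization to the integer range [-8, 7]."""
    scale = np.abs(x).max() / 7          # map the largest magnitude to 7
    q = np.clip(np.round(x / scale), -8, 7)
    return q.astype(np.int8), scale      # int8 container; bit-packing is a storage detail

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale
```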

4. Binary Quantization

1 bit: Only sign matters

```python
def binary_quantize(x):
    return np.sign(x)  # Returns -1 or +1 (exact zeros stay 0)

# Similarity in binary space
def binary_similarity(b1, b2):
    # Fraction of matching signs (1 - normalized Hamming distance)
    return np.sum(b1 == b2) / len(b1)
```

Quantization Schemes

Symmetric vs Asymmetric

Symmetric Quantization:

```python
# Zero point at the origin
scale = max(abs(x_min), abs(x_max)) / (2**(bits - 1) - 1)
q = round(x / scale)
```

Asymmetric Quantization:

```python
# Arbitrary zero point
scale = (x_max - x_min) / (2**bits - 1)
zero_point = round(-x_min / scale)
q = round(x / scale) + zero_point
```

Per-Tensor vs Per-Channel

```python
# Per-tensor: a single scale for the entire tensor
scale = compute_scale(tensor)
quantized = quantize(tensor, scale)

# Per-channel: a different scale per channel
scales = [compute_scale(tensor[i]) for i in range(channels)]
quantized = [quantize(tensor[i], scales[i]) for i in range(channels)]
```

Implementation Examples

Post-Training Quantization

```python
import torch
import torch.nn as nn

def quantize_model_weights(model, bits=8):
    """Quantize model weights after training."""
    for name, param in model.named_parameters():
        if 'weight' in name:
            # Calculate quantization parameters
            min_val = param.min()
            max_val = param.max()
            scale = (max_val - min_val) / (2**bits - 1)
            zero_point = -min_val / scale

            # Quantize and dequantize
            quantized = torch.round(param / scale + zero_point)
            quantized = torch.clamp(quantized, 0, 2**bits - 1)
            dequantized = (quantized - zero_point) * scale

            # Replace weights
            param.data = dequantized
```

Quantization-Aware Training

```python
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        if self.training:
            # Compute scale from the current weight range
            w_min, w_max = self.weight.min(), self.weight.max()
            scale = (w_max - w_min) / (2**self.bits - 1)
            # Quantize and dequantize
            w_quant = torch.round(self.weight / scale) * scale
            # Straight-through estimator for gradients
            w_quant = self.weight + (w_quant - self.weight).detach()
        else:
            w_quant = self.weight
        return F.linear(x, w_quant)
```

Performance Analysis

Memory Savings

| Method | Bits | Memory | Relative Size |
|---|---|---|---|
| Float32 | 32 | 100% | 1.00× |
| Float16 | 16 | 50% | 0.50× |
| Int8 | 8 | 25% | 0.25× |
| Int4 | 4 | 12.5% | 0.125× |
| Binary | 1 | 3.125% | 0.03125× |

Accuracy Impact

Typical accuracy retention:

  • Float32 → Float16: ~99.5%
  • Float32 → Int8: 98-99%
  • Float32 → Int4: 95-97%
  • Float32 → Binary: 85-90%
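
For embeddings, these retention figures are usually estimated by comparing original and dequantized vectors, e.g. via mean cosine similarity (a small illustrative helper, not a standard API):

```python
import numpy as np

def cosine_retention(original, dequantized):
    """Mean cosine similarity between matching rows of two embedding matrices."""
    a = original / np.linalg.norm(original, axis=1, keepdims=True)
    b = dequantized / np.linalg.norm(dequantized, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```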

Speed Improvements

```python
# Benchmark example
import time
import torch

def benchmark_inference(model, input_data, quantized=False):
    if quantized:
        model = quantize_model(model)  # placeholder for your quantization routine
    start = time.time()
    with torch.no_grad():
        for _ in range(1000):
            output = model(input_data)
    return time.time() - start

# Results (typical)
# Float32: 1.0s
# Int8:    0.3s (3.3× faster)
# Int4:    0.2s (5× faster)
```

Advanced Techniques

1. Mixed Precision

Different precision for different layers:

```python
config = {
    'attention': 8,     # Int8 for attention
    'ffn': 4,           # Int4 for feed-forward
    'embeddings': 16,   # Float16 for embeddings
}
```

2. Dynamic Quantization

Quantize activations on-the-fly:

```python
import torch
import torch.nn as nn

model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},         # Layers to quantize
    dtype=torch.qint8,
)
```

3. Learned Quantization

Learn optimal quantization parameters:

```python
class LearnedQuantizer(nn.Module):
    def __init__(self, bits=8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.zero_point = nn.Parameter(torch.zeros(1))
        self.bits = bits

    def forward(self, x):
        # Learned affine transformation
        # (round() has zero gradient, so in practice it is paired with a
        #  straight-through estimator so that scale/zero_point can be learned)
        q = torch.round(x / self.scale + self.zero_point)
        q = torch.clamp(q, 0, 2**self.bits - 1)
        return (q - self.zero_point) * self.scale
```

Quantization for Embeddings

Embedding Table Quantization

```python
import torch
import torch.nn as nn

class QuantizedEmbedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, bits=8):
        super().__init__()
        # Store quantized embedding codes as integers
        # (a buffer, not a Parameter: integer tensors cannot carry gradients)
        self.register_buffer(
            'embeddings',
            torch.randint(0, 2**bits, (num_embeddings, embedding_dim), dtype=torch.uint8)
        )
        self.scale = nn.Parameter(torch.ones(embedding_dim))
        self.zero_point = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, indices):
        # Lookup and dequantize
        quantized = self.embeddings[indices].float()
        return (quantized - self.zero_point) * self.scale
```

Product Quantization

Split vectors and quantize separately:

```python
from sklearn.cluster import KMeans

def product_quantization(vectors, num_subvectors=8, bits=8):
    """Quantize vectors using product quantization."""
    D = vectors.shape[1]
    d = D // num_subvectors
    quantized = []
    codebooks = []
    for i in range(num_subvectors):
        # Extract subvector
        subvecs = vectors[:, i*d:(i+1)*d]
        # Learn codebook (k-means)
        kmeans = KMeans(n_clusters=2**bits)
        labels = kmeans.fit_predict(subvecs)
        quantized.append(labels)
        codebooks.append(kmeans.cluster_centers_)
    return quantized, codebooks
```

Best Practices

1. Calibration

Determine optimal scale/zero-point:

```python
def calibrate_quantization(data_loader, model):
    """Find optimal quantization parameters."""
    min_vals, max_vals = {}, {}
    for batch in data_loader:
        output = model(batch)
        # Track the observed range of each parameter tensor
        # (activation calibration would hook intermediate outputs instead)
        for name, tensor in model.named_parameters():
            if name not in min_vals:
                min_vals[name] = tensor.min()
                max_vals[name] = tensor.max()
            else:
                min_vals[name] = min(min_vals[name], tensor.min())
                max_vals[name] = max(max_vals[name], tensor.max())
    return min_vals, max_vals
```

2. Outlier Handling

```python
import numpy as np

def clip_outliers(tensor, percentile=99.9):
    """Clip outliers before quantization."""
    threshold = np.percentile(np.abs(tensor), percentile)
    return np.clip(tensor, -threshold, threshold)
```

3. Error Compensation

```python
def quantize_with_error_compensation(weights, bits=8):
    """Accumulate and compensate quantization errors."""
    error = 0
    quantized = []
    for w in weights:
        # Add accumulated error from previous elements
        w_compensated = w + error
        # Quantize (quantize() is a placeholder for a quantize-dequantize step)
        q = quantize(w_compensated, bits)
        # Compute new error
        error = w_compensated - q
        quantized.append(q)
    return quantized
```

Deployment Considerations

Mobile/Edge Deployment

```python
# TensorFlow Lite example
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
```

Hardware Acceleration

  • ARM: Int8 with NEON
  • x86: Int8 with AVX-512 VNNI (see the backend sketch below)
  • GPU: Int8 Tensor Cores
  • TPU: Bfloat16 native
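
In PyTorch, for example, the quantized-kernel backend can be matched to the target CPU; this is just an illustration of the idea (availability depends on how PyTorch was built):

```python
import torch

# Choose the quantized-kernel backend for the target CPU
torch.backends.quantized.engine = 'fbgemm'     # x86 servers (uses AVX-512 VNNI where available)
# torch.backends.quantized.engine = 'qnnpack'  # ARM / mobile (NEON)
```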

