Quantization Deep Dive: From FP32 to INT4 - The Complete Guide

Master neural network quantization with interactive visualizations. Explore QAT, PTQ, GPTQ, AWQ, and SmoothQuant methods for efficient model deployment.


Introduction

As neural networks grow from millions to billions of parameters, deployment becomes increasingly challenging. A 7B parameter model in FP32 requires 28GB of memory just for weights - exceeding most consumer GPUs. Enter quantization: the art of reducing numerical precision while preserving model accuracy.

This deep dive explores the journey from FP32 to INT4, examining state-of-the-art quantization techniques that enable running GPT-scale models on edge devices. Through interactive visualizations, we'll understand how modern quantization methods achieve 8x compression with minimal accuracy loss.

Interactive Learning: This article features 10+ interactive demos to help you understand quantization concepts. Each visualization lets you experiment with parameters and see their effects in real-time.

The Quantization Landscape

Quantization transforms high-precision floating-point weights and activations into lower-precision representations. But it's not just about reducing bits - it's about intelligently preserving the information that matters most for model performance.

Precision Comparison: FP32 to INT4

(Interactive demo: select a precision format to see its bit layout, representable range, and the effect of quantization on a sample value such as 3.14159.)

Model Size Impact (7B Parameter Model)

  • FP32: 28.0 GB (baseline)
  • FP16: 14.0 GB (2x smaller)
  • INT8: 7.0 GB (4x smaller)
  • INT4: 3.5 GB (8x smaller)

Understanding Numerical Precision

Before diving into quantization methods, let's understand what we're actually compressing and why it works.

Floating Point vs Integer Representation

FP32 (Float32): 1 sign bit, 8 exponent bits, 23 mantissa bits

  • Range: ±3.4 × 10³⁸
  • Precision: ~7 decimal digits
  • Memory: 4 bytes per weight

FP16 (Float16): 1 sign bit, 5 exponent bits, 10 mantissa bits

  • Range: ±65,504
  • Precision: ~3 decimal digits
  • Memory: 2 bytes per weight

INT8: 8-bit signed integer

  • Range: -128 to 127
  • Precision: Exact integers
  • Memory: 1 byte per weight

INT4: 4-bit signed integer

  • Range: -8 to 7
  • Precision: Exact integers
  • Memory: 0.5 bytes per weight
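
These per-weight byte counts translate directly into model footprints. A quick back-of-the-envelope sketch in plain Python reproduces the 7B-parameter sizes shown above:

BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

num_params = 7e9  # 7B parameter model, weights only
for fmt, nbytes in BYTES_PER_WEIGHT.items():
    size_gb = num_params * nbytes / 1e9
    print(f"{fmt}: {size_gb:.1f} GB ({4 / nbytes:.0f}x smaller than FP32)")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB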

Why Quantization Works

Neural networks are surprisingly robust to reduced precision because:

  1. Redundancy: Networks have redundant parameters
  2. Noise Tolerance: Training introduces noise resilience
  3. Limited Precision Need: Most weights cluster around zero
  4. Activation Patterns: Only certain neurons fire for given inputs

Weight Distribution Analyzer

(Interactive demo: adjust the mean, standard deviation, and skewness of a synthetic weight distribution and inspect how it quantizes.)

Weights concentrated near zero are ideal for quantization. Outliers (far from zero) can cause accuracy loss.
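
This sensitivity to outliers is easy to reproduce. The sketch below (assuming simple symmetric per-tensor INT8 quantization) quantizes a Gaussian weight tensor with and without a single injected outlier; the outlier stretches the quantization scale and inflates the error for every other weight:

import torch

def quantize_int8(w):
    scale = w.abs().max() / 127          # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -127, 127)
    return w_q * scale                   # dequantized reconstruction

torch.manual_seed(0)
weights = torch.randn(10_000) * 0.02     # typical near-zero weight distribution
clean_rmse = (quantize_int8(weights) - weights).pow(2).mean().sqrt()

outlier_weights = weights.clone()
outlier_weights[0] = 2.0                 # a single large outlier
dirty_rmse = (quantize_int8(outlier_weights) - outlier_weights).pow(2).mean().sqrt()

print(clean_rmse.item(), dirty_rmse.item())  # error grows sharply with the outlier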

Quantization Fundamentals

The Quantization Equation

The core of quantization is a simple linear transformation:

Quantized = round(Original / Scale + ZeroPoint)
Dequantized = (Quantized - ZeroPoint) × Scale

Where:

  • Scale: Determines the step size between quantized values
  • Zero Point: Aligns the quantization grid with the data distribution
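
In code, the two equations are only a few lines. A minimal sketch of asymmetric UINT8 quantization, with the scale and zero point derived from the tensor's min/max:

import torch

def quantize(x, scale, zero_point):
    return torch.clamp(torch.round(x / scale + zero_point), 0, 255)

def dequantize(x_q, scale, zero_point):
    return (x_q - zero_point) * scale

x = torch.tensor([-1.0, -0.2, 0.0, 0.5, 1.5])
scale = (x.max() - x.min()) / 255
zero_point = torch.round(-x.min() / scale)

x_q = quantize(x, scale, zero_point)
x_hat = dequantize(x_q, scale, zero_point)
print(x_q)           # integer codes in [0, 255]
print(x_hat - x)     # small reconstruction (quantization) error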

Symmetric vs Asymmetric Quantization

Symmetric Quantization:

  • Zero point is always 0
  • Range: [-127, 127] for INT8
  • Simpler hardware implementation
  • May waste range if distribution is skewed

Asymmetric Quantization:

  • Zero point can be any value
  • Range: [-128, 127] for INT8
  • Better utilization of quantization range
  • More complex but often more accurate
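
The difference shows up directly in how the scale and zero point are computed. A minimal per-tensor INT8 sketch using the conventions above:

import torch

def symmetric_params(x):
    scale = x.abs().max() / 127              # zero point fixed at 0
    return scale, 0

def asymmetric_params(x):
    # Map x.min() -> -128 and x.max() -> 127
    scale = (x.max() - x.min()) / 255
    zero_point = int(torch.round(-128 - x.min() / scale))
    return scale, zero_point

acts = torch.rand(1000) * 6.0                # ReLU-style, strictly non-negative activations
s_sym, _ = symmetric_params(acts)
s_asym, zp = asymmetric_params(acts)
print(s_sym, s_asym)  # asymmetric step is ~2x finer: no range wasted on negatives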

Quantization Error Analysis

(Interactive demo: sweep the bit width from 2 to 8 bits and observe the number of quantization levels, the scale factor, the zero point, and the average/RMS quantization error.)

Post-Training Quantization (PTQ)

PTQ quantizes an already-trained model without retraining. It's fast and simple but may suffer accuracy loss for aggressive quantization.

Basic PTQ Pipeline

  1. Calibration: Run representative data through the model
  2. Statistics Collection: Gather min/max or percentile statistics
  3. Scale Calculation: Compute optimal scales for each layer
  4. Quantization: Convert weights and activations
  5. Validation: Check accuracy degradation
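
Put together, the pipeline fits in a short script. The sketch below simulates weight-only min/max PTQ on a model's linear layers; a full implementation would also calibrate activation ranges on representative data (steps 1-2), and the eval_fn placeholder stands in for your own validation metric:

import torch
import torch.nn as nn

def ptq_minmax(model, eval_fn):
    """Simulated weight-only PTQ with min/max calibration."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            scale = (w.max() - w.min()) / 255
            zero_point = torch.round(-w.min() / scale)
            w_q = torch.clamp(torch.round(w / scale + zero_point), 0, 255)
            # Replace weights with their dequantized (fake-quantized) values
            module.weight.data = (w_q - zero_point) * scale
    # Step 5: check accuracy degradation against the FP32 baseline
    return eval_fn(model)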

Calibration Methods

Min-Max Calibration:

def minmax_calibration(tensor):
    min_val = tensor.min()
    max_val = tensor.max()
    scale = (max_val - min_val) / 255  # For INT8
    zero_point = int(torch.round(-min_val / scale))
    return scale, zero_point

Percentile Calibration:

def percentile_calibration(tensor, percentile=99.9):
    min_val = torch.quantile(tensor, (100 - percentile) / 100)
    max_val = torch.quantile(tensor, percentile / 100)
    scale = (max_val - min_val) / 255
    zero_point = int(torch.round(-min_val / scale))
    return scale, zero_point

Entropy Calibration (KL Divergence):

def entropy_calibration(tensor, num_bins=2048):
    # Build histogram
    hist, bins = torch.histogram(tensor, bins=num_bins)

    # Find threshold that minimizes KL divergence
    best_threshold = find_optimal_threshold(hist, bins)

    scale = best_threshold / 127  # Symmetric quantization
    return scale, 0

Quantization-Aware Training (QAT)

QAT simulates quantization during training, allowing the network to adapt to reduced precision. This typically yields better accuracy than PTQ, especially for low-bit quantization.

QAT vs PTQ Comparison (interactive demo)

QAT Forward Pass

During QAT, we inject fake quantization operations:

class FakeQuantize(nn.Module):
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits
        self.scale = nn.Parameter(torch.tensor(1.0))
        self.zero_point = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        if self.training:
            # Fake quantize: quantize then dequantize
            x_q = torch.round(x / self.scale + self.zero_point)
            x_q = torch.clamp(x_q, 0, 2**self.num_bits - 1)
            x_dq = (x_q - self.zero_point) * self.scale

            # Straight-through estimator for gradients
            return x + (x_dq - x).detach()
        else:
            # Real quantization during inference
            return self.quantize(x)

Learnable Quantization Parameters

Modern QAT methods learn optimal scales and zero points:

class LearnedQuantization(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_features))
        self.zero_point = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Per-channel quantization
        scale = F.softplus(self.scale)  # Ensure positive
        zero_point = self.zero_point

        x_q = torch.round(x / scale + zero_point)
        x_q = torch.clamp(x_q, -128, 127)
        return (x_q - zero_point) * scale

Advanced Quantization Methods

1. GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ uses layer-wise quantization with Hessian-based optimization to minimize reconstruction error. It's particularly effective for large language models.

GPTQ Layer-wise Quantization

(Interactive demo: weights are processed layer by layer in blocks, guided by the Hessian H = X^T X of the calibration activations, with the objective min ||WX - W_q X||², where W are the original and W_q the quantized weights.)

The GPTQ Algorithm

GPTQ solves an optimization problem for each layer:

minimize ||WX - W_quantized × X||²

Key innovations:

  1. Layer-wise Quantization: Process one layer at a time
  2. Hessian Awareness: Use second-order information
  3. Lazy Batch Updates: Update weights in blocks
  4. Cholesky Decomposition: Efficient inverse computation
def gptq_quantize_layer(W, X, num_bits=4):
    """
    W: Weight matrix [out_features, in_features]
    X: Calibration data [batch_size, in_features]
    """
    # Compute Hessian H = X^T X
    H = X.T @ X
    # Add damping for numerical stability
    H_inv = torch.inverse(H + 0.01 * torch.eye(H.shape[0]))

    # Initialize quantized weights
    W_q = torch.zeros_like(W)

    # Process weights in blocks
    block_size = 128
    for i in range(0, W.shape[1], block_size):
        block = W[:, i:i+block_size]

        # Compute optimal quantization for this block
        scale = compute_optimal_scale(block)
        W_q[:, i:i+block_size] = quantize(block, scale, num_bits)

        # Update remaining weights to compensate for the block's error
        if i + block_size < W.shape[1]:
            error = (block - W_q[:, i:i+block_size]) @ H_inv[i:i+block_size, i+block_size:]
            W[:, i+block_size:] -= error

    return W_q

2. AWQ (Activation-aware Weight Quantization)

AWQ recognizes that not all weights are equally important - those processing salient activations need higher precision.

AWQ: Activation-aware Weight Quantization

(Interactive demo: collect activations, compute per-channel importance, apply protective scaling, then quantize the weights.)

AWQ Key Insight: Weights corresponding to larger activations are more important. By scaling these weights up before quantization and scaling activations down accordingly, we preserve model accuracy while achieving aggressive quantization.

AWQ Key Insights

  1. Salient Weight Detection: Identify weights that process important activations
  2. Per-Channel Scaling: Apply different scales to different channels
  3. Activation-Aware: Use activation statistics to guide quantization
def awq_quantize(model, calibration_data):
    # Step 1: Identify salient weights
    salience_scores = compute_salience(model, calibration_data)

    # Step 2: Compute per-channel scales
    for layer in model.layers:
        # Find channels with high salience
        important_channels = salience_scores[layer] > threshold

        # Apply protective scaling
        scale = torch.ones(layer.out_features)
        scale[important_channels] *= protection_factor

        # Quantize with adjusted scales
        layer.weight = quantize_with_scale(layer.weight, scale)

3. SmoothQuant

SmoothQuant addresses the challenge of activation quantization by smoothing activation outliers into weights.

SmoothQuant: Outlier Smoothing

(Interactive demo: sweep the smoothing strength α and watch per-channel activation outliers shrink as quantization difficulty migrates into the weights.)

SmoothQuant Formula:
X̂ = X · diag(s)⁻¹
Ŵ = diag(s) · W
Y = X̂ · Ŵ = X · W (mathematically equivalent)

Here s is the per-channel smoothing factor, and α controls the migration strength (0 = no smoothing, 1 = maximum smoothing).

The Smoothing Transform

SmoothQuant migrates quantization difficulty from activations to weights:

Y = (W × diag(s)) × (diag(s)^(-1) × X) = W' × X'

Where s is a per-channel smoothing factor:

def compute_smoothing_factor(W, X, alpha=0.5):
    """
    Balance quantization difficulty between weights and activations
    """
    # Compute per-channel statistics
    w_max = W.abs().max(dim=0).values
    x_max = X.abs().max(dim=0).values

    # Smoothing factor
    s = (x_max / w_max) ** alpha

    # Apply smoothing
    W_smooth = W * s.unsqueeze(0)
    X_smooth = X / s.unsqueeze(0)

    return W_smooth, X_smooth, s
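
As a quick sanity check, the sketch below (using the compute_smoothing_factor above on random tensors, with the weight laid out as [out_features, in_features] like nn.Linear) verifies that smoothing leaves the layer output unchanged while shrinking the worst activation outlier:

import torch

torch.manual_seed(0)
X = torch.randn(32, 64)              # activations [batch, in_features]
X[:, 0] *= 20.0                      # inject an outlier channel
W = torch.randn(16, 64) * 0.02       # linear weight [out_features, in_features]

W_smooth, X_smooth, s = compute_smoothing_factor(W, X, alpha=0.5)

# The layer output is mathematically unchanged by smoothing...
assert torch.allclose(X @ W.T, X_smooth @ W_smooth.T, atol=1e-3)
# ...but the worst activation outlier shrinks, making X easier to quantize
print(X.abs().max().item(), X_smooth.abs().max().item())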

4. Mixed-Precision Quantization

Not all layers need the same precision. Mixed-precision quantization assigns different bit-widths based on sensitivity analysis.

Mixed Precision Quantization

(Interactive demo: assign a precision to each layer and observe the resulting compressed size, compression ratio, and estimated accuracy. Example sensitivities from the demo: Embedding 95%, Attention blocks 78-85%, FFN blocks 40-45%, Layer Norm 88%, Output Head 92%.)
Mixed Precision Strategy: Different layers have varying sensitivity to quantization. Attention layers typically need higher precision (FP16/INT8) while FFN layers can often use INT4 without significant quality loss. The sensitivity score indicates how much each layer affects the final output.
def sensitivity_analysis(model, calibration_data):
    sensitivities = {}
    for name, layer in model.named_modules():
        if isinstance(layer, nn.Linear):
            # Quantize this layer
            original_weight = layer.weight.clone()
            layer.weight.data = quantize(layer.weight, bits=4)

            # Measure accuracy drop
            accuracy_drop = evaluate(model) - baseline_accuracy
            sensitivities[name] = accuracy_drop

            # Restore original weight
            layer.weight.data = original_weight
    return sensitivities

def assign_bit_widths(sensitivities, bit_budget):
    # Assign more bits to sensitive layers
    sorted_layers = sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)

    bit_assignment = {}
    for layer, sensitivity in sorted_layers:
        if sensitivity > threshold:
            bit_assignment[layer] = 8  # Keep sensitive layers at 8-bit
        else:
            bit_assignment[layer] = 4  # Aggressive quantization for others
    return bit_assignment

INT4 Quantization: Pushing the Limits

INT4 quantization achieves 8x compression but requires sophisticated techniques to maintain accuracy.

Challenges of INT4

  1. Limited Range: Only 16 unique values
  2. Quantization Noise: High relative error
  3. Gradient Instability: Difficult to train
  4. Outlier Sensitivity: Single outliers can dominate range
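
The first two challenges are easy to quantify: going from 8 to 4 bits shrinks the number of representable levels from 256 to 16, and the round-off error on the same tensor grows by more than an order of magnitude. A small sketch, assuming symmetric per-tensor quantization:

import torch

def quantize_error(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_hat = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return (w_hat - w).pow(2).mean().sqrt()

torch.manual_seed(0)
w = torch.randn(4096) * 0.02
for bits in (8, 6, 4, 2):
    print(bits, "bits ->", quantize_error(w, bits).item())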

Group-wise Quantization

To handle INT4's limitations, we use group-wise quantization:

class GroupwiseQuantization:
    def __init__(self, group_size=128, bits=4):
        self.group_size = group_size
        self.bits = bits

    def quantize(self, tensor):
        # Reshape into groups
        orig_shape = tensor.shape
        tensor = tensor.reshape(-1, self.group_size)

        # Quantize each group independently
        scales = []
        quantized_groups = []
        for group in tensor:
            scale = group.abs().max() / (2**(self.bits - 1) - 1)
            scales.append(scale)

            q_group = torch.round(group / scale)
            q_group = torch.clamp(q_group, -8, 7)  # INT4 range
            quantized_groups.append(q_group)

        return quantized_groups, scales

Bit Packing for INT4

Efficient storage requires packing two INT4 values into one byte:

INT4 Bit Packing Visualization

(Interactive demo: eight FP32 weights (32 bytes) are quantized to INT4, converted to binary, and packed two-per-byte into 4 bytes, an 8x reduction.)
Bit Packing: INT4 quantization uses only 4 bits per weight. By packing two INT4 values into a single byte (8 bits), we achieve 8x memory reduction compared to FP32. Modern hardware can efficiently unpack these values during computation.
def pack_int4(tensor):
    """Pack two INT4 values into one INT8"""
    assert tensor.shape[-1] % 2 == 0
    # Work in a wider dtype so the shift cannot overflow
    tensor = tensor.to(torch.int32)

    # Reshape to separate pairs
    tensor = tensor.reshape(-1, 2)

    # Pack pairs into bytes: low nibble from the first value, high nibble from the second
    packed = (tensor[:, 0] & 0xF) | ((tensor[:, 1] & 0xF) << 4)
    return packed.to(torch.int8)

def unpack_int4(packed):
    """Unpack INT8 into two INT4 values"""
    # Extract lower 4 bits
    low = (packed & 0xF).to(torch.int8)
    low = torch.where(low > 7, low - 16, low)  # Sign extend

    # Extract upper 4 bits
    high = ((packed >> 4) & 0xF).to(torch.int8)
    high = torch.where(high > 7, high - 16, high)  # Sign extend

    return torch.stack([low, high], dim=-1).reshape(-1)
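
A quick round-trip sanity check for the helpers above (quantize a handful of weights to the INT4 range, pack, unpack, and confirm the codes survive):

import torch

weights = torch.randn(8) * 0.5
scale = weights.abs().max() / 7                       # symmetric INT4 scale
q = torch.clamp(torch.round(weights / scale), -8, 7).to(torch.int8)

packed = pack_int4(q)              # 8 INT4 values -> 4 bytes
restored = unpack_int4(packed)

assert torch.equal(restored, q)
print(packed.numel(), "bytes for", q.numel(), "weights")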

Quantization Method Comparison

Let's compare different quantization methods across various metrics:

Quantization Methods Comparison

(Interactive demo: rate PTQ, QAT, GPTQ, AWQ, SmoothQuant, and BitsAndBytes on accuracy, speed, compression, ease of use, and hardware support; supported bit widths range from 8-bit down to 2-4 bit depending on the method.)
Recommendation: For LLMs, start with AWQ for 4-bit quantization or SmoothQuant for INT8. Use QAT only when maximum accuracy is critical and you have the computational resources for retraining.
| Method         | Bits | Speed  | Accuracy  | Complexity | Best For                |
|----------------|------|--------|-----------|------------|-------------------------|
| PTQ Min-Max    | 8    | Fast   | Good      | Low        | Quick deployment        |
| PTQ Percentile | 8    | Fast   | Better    | Low        | Robust to outliers      |
| QAT            | 8/4  | Slow   | Best      | Medium     | Production models       |
| GPTQ           | 4    | Medium | Excellent | High       | Large models            |
| AWQ            | 4    | Medium | Excellent | Medium     | LLMs                    |
| SmoothQuant    | 8    | Fast   | Very Good | Low        | Activation quantization |
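
For a concrete starting point with the BitsAndBytes route from the comparison, the snippet below sketches 4-bit NF4 loading through the Hugging Face transformers integration (the model name is a placeholder; GPTQ- and AWQ-quantized checkpoints load through the same from_pretrained interface when the corresponding backends are installed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with FP16 compute: a common low-memory inference setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on the Hub
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)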

Perplexity vs Model Size Tradeoffs

Understanding the relationship between compression and accuracy is crucial for deployment decisions:

Perplexity vs Model Size Trade-off

(Interactive chart plotting perplexity against model size for FP32, FP16, INT8 PTQ, INT8 QAT, GPTQ, AWQ, INT3, and INT2. In the demo's example, going from FP32 to AWQ 4-bit reduces size by about 85% while perplexity rises by only about +0.13, making AWQ 4-bit the best quality/size trade-off shown.)
Reading the Chart: Points closer to the bottom-left corner represent better trade-offs (smaller size, lower perplexity). The green region indicates the optimal zone where models maintain good quality while achieving significant compression.

Empirical Results on Common Models

GPT-2 (117M parameters):

  • FP32: Perplexity 29.41, Size: 468MB
  • INT8: Perplexity 29.52, Size: 117MB
  • INT4: Perplexity 30.14, Size: 58.5MB

LLaMA-7B:

  • FP32: Perplexity 5.68, Size: 28GB
  • INT8: Perplexity 5.71, Size: 7GB
  • INT4 (GPTQ): Perplexity 5.85, Size: 3.5GB
  • INT4 (AWQ): Perplexity 5.78, Size: 3.5GB

OPT-175B:

  • FP32: Perplexity 8.34, Size: 700GB
  • INT8: Perplexity 8.38, Size: 175GB
  • INT4: Perplexity 8.51, Size: 87.5GB

Implementation Best Practices

1. Calibration Data Selection

def select_calibration_data(dataset, num_samples=1000):
    """Select representative calibration samples"""
    # Strategy 1: Random sampling
    random_samples = random.sample(dataset, num_samples)

    # Strategy 2: Diverse sampling (maximize coverage)
    diverse_samples = []
    embeddings = compute_embeddings(dataset)

    # K-means clustering for diversity
    clusters = KMeans(n_clusters=num_samples).fit(embeddings)
    for center in clusters.cluster_centers_:
        closest_idx = find_nearest(embeddings, center)
        diverse_samples.append(dataset[closest_idx])

    return diverse_samples

2. Layer-wise Bit Assignment

def optimize_bit_assignment(model, target_size_mb):
    """Find optimal per-layer bit widths given size constraint"""
    layer_sizes = get_layer_sizes(model)
    layer_sensitivities = compute_sensitivities(model)

    # Dynamic programming solution
    dp = {}  # (layer_idx, remaining_budget) -> (accuracy, assignment)

    def solve(idx, budget):
        if idx == len(layer_sizes):
            return 0, []
        if (idx, budget) in dp:
            return dp[(idx, budget)]

        best_accuracy = -float('inf')
        best_assignment = []

        # Try different bit widths
        for bits in [4, 6, 8]:
            size = layer_sizes[idx] * bits / 8
            if size <= budget:
                accuracy_loss = layer_sensitivities[idx][bits]
                future_acc, future_assign = solve(idx + 1, budget - size)
                total_acc = -accuracy_loss + future_acc

                if total_acc > best_accuracy:
                    best_accuracy = total_acc
                    best_assignment = [bits] + future_assign

        dp[(idx, budget)] = (best_accuracy, best_assignment)
        return best_accuracy, best_assignment

    return solve(0, target_size_mb)

3. Quantization Pipeline

Dynamic vs Static Quantization

Quantization Adaptation Over Time

(Interactive demo: the effective bit width, average bits, and compression ratio update batch by batch as activation statistics accumulate.)

Dynamic Quantization

Advantages: Adapts to data distribution, better accuracy, handles outliers
Trade-offs: Runtime overhead, requires statistics computation
class QuantizationPipeline:
    def __init__(self, method='gptq', bits=4):
        self.method = method
        self.bits = bits

    def quantize_model(self, model, calibration_loader):
        # Step 1: Prepare model
        model.eval()

        # Step 2: Collect statistics (only needed for data-aware methods)
        statistics = None
        if self.method in ['gptq', 'awq']:
            statistics = self.collect_activation_statistics(
                model, calibration_loader
            )

        # Step 3: Apply quantization
        quantized_model = self.apply_quantization(model, statistics)

        # Step 4: Verify accuracy
        accuracy = self.validate(quantized_model, calibration_loader)

        return quantized_model, accuracy

    def collect_activation_statistics(self, model, loader):
        statistics = {}
        hooks = []

        def hook_fn(module, input, output, name):
            if name not in statistics:
                statistics[name] = []
            statistics[name].append(output.detach())

        # Register hooks on all linear layers
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                hook = module.register_forward_hook(
                    lambda m, i, o, n=name: hook_fn(m, i, o, n)
                )
                hooks.append(hook)

        # Run calibration
        with torch.no_grad():
            for batch in loader:
                model(batch)

        # Remove hooks
        for hook in hooks:
            hook.remove()

        return statistics

Hardware Considerations

INT8 Hardware Support

Most modern hardware has native INT8 support:

INT8 Matrix Multiplication

(Interactive demo: a 3×3 by 3×2 FP32 matrix product is quantized to INT8, multiplied with integer arithmetic, dequantized, and compared against the FP32 result. Versus FP32, INT8 GEMM uses 4x less memory and runs 2-4x faster on modern CPUs thanks to hardware-optimized instructions.)
INT8 GEMM Optimization: Modern processors have specialized instructions for INT8 matrix multiplication (like AVX-512 VNNI on Intel, Tensor Cores on NVIDIA). These can perform multiple INT8 operations in a single cycle, providing significant speedup over FP32 while maintaining acceptable accuracy for most deep learning workloads.
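
To make the flow concrete, here is a minimal PyTorch simulation of the quantize → integer-multiply → dequantize sequence using per-tensor symmetric scales; production kernels (cuBLAS, oneDNN, and similar) run the integer math on the specialized hardware described below:

import torch

def int8_matmul(A, B):
    """Simulate quantized GEMM: quantize, integer-multiply, dequantize."""
    scale_a = A.abs().max() / 127
    scale_b = B.abs().max() / 127
    A_q = torch.clamp(torch.round(A / scale_a), -127, 127).to(torch.int32)
    B_q = torch.clamp(torch.round(B / scale_b), -127, 127).to(torch.int32)
    # Integer matmul with 32-bit accumulation, then dequantize the result
    return (A_q @ B_q).float() * scale_a * scale_b

A = torch.randn(3, 3)
B = torch.randn(3, 2)
print(torch.max((int8_matmul(A, B) - A @ B).abs()))  # small quantization error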

NVIDIA GPUs:

  • INT8 Tensor Cores (Turing+): up to ~8x throughput vs FP32
  • DP4A instruction: 4-element dot product
  • INT8 GEMM: Optimized matrix multiplication

Intel CPUs:

  • VNNI (Cascade Lake+): Vector Neural Network Instructions
  • AMX (Sapphire Rapids+): Advanced Matrix Extensions

ARM CPUs:

  • Dot Product instructions (ARMv8.2+)
  • Matrix Multiply instructions (ARMv8.6+)

INT4 Hardware Support

INT4 support is emerging:

NVIDIA GPUs:

  • Ada Lovelace: FP8 and INT4 Tensor Cores
  • Hopper: Transformer Engine with dynamic precision

Specialized Hardware:

  • Qualcomm Hexagon: INT4 for edge AI
  • Google TPU v4: INT4 systolic arrays

Debugging Quantization Issues

Common Problems and Solutions

Problem 1: Activation Outliers

def detect_outliers(activations, threshold=6.0):
    """Detect activation outliers using z-score"""
    mean = activations.mean()
    std = activations.std()
    z_scores = torch.abs((activations - mean) / std)

    outliers = z_scores > threshold
    if outliers.any():
        print(f"Found {outliers.sum()} outliers")
        # Apply clipping or smoothing
        activations = torch.clamp(activations,
                                  mean - threshold * std,
                                  mean + threshold * std)
    return activations

Problem 2: Quantization Bias

def correct_quantization_bias(weights, quantized_weights):
    """Correct systematic bias in quantization"""
    bias = (weights - quantized_weights).mean()

    # Adjust zero point to minimize bias
    corrected = quantized_weights + bias
    return corrected

Problem 3: Layer Collapse

def prevent_layer_collapse(model, min_scale=1e-5):
    """Prevent layers from quantizing to all zeros"""
    for module in model.modules():
        if hasattr(module, 'scale'):
            module.scale.data = torch.clamp(module.scale.data, min=min_scale)

Production Deployment

Quantization for Different Frameworks

PyTorch:

import torch.quantization as quant

# Dynamic quantization (easiest)
model_int8 = quant.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Static quantization (better performance)
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)
# ... run calibration ...
quant.convert(model, inplace=True)

TensorFlow/Keras:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = calibration_generator
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()

ONNX Runtime:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='model.onnx',
    model_output='model_int8.onnx',
    weight_type=QuantType.QInt8
)

Serving Quantized Models

import time

class QuantizedModelServer:
    def __init__(self, model_path, quantization_config):
        self.model = self.load_quantized_model(model_path)
        self.config = quantization_config

    def preprocess(self, input_data):
        # Scale inputs if needed
        if self.config.quantize_inputs:
            input_data = self.quantize_tensor(
                input_data,
                self.config.input_scale,
                self.config.input_zero_point
            )
        return input_data

    def inference(self, input_data):
        with torch.no_grad():
            # Run quantized inference
            output = self.model(input_data)

            # Dequantize output if needed
            if self.config.quantized_output:
                output = self.dequantize_tensor(
                    output,
                    self.config.output_scale,
                    self.config.output_zero_point
                )
        return output

    def benchmark(self, num_runs=100):
        dummy_input = torch.randn(1, 512)

        # Warmup
        for _ in range(10):
            self.inference(dummy_input)

        # Benchmark
        start = time.time()
        for _ in range(num_runs):
            self.inference(dummy_input)

        avg_latency = (time.time() - start) / num_runs * 1000
        return avg_latency

Future Directions

Emerging Techniques

  1. Learned Step Size Quantization (LSQ): Learning optimal quantization parameters end-to-end
  2. Mixed-Bit Networks: Different bits for different samples
  3. Gradient Quantization: Quantizing gradients for distributed training
  4. Outlier-Aware Quantization: Special handling for outlier weights/activations
  5. Neural Architecture Search for Quantization: Jointly optimizing architecture and quantization

Research Frontiers

Sub-4-bit Quantization:

  • Binary (1-bit) and ternary (2-bit) networks
  • Learned codebooks for extreme compression
  • Product quantization for large embeddings

Hardware-Software Co-design:

  • Custom quantization for specific hardware
  • Compiler optimizations for quantized models
  • Automated precision tuning

Conclusion

Quantization has evolved from a simple compression technique to a sophisticated field combining optimization theory, hardware design, and deep learning. Modern methods like GPTQ, AWQ, and SmoothQuant enable extreme compression while maintaining accuracy, making billion-parameter models accessible on consumer hardware.

The journey from FP32 to INT4 represents an 8x reduction in memory and often similar speedups in computation. As models continue to grow and edge deployment becomes critical, quantization will remain at the forefront of efficient AI.

Key takeaways:

  • Start with INT8: Often provides 4x compression with minimal accuracy loss
  • Use PTQ for speed: When you need quick deployment and can tolerate small accuracy drops
  • Apply QAT for quality: When accuracy is critical and you can afford retraining
  • Consider GPTQ/AWQ for LLMs: State-of-the-art methods for extreme compression
  • Profile everything: Measure latency, memory, and accuracy for your specific use case

References

  1. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
  2. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  3. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  4. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
  5. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  6. A Survey of Quantization Methods for Efficient Neural Network Inference
  7. Understanding and Overcoming the Challenges of Efficient Transformer Quantization
