Introduction
As neural networks grow from millions to billions of parameters, deployment becomes increasingly challenging. A 7B-parameter model in FP32 requires 28GB of memory for weights alone, more than most consumer GPUs can hold. Enter quantization: the art of reducing numerical precision while preserving model accuracy.
This deep dive explores the journey from FP32 to INT4, examining state-of-the-art quantization techniques that enable running GPT-scale models on edge devices. Through interactive visualizations, we'll understand how modern quantization methods achieve 8x compression with minimal accuracy loss.
Interactive Learning: This article features 10+ interactive demos to help you understand quantization concepts. Each visualization lets you experiment with parameters and see their effects in real-time.
The Quantization Landscape
Quantization transforms high-precision floating-point weights and activations into lower-precision representations. But it's not just about reducing bits - it's about intelligently preserving the information that matters most for model performance.
[Interactive demo: Precision Comparison (FP32 to INT4), showing the bit representation, the quantization effect, and the model size impact for a 7B-parameter model]
Understanding Numerical Precision
Before diving into quantization methods, let's understand what we're actually compressing and why it works.
Floating Point vs Integer Representation
FP32 (Float32): 1 sign bit, 8 exponent bits, 23 mantissa bits
- Range: ±3.4 × 10³⁸
- Precision: ~7 decimal digits
- Memory: 4 bytes per weight
FP16 (Float16): 1 sign bit, 5 exponent bits, 10 mantissa bits
- Range: ±65,504
- Precision: ~3 decimal digits
- Memory: 2 bytes per weight
INT8: 8-bit signed integer
- Range: -128 to 127
- Precision: Exact integers
- Memory: 1 byte per weight
INT4: 4-bit signed integer
- Range: -8 to 7
- Precision: Exact integers
- Memory: 0.5 bytes per weight
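To make these numbers concrete, here is a quick back-of-the-envelope helper (plain Python, defined here purely for illustration) that reproduces the model-size figures used throughout this article for a 7B-parameter model:

import_note = "no framework needed"

def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage size in GB (1 GB = 1e9 bytes; ignores metadata such as scales)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB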
Why Quantization Works
Neural networks are surprisingly robust to reduced precision because:
- Redundancy: Networks have redundant parameters
- Noise Tolerance: Training introduces noise resilience
- Limited Precision Need: Most weights cluster around zero
- Activation Patterns: Only certain neurons fire for given inputs
[Interactive demo: Weight Distribution Analyzer]
Weights concentrated near zero are ideal for quantization; outliers far from zero can cause accuracy loss.
Quantization Fundamentals
The Quantization Equation
The core of quantization is a simple linear transformation:
Quantized = round(Original / Scale + ZeroPoint)
Dequantized = (Quantized - ZeroPoint) × Scale
Where:
- Scale: Determines the step size between quantized values
- Zero Point: Aligns the quantization grid with the data distribution
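A minimal sketch of these two equations in PyTorch follows; the quantize/dequantize helpers and the signed [-128, 127] range are illustrative choices, not any particular library's API.

import torch

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Quantized = round(Original / Scale + ZeroPoint), clamped to the integer range
    return torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)

def dequantize(q, scale, zero_point):
    # Dequantized = (Quantized - ZeroPoint) × Scale
    return (q - zero_point) * scale

x = torch.randn(1000)
scale = (x.max() - x.min()) / 255                 # step size between adjacent levels
zero_point = torch.round(-128 - x.min() / scale)  # aligns x.min() with -128

x_q = quantize(x, scale, zero_point)
x_dq = dequantize(x_q, scale, zero_point)
print((x - x_dq).abs().max())  # worst-case error is about scale / 2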
Symmetric vs Asymmetric Quantization
Symmetric Quantization:
- Zero point is always 0
- Range: [-127, 127] for INT8
- Simpler hardware implementation
- May waste range if distribution is skewed
Asymmetric Quantization:
- Zero point can be any value
- Range: [-128, 127] for INT8
- Better utilization of quantization range
- More complex but often more accurate
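The difference is easiest to see numerically. The toy comparison below (illustrative code, assuming a skewed all-positive tensor) quantizes the same data both ways and reports the mean squared error that the error-analysis demo below lets you explore interactively.

import torch

def symmetric_int8(x):
    scale = x.abs().max() / 127                     # zero point fixed at 0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q * scale                                 # dequantized values

def asymmetric_int8(x):
    scale = (x.max() - x.min()) / 255
    zero_point = torch.round(-128 - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), -128, 127)
    return (q - zero_point) * scale

x = torch.rand(10000) * 5.0                          # skewed: all values in [0, 5]
print("symmetric MSE: ", ((x - symmetric_int8(x)) ** 2).mean())
print("asymmetric MSE:", ((x - asymmetric_int8(x)) ** 2).mean())
# Asymmetric wins here: symmetric quantization wastes the entire negative half of its range.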
[Interactive demo: Quantization Error Analysis]
Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without retraining. It's fast and simple but may suffer accuracy loss for aggressive quantization.
Basic PTQ Pipeline
- Calibration: Run representative data through the model
- Statistics Collection: Gather min/max or percentile statistics
- Scale Calculation: Compute optimal scales for each layer
- Quantization: Convert weights and activations
- Validation: Check accuracy degradation
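Tying the steps together, here is a hedged sketch of per-tensor, weight-only INT8 PTQ applied to the linear layers of a toy model. Steps 1-2 (activation calibration) matter once activations are quantized too, and the final comment marks where step 5's accuracy check would go against your own evaluation metric.

import torch
import torch.nn as nn

def ptq_linear_int8(layer: nn.Linear):
    """Quantize one layer's weights to INT8 and store them dequantized (simulated quantization)."""
    w = layer.weight.data
    scale = w.abs().max() / 127                            # step 3: per-tensor symmetric scale
    w_q = torch.clamp(torch.round(w / scale), -127, 127)   # step 4: quantize
    layer.weight.data = w_q * scale                        # keep FP storage, INT8 value grid
    return scale

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
scales = [ptq_linear_int8(m) for m in model.modules() if isinstance(m, nn.Linear)]
# Step 5: compare your accuracy metric on the quantized model against the FP32 baseline.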
Calibration Methods
Min-Max Calibration:
def minmax_calibration(tensor):
    min_val = tensor.min()
    max_val = tensor.max()
    scale = (max_val - min_val) / 255  # For INT8
    zero_point = round(-min_val / scale)
    return scale, zero_point
Percentile Calibration:
def percentile_calibration(tensor, percentile=99.9):
    min_val = torch.quantile(tensor, (100 - percentile) / 100)
    max_val = torch.quantile(tensor, percentile / 100)
    scale = (max_val - min_val) / 255
    zero_point = round(-min_val / scale)
    return scale, zero_point
Entropy Calibration (KL Divergence):
def entropy_calibration(tensor, num_bins=2048):
    # Build histogram
    hist, bins = torch.histogram(tensor, bins=num_bins)
    # Find threshold that minimizes KL divergence
    best_threshold = find_optimal_threshold(hist, bins)
    scale = best_threshold / 127  # Symmetric quantization
    return scale, 0
Quantization-Aware Training (QAT)
QAT simulates quantization during training, allowing the network to adapt to reduced precision. This typically yields better accuracy than PTQ, especially for low-bit quantization.
[Interactive demo: QAT vs PTQ Comparison]
QAT Forward Pass
During QAT, we inject fake quantization operations:
class FakeQuantize(nn.Module):
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits
        self.scale = nn.Parameter(torch.tensor(1.0))
        self.zero_point = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        if self.training:
            # Fake quantize: quantize then dequantize
            x_q = torch.round(x / self.scale + self.zero_point)
            x_q = torch.clamp(x_q, 0, 2**self.num_bits - 1)
            x_dq = (x_q - self.zero_point) * self.scale
            # Straight-through estimator for gradients
            return x + (x_dq - x).detach()
        else:
            # Real quantization during inference
            return self.quantize(x)
Learnable Quantization Parameters
Modern QAT methods learn optimal scales and zero points:
class LearnedQuantization(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_features))
        self.zero_point = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Per-channel quantization
        scale = F.softplus(self.scale)  # Ensure positive
        zero_point = self.zero_point
        x_q = torch.round(x / scale + zero_point)
        x_q = torch.clamp(x_q, -128, 127)
        return (x_q - zero_point) * scale
Advanced Quantization Methods
1. GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ uses layer-wise quantization with Hessian-based optimization to minimize reconstruction error. It's particularly effective for large language models.
[Interactive demo: GPTQ Layer-wise Quantization, showing per-layer processing and block-wise optimization]
The GPTQ Algorithm
GPTQ solves an optimization problem for each layer:
minimize ||WX - W_quantized × X||²
Key innovations:
- Layer-wise Quantization: Process one layer at a time
- Hessian Awareness: Use second-order information
- Lazy Batch Updates: Update weights in blocks
- Cholesky Decomposition: Efficient inverse computation
def gptq_quantize_layer(W, X, num_bits=4):
    """
    W: Weight matrix [out_features, in_features]
    X: Calibration data [batch_size, in_features]
    """
    # Compute Hessian H = X^T X
    H = X.T @ X

    # Add damping for numerical stability
    H_inv = torch.inverse(H + 0.01 * torch.eye(H.shape[0]))

    # Initialize quantized weights
    W_q = torch.zeros_like(W)

    # Process weights in blocks
    block_size = 128
    for i in range(0, W.shape[1], block_size):
        block = W[:, i:i+block_size]

        # Compute optimal quantization
        scale = compute_optimal_scale(block)
        W_q[:, i:i+block_size] = quantize(block, scale, num_bits)

        # Update remaining weights to compensate
        if i + block_size < W.shape[1]:
            error = (block - W_q[:, i:i+block_size]) @ H_inv[i:i+block_size, i+block_size:]
            W[:, i+block_size:] -= error

    return W_q
2. AWQ (Activation-aware Weight Quantization)
AWQ recognizes that not all weights are equally important - those processing salient activations need higher precision.
[Interactive demo: AWQ Activation-aware Weight Quantization, with an activation heatmap (16 channels × 8 spatial positions) and per-channel analysis]
AWQ Key Insights
- Salient Weight Detection: Identify weights that process important activations
- Per-Channel Scaling: Apply different scales to different channels
- Activation-Aware: Use activation statistics to guide quantization
def awq_quantize(model, calibration_data):
    # Step 1: Identify salient weights
    salience_scores = compute_salience(model, calibration_data)

    # Step 2: Compute per-channel scales
    for layer in model.layers:
        # Find channels with high salience
        important_channels = salience_scores[layer] > threshold

        # Apply protective scaling
        scale = torch.ones(layer.out_features)
        scale[important_channels] *= protection_factor

        # Quantize with adjusted scales
        layer.weight = quantize_with_scale(layer.weight, scale)
3. SmoothQuant
SmoothQuant addresses the challenge of activation quantization by smoothing activation outliers into weights.
[Interactive demo: SmoothQuant Outlier Smoothing, comparing the weight distribution and channel-wise outlier magnitudes after scaling Ŵ = W · diag(s), which leaves Y = X̂Ŵ = XW unchanged]
The Smoothing Transform
SmoothQuant migrates quantization difficulty from activations to weights:
Y = (W × diag(s)) × (diag(s)^(-1) × X) = W' × X'
Where s is a per-channel smoothing factor:
def compute_smoothing_factor(W, X, alpha=0.5):
    """
    Balance quantization difficulty between weights and activations
    """
    # Compute per-channel statistics
    w_max = W.abs().max(dim=0).values
    x_max = X.abs().max(dim=0).values

    # Smoothing factor
    s = (x_max / w_max) ** alpha

    # Apply smoothing
    W_smooth = W * s.unsqueeze(0)
    X_smooth = X / s.unsqueeze(0)

    return W_smooth, X_smooth, s
4. Mixed-Precision Quantization
Not all layers need the same precision. Mixed-precision quantization assigns different bit-widths based on sensitivity analysis.
[Interactive demo: Mixed-Precision Quantization, with per-layer precision configuration and a memory footprint visualization]
def sensitivity_analysis(model, calibration_data):
    sensitivities = {}
    for name, layer in model.named_modules():
        if isinstance(layer, nn.Linear):
            # Quantize this layer
            original_weight = layer.weight.clone()
            layer.weight.data = quantize(layer.weight, bits=4)

            # Measure accuracy drop (positive = worse after quantization)
            accuracy_drop = baseline_accuracy - evaluate(model)
            sensitivities[name] = accuracy_drop

            # Restore original weight
            layer.weight.data = original_weight
    return sensitivities

def assign_bit_widths(sensitivities, bit_budget):
    # Assign more bits to sensitive layers
    sorted_layers = sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
    bit_assignment = {}
    for layer, sensitivity in sorted_layers:
        if sensitivity > threshold:
            bit_assignment[layer] = 8  # Keep sensitive layers at 8-bit
        else:
            bit_assignment[layer] = 4  # Aggressive quantization for others
    return bit_assignment
INT4 Quantization: Pushing the Limits
INT4 quantization achieves 8x compression but requires sophisticated techniques to maintain accuracy.
Challenges of INT4
- Limited Range: Only 16 unique values
- Quantization Noise: High relative error
- Gradient Instability: Difficult to train
- Outlier Sensitivity: Single outliers can dominate range
Group-wise Quantization
To handle INT4's limitations, we use group-wise quantization:
class GroupwiseQuantization:
    def __init__(self, group_size=128, bits=4):
        self.group_size = group_size
        self.bits = bits

    def quantize(self, tensor):
        # Reshape into groups
        orig_shape = tensor.shape
        tensor = tensor.reshape(-1, self.group_size)

        # Quantize each group independently
        scales = []
        quantized_groups = []
        for group in tensor:
            scale = group.abs().max() / (2**(self.bits-1) - 1)
            scales.append(scale)
            q_group = torch.round(group / scale)
            q_group = torch.clamp(q_group, -8, 7)  # INT4 range
            quantized_groups.append(q_group)

        return quantized_groups, scales
Bit Packing for INT4
Efficient storage requires packing two INT4 values into one byte:
[Interactive demo: INT4 Bit Packing, starting from FP32 weights (32 bits each)]
def pack_int4(tensor):
    """Pack two INT4 values into one INT8"""
    assert tensor.shape[-1] % 2 == 0

    # Reshape to separate pairs
    tensor = tensor.reshape(-1, 2)

    # Pack pairs into bytes
    packed = (tensor[:, 0] & 0xF) | ((tensor[:, 1] & 0xF) << 4)
    return packed.to(torch.int8)

def unpack_int4(packed):
    """Unpack INT8 into two INT4 values"""
    # Extract lower 4 bits
    low = (packed & 0xF).to(torch.int8)
    low = torch.where(low > 7, low - 16, low)  # Sign extend

    # Extract upper 4 bits
    high = ((packed >> 4) & 0xF).to(torch.int8)
    high = torch.where(high > 7, high - 16, high)  # Sign extend

    return torch.stack([low, high], dim=-1).reshape(-1)
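A quick round-trip check (assuming the two functions above and integer inputs already clamped to the INT4 range [-8, 7]) confirms that packing halves storage without losing information:

w_int4 = torch.randint(-8, 8, (1024,), dtype=torch.int32)  # values already in [-8, 7]
packed = pack_int4(w_int4)                                  # 512 bytes instead of 1024
restored = unpack_int4(packed)
assert torch.equal(restored.to(torch.int32), w_int4)        # lossless round trip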
Quantization Method Comparison
Let's compare different quantization methods across various metrics:
[Interactive demo: Quantization Methods Comparison, covering PTQ (simple, 8/4-bit), QAT (training-based, 8/4/2-bit), GPTQ (advanced, 4/3-bit), AWQ (advanced, 4-bit), SmoothQuant (advanced, 8-bit), and BitsAndBytes (library, 8/4-bit)]

Method | Bits | Speed | Accuracy | Complexity | Best For
---|---|---|---|---|---
PTQ Min-Max | 8 | Fast | Good | Low | Quick deployment
PTQ Percentile | 8 | Fast | Better | Low | Robust to outliers
QAT | 8/4 | Slow | Best | Medium | Production models
GPTQ | 4 | Medium | Excellent | High | Large models
AWQ | 4 | Medium | Excellent | Medium | LLMs
SmoothQuant | 8 | Fast | Very Good | Low | Activation quantization
Perplexity vs Model Size Tradeoffs
Understanding the relationship between compression and accuracy is crucial for deployment decisions:
[Interactive demo: Perplexity vs Model Size trade-off across quantization methods]
Empirical Results on Common Models
GPT-2 (117M parameters):
- FP32: Perplexity 29.41, Size: 468MB
- INT8: Perplexity 29.52, Size: 117MB
- INT4: Perplexity 30.14, Size: 58.5MB
LLaMA-7B:
- FP32: Perplexity 5.68, Size: 28GB
- INT8: Perplexity 5.71, Size: 7GB
- INT4 (GPTQ): Perplexity 5.85, Size: 3.5GB
- INT4 (AWQ): Perplexity 5.78, Size: 3.5GB
OPT-175B:
- FP32: Perplexity 8.34, Size: 700GB
- INT8: Perplexity 8.38, Size: 175GB
- INT4: Perplexity 8.51, Size: 87.5GB
Implementation Best Practices
1. Calibration Data Selection
import random
from sklearn.cluster import KMeans

def select_calibration_data(dataset, num_samples=1000):
    """Select representative calibration samples"""
    # Strategy 1: Random sampling
    random_samples = random.sample(dataset, num_samples)

    # Strategy 2: Diverse sampling (maximize coverage)
    diverse_samples = []
    embeddings = compute_embeddings(dataset)

    # K-means clustering for diversity
    clusters = KMeans(n_clusters=num_samples).fit(embeddings)
    for center in clusters.cluster_centers_:
        closest_idx = find_nearest(embeddings, center)
        diverse_samples.append(dataset[closest_idx])

    return diverse_samples
2. Layer-wise Bit Assignment
def optimize_bit_assignment(model, target_size_mb):
    """Find optimal per-layer bit widths given size constraint"""
    layer_sizes = get_layer_sizes(model)
    layer_sensitivities = compute_sensitivities(model)

    # Dynamic programming solution
    dp = {}  # (layer_idx, remaining_budget) -> (accuracy, assignment)

    def solve(idx, budget):
        if idx == len(layer_sizes):
            return 0, []
        if (idx, budget) in dp:
            return dp[(idx, budget)]

        best_accuracy = -float('inf')
        best_assignment = []

        # Try different bit widths
        for bits in [4, 6, 8]:
            size = layer_sizes[idx] * bits / 8
            if size <= budget:
                accuracy_loss = layer_sensitivities[idx][bits]
                future_acc, future_assign = solve(idx + 1, budget - size)
                total_acc = -accuracy_loss + future_acc
                if total_acc > best_accuracy:
                    best_accuracy = total_acc
                    best_assignment = [bits] + future_assign

        dp[(idx, budget)] = (best_accuracy, best_assignment)
        return best_accuracy, best_assignment

    return solve(0, target_size_mb)
3. Quantization Pipeline
[Interactive demo: Dynamic vs Static Quantization, showing how quantization parameters adapt over time; dynamic quantization trades runtime overhead and on-the-fly statistics computation for flexibility]
class QuantizationPipeline:
    def __init__(self, method='gptq', bits=4):
        self.method = method
        self.bits = bits

    def quantize_model(self, model, calibration_loader):
        # Step 1: Prepare model
        model.eval()

        # Step 2: Collect statistics
        if self.method in ['gptq', 'awq']:
            statistics = self.collect_activation_statistics(
                model, calibration_loader
            )

        # Step 3: Apply quantization
        quantized_model = self.apply_quantization(model, statistics)

        # Step 4: Verify accuracy
        accuracy = self.validate(quantized_model, calibration_loader)

        return quantized_model, accuracy

    def collect_activation_statistics(self, model, loader):
        statistics = {}
        hooks = []

        def hook_fn(module, input, output, name):
            if name not in statistics:
                statistics[name] = []
            statistics[name].append(output.detach())

        # Register hooks
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                hook = module.register_forward_hook(
                    lambda m, i, o, n=name: hook_fn(m, i, o, n)
                )
                hooks.append(hook)

        # Run calibration
        with torch.no_grad():
            for batch in loader:
                model(batch)

        # Remove hooks
        for hook in hooks:
            hook.remove()

        return statistics
Hardware Considerations
INT8 Hardware Support
Most modern hardware has native INT8 support:
[Interactive demo: INT8 Matrix Multiplication compared against FP32 matrices]
NVIDIA GPUs:
- Tensor Cores (Volta+): 8x throughput vs FP32
- DP4A instruction: 4-element dot product
- INT8 GEMM: Optimized matrix multiplication
Intel CPUs:
- VNNI (Cascade Lake+): Vector Neural Network Instructions
- AMX (Sapphire Rapids+): Advanced Matrix Extensions
ARM CPUs:
- Dot Product instructions (ARMv8.2+)
- Matrix Multiply instructions (ARMv8.6+)
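To see why these instructions map so well onto quantized inference, the sketch below simulates an INT8 GEMM in plain PyTorch: both operands are quantized symmetrically, the matrix multiply accumulates in INT32, and a single rescaling recovers an FP32 result. A real deployment would dispatch to Tensor Core, VNNI, or dot-product kernels rather than this reference computation.

import torch

def quantize_sym_int8(x):
    scale = x.abs().max() / 127
    return torch.clamp(torch.round(x / scale), -127, 127).to(torch.int32), scale

A, B = torch.randn(64, 128), torch.randn(128, 32)
A_q, a_scale = quantize_sym_int8(A)
B_q, b_scale = quantize_sym_int8(B)

# Integer matmul with wide (INT32) accumulation, then one dequantization at the end
C_int32 = A_q @ B_q
C_approx = C_int32.to(torch.float32) * (a_scale * b_scale)

print((C_approx - A @ B).abs().max())  # error is small relative to the entries of A @ B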
INT4 Hardware Support
INT4 support is emerging:
NVIDIA GPUs:
- Ada Lovelace: FP8 and INT4 Tensor Cores
- Hopper: Transformer Engine with dynamic precision
Specialized Hardware:
- Qualcomm Hexagon: INT4 for edge AI
- Google TPU v4: INT4 systolic arrays
Debugging Quantization Issues
Common Problems and Solutions
Problem 1: Activation Outliers
def detect_outliers(activations, threshold=6.0):
    """Detect activation outliers using z-score"""
    mean = activations.mean()
    std = activations.std()
    z_scores = torch.abs((activations - mean) / std)

    outliers = z_scores > threshold
    if outliers.any():
        print(f"Found {outliers.sum()} outliers")
        # Apply clipping or smoothing
        activations = torch.clamp(activations,
                                  mean - threshold * std,
                                  mean + threshold * std)
    return activations
Problem 2: Quantization Bias
def correct_quantization_bias(weights, quantized_weights):
    """Correct systematic bias in quantization"""
    bias = (weights - quantized_weights).mean()

    # Adjust zero point to minimize bias
    corrected = quantized_weights + bias
    return corrected
Problem 3: Layer Collapse
def prevent_layer_collapse(model, min_scale=1e-5):
    """Prevent layers from quantizing to all zeros"""
    for module in model.modules():
        if hasattr(module, 'scale'):
            module.scale.data = torch.clamp(module.scale.data, min=min_scale)
Production Deployment
Quantization for Different Frameworks
PyTorch:
import torch.quantization as quant

# Dynamic quantization (easiest)
model_int8 = quant.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Static quantization (better performance)
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)
# ... run calibration ...
quant.convert(model, inplace=True)
TensorFlow/Keras:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = calibration_generator
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
ONNX Runtime:
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='model.onnx',
    model_output='model_int8.onnx',
    weight_type=QuantType.QInt8
)
Serving Quantized Models
class QuantizedModelServer:
    def __init__(self, model_path, quantization_config):
        self.model = self.load_quantized_model(model_path)
        self.config = quantization_config

    def preprocess(self, input_data):
        # Scale inputs if needed
        if self.config.quantize_inputs:
            input_data = self.quantize_tensor(
                input_data,
                self.config.input_scale,
                self.config.input_zero_point
            )
        return input_data

    def inference(self, input_data):
        with torch.no_grad():
            # Run quantized inference
            output = self.model(input_data)

            # Dequantize output if needed
            if self.config.quantized_output:
                output = self.dequantize_tensor(
                    output,
                    self.config.output_scale,
                    self.config.output_zero_point
                )
        return output

    def benchmark(self, num_runs=100):
        dummy_input = torch.randn(1, 512)

        # Warmup
        for _ in range(10):
            self.inference(dummy_input)

        # Benchmark
        start = time.time()
        for _ in range(num_runs):
            self.inference(dummy_input)

        avg_latency = (time.time() - start) / num_runs * 1000
        return avg_latency
Future Directions
Emerging Techniques
- Learned Step Size Quantization (LSQ): Learning optimal quantization parameters end-to-end (see the sketch after this list)
- Mixed-Bit Networks: Different bits for different samples
- Gradient Quantization: Quantizing gradients for distributed training
- Outlier-Aware Quantization: Special handling for outlier weights/activations
- Neural Architecture Search for Quantization: Jointly optimizing architecture and quantization
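As a flavor of the first item above, the sketch below is a minimal LSQ-style quantizer: the step size is a learnable parameter trained with a straight-through estimator and the gradient rescaling used in the LSQ paper. Treat it as an illustration of the idea under those assumptions, not a faithful reproduction of the published method.

import math
import torch
import torch.nn as nn

def grad_scale(x, scale):
    # Forward value is x; the backward gradient is multiplied by `scale`
    return (x - x * scale).detach() + x * scale

def round_ste(x):
    # Straight-through estimator for rounding
    return (x.round() - x).detach() + x

class LSQQuantizer(nn.Module):
    def __init__(self, num_bits=4):
        super().__init__()
        self.qn = -(2 ** (num_bits - 1))
        self.qp = 2 ** (num_bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(1.0))  # learnable step size s

    def forward(self, x):
        g = 1.0 / math.sqrt(x.numel() * self.qp)     # LSQ gradient rescaling
        s = grad_scale(self.step, g)
        x_q = round_ste(torch.clamp(x / s, self.qn, self.qp))
        return x_q * s

In practice the step size is initialized from the data rather than at 1.0 (the original paper uses a value on the order of 2·mean(|x|)/√Qp from the first batch).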
Research Frontiers
Sub-4-bit Quantization:
- Binary (1-bit) and ternary (2-bit) networks
- Learned codebooks for extreme compression
- Product quantization for large embeddings
Hardware-Software Co-design:
- Custom quantization for specific hardware
- Compiler optimizations for quantized models
- Automated precision tuning
Conclusion
Quantization has evolved from a simple compression technique to a sophisticated field combining optimization theory, hardware design, and deep learning. Modern methods like GPTQ, AWQ, and SmoothQuant enable extreme compression while maintaining accuracy, making billion-parameter models accessible on consumer hardware.
The journey from FP32 to INT4 represents an 8x reduction in memory and, on hardware with low-precision support, substantial speedups in computation. As models continue to grow and edge deployment becomes critical, quantization will remain at the forefront of efficient AI.
Key takeaways:
- Start with INT8: Often provides 4x compression with minimal accuracy loss
- Use PTQ for speed: When you need quick deployment and can tolerate small accuracy drops
- Apply QAT for quality: When accuracy is critical and you can afford retraining
- Consider GPTQ/AWQ for LLMs: State-of-the-art methods for extreme compression
- Profile everything: Measure latency, memory, and accuracy for your specific use case
References
- Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- A Survey of Quantization Methods for Efficient Neural Network Inference
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization