Batch Normalization

Understanding batch normalization, a technique that normalizes layer inputs to accelerate training and improve neural network performance.


Batch Normalization Explained

Batch normalization is a technique that normalizes the inputs to each layer in a neural network, accelerating training and improving model performance. It addresses the internal covariate shift problem by ensuring that layer inputs maintain consistent statistical properties throughout training.

Interactive Visualization

The demo steps through batch normalization on a small batch of data: each row of the input batch is a sample, and each column is a feature.

Batch Normalization Formula

μ = (1/m) × Σ xᵢ
σ² = (1/m) × Σ (xᵢ - μ)²
x̂ = (x - μ) / √(σ² + ε)
y = γ × x̂ + β

Input Batch

Sample      F0       F1       F2       F3
0        -0.96    15.85    23.19    25.19
1        -1.71    13.96    21.25    27.30
2        -2.76     9.92    14.74    36.08
3         2.25    13.01    16.39    31.11
4        -2.88    11.44    17.65    28.10
5        -1.42     9.79     9.10    33.09
6         4.48    14.81    20.23    28.16
7         2.09    13.21    17.62    35.79

The remaining panels (Normalized Data, Output, Batch Statistics with per-feature mean μ, variance σ², and standard deviation σ, and Learnable Parameters & Running Stats with γ, β, running μ, and running σ²) are filled in as the demo steps through the computation.

Distribution Evolution

How distributions change through batch normalization

For each of the four features, the demo shows three histograms: the Input Distribution (the original data, with its own mean and scale), the Normalized Distribution (zero mean, unit variance), and the Output Distribution (after applying γ and β; with γ = 1 and β = 0 it matches the normalized data).

Distribution Insights

Input: Original data with varying means and scales per feature

Normalized: Centered at zero with unit variance for every feature

Output: Allows network to learn optimal scale (γ) and shift (β) for each feature

💡 Try changing the input distributions and the γ/β parameters in the demo to see how the output distribution changes.

Key Insights

  • Batch Normalization normalizes inputs to have zero mean and unit variance per feature
  • Training mode uses batch statistics; Inference mode uses running averages
  • γ and β are learnable parameters that allow the network to undo normalization if needed
  • Benefits: Faster training, higher learning rates, less sensitive to initialization
  • Reduces internal covariate shift - the change in the distribution of layer inputs during training
  • ε = 0.00001 prevents division by zero in the normalization

The Problem: Internal Covariate Shift

Batch normalization addresses a fundamental problem in deep learning called Internal Covariate Shift. This refers to the change in distribution of layer inputs during training, which makes training slower and less stable.

How Batch Normalization Works

Batch normalization normalizes the inputs to a layer by adjusting and scaling the activations. For each feature in a batch, it:

  1. Computes batch statistics: Calculate the mean (μ) and variance (σ²) across the batch
  2. Normalizes: Transform inputs to have zero mean and unit variance
  3. Scales and shifts: Apply learnable parameters γ (scale) and β (shift) to restore representational power
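
A minimal NumPy sketch of these three steps for a fully connected layer, applied to the demo batch from the table above (the function name batch_norm_forward and the eps default are illustrative choices, not a particular library's API):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (batch, features), one set of statistics per feature."""
    mu = x.mean(axis=0)                      # 1. batch mean per feature
    var = x.var(axis=0)                      # 1. batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # 2. normalize to zero mean, unit variance
    return gamma * x_hat + beta              # 3. learnable scale and shift

x = np.array([
    [-0.96, 15.85, 23.19, 25.19],
    [-1.71, 13.96, 21.25, 27.30],
    [-2.76,  9.92, 14.74, 36.08],
    [ 2.25, 13.01, 16.39, 31.11],
    [-2.88, 11.44, 17.65, 28.10],
    [-1.42,  9.79,  9.10, 33.09],
    [ 4.48, 14.81, 20.23, 28.16],
    [ 2.09, 13.21, 17.62, 35.79],
])
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))   # approximately 0 for every feature
print(y.var(axis=0))    # approximately 1 for every feature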

Mathematical Formula

The batch normalization transformation can be expressed as:

μ = (1/m) × Σᵢ₌₁ᵐ xᵢ

Batch mean

σ² = (1/m) × Σᵢ₌₁ᵐ (xᵢ - μ)²

Batch variance

x̂ = (x - μ) / √(σ² + ε)

Normalize

y = γ × x̂ + β

Scale and shift

Where:

  • m is the batch size
  • xᵢ are the input activations
  • ε is a small constant (typically 1e-5) to prevent division by zero
  • γ and β are learnable parameters

Training vs Inference

Training Mode

  • Uses current batch statistics (μ, σ²) for normalization
  • Updates running averages of mean and variance using exponential moving average
  • Learnable parameters γ and β are updated through backpropagation

Inference Mode

  • Uses running averages computed during training instead of batch statistics
  • This ensures consistent behavior regardless of batch size during inference
  • No updates to running statistics or learnable parameters
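
The two modes can be sketched as a small stateful layer; the class name and attribute names are assumptions for illustration, the momentum here weights the old running value, and gradient updates to γ and β are omitted:

import numpy as np

class SimpleBatchNorm:
    """Toy 1D batch norm with separate training and inference behavior."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)         # learnable scale
        self.beta = np.zeros(num_features)         # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average of the batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # inference: reuse the statistics accumulated during training
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta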

Key Benefits

1. Accelerated Training

  • Enables higher learning rates by reducing sensitivity to parameter initialization
  • Networks converge faster and more reliably

2. Improved Gradient Flow

  • Reduces internal covariate shift - the change in distribution of layer inputs during training
  • Helps mitigate vanishing/exploding gradient problems

3. Regularization Effect

  • Acts as implicit regularization by adding noise through batch statistics
  • Often reduces the need for other regularization techniques like dropout

4. Reduced Sensitivity to Initialization

  • Networks become less dependent on careful weight initialization
  • More robust training across different initialization schemes

Implementation Considerations

Placement in Network

# Typical placement: after linear transformation, before activation
x = conv_layer(x)
x = batch_norm(x)
x = activation(x)
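
In a framework such as PyTorch, that placement might look like the following minimal sketch (the layer sizes are arbitrary):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # linear transformation
    nn.BatchNorm2d(64),                          # batch norm, one gamma/beta per channel
    nn.ReLU(),                                   # activation
)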

Channel-wise Normalization

  • For convolutional layers, normalization is applied per channel
  • Each channel has its own γ and β parameters
  • Statistics are computed across batch, height, and width dimensions
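
A rough NumPy sketch of those channel-wise statistics on an NCHW activation tensor (the tensor shape and channel count are arbitrary example values):

import numpy as np

x = np.random.randn(16, 32, 28, 28)            # (batch, channels, height, width)
mu = x.mean(axis=(0, 2, 3), keepdims=True)     # one mean per channel
var = x.var(axis=(0, 2, 3), keepdims=True)     # one variance per channel
x_hat = (x - mu) / np.sqrt(var + 1e-5)

gamma = np.ones((1, 32, 1, 1))                 # one learnable scale per channel
beta = np.zeros((1, 32, 1, 1))                 # one learnable shift per channel
y = gamma * x_hat + beta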

Momentum Parameter

  • Controls the update rate of running statistics
  • Typical value: 0.9 (90% old value, 10% new batch value)
  • Higher momentum = more stable running statistics
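
With the convention used here, the running statistics are updated by an exponential moving average like the one below (the vectors are arbitrary example values; note that some frameworks define momentum as the weight on the new batch value instead, so it is worth checking the documentation):

import numpy as np

momentum = 0.9                                   # keep 90% of the old running value
running_mean = np.zeros(4)
batch_mean = np.array([0.5, -1.2, 3.0, 0.1])     # mean of the current batch
running_mean = momentum * running_mean + (1 - momentum) * batch_mean
print(running_mean)                              # 10% of the new batch mean blended in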

Variations and Extensions

Layer Normalization

  • Normalizes across features instead of batch dimension
  • Better for variable-length sequences (RNNs, Transformers)
  • Not dependent on batch size
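
One way to see the difference is in which axis the statistics are computed over; a sketch assuming a (batch, features) array:

import numpy as np

x = np.random.randn(8, 4)   # (batch, features)
eps = 1e-5

# Batch norm: statistics per feature, computed across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm: statistics per sample, computed across the features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)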

Group Normalization

  • Divides channels into groups and normalizes within each group
  • Effective for small batch sizes where batch statistics are unreliable

Instance Normalization

  • Normalizes each sample independently
  • Popular in style transfer and generative models
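
Instance and group normalization likewise differ mainly in which axes the statistics are computed over; a sketch on an NCHW tensor, with an arbitrary choice of 8 groups:

import numpy as np

x = np.random.randn(16, 32, 28, 28)   # (batch, channels, height, width)
eps = 1e-5

# Instance norm: per sample and per channel, over the spatial dimensions
inst = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

# Group norm: split channels into groups, normalize within each group per sample
groups = 8
xg = x.reshape(16, groups, 32 // groups, 28, 28)
gn = (xg - xg.mean(axis=(2, 3, 4), keepdims=True)) / np.sqrt(xg.var(axis=(2, 3, 4), keepdims=True) + eps)
gn = gn.reshape(16, 32, 28, 28)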

When to Use Batch Normalization

  • Convolutional Neural Networks - Especially deep architectures
  • Fully Connected Networks - When training deep feedforward networks
  • Computer Vision Tasks - Image classification, detection, segmentation

Consider Alternatives For:

  • Small Batch Sizes - Batch statistics become unreliable
  • Recurrent Networks - Layer normalization often works better
  • Online Learning - When batch statistics aren't available

Common Pitfalls

1. Batch Size Dependency

  • Very small batches lead to noisy statistics
  • A batch size of 1 makes batch norm on convolutional features equivalent to instance norm (and degenerate for fully connected layers, where each feature's batch variance is zero)

2. Training/Inference Mismatch

  • Must ensure proper mode switching between training and inference
  • Running statistics must be properly maintained
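
In PyTorch, for example, the fix is to switch modes explicitly; a minimal illustration with arbitrary layer sizes:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU())

model.train()                     # uses batch statistics and updates the running averages
out = model(torch.randn(32, 4))

model.eval()                      # uses the running averages; nothing is updated
with torch.no_grad():
    out = model(torch.randn(1, 4))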

3. Learning Rate Adjustment

  • Batch norm allows higher learning rates, but requires tuning
  • Learning rates that are too high can still cause instability

Related Concepts

  • Skip Connections - Often used together with batch norm in modern architectures
  • Internal Covariate Shift - The problem batch normalization addresses
  • Gradient Flow - How batch norm improves gradient propagation
  • Layer Normalization - Alternative normalization technique
  • Residual Networks - Architecture that popularized batch normalization

If you found this explanation helpful, consider sharing it with others.
