Gradient Descent
Visualize gradient descent optimization - how neural networks learn by following gradients.
Best viewed on desktop for optimal interactive experience
Gradient Descent Optimization
Gradient descent is the workhorse of deep learning - it's how neural networks learn. Watch how it finds the minimum of a loss function step by step.
Controls
Visualization
Loss function
Gradient descent path
Optimal point
The Core Idea
Gradient descent minimizes a function by taking steps proportional to the negative gradient:
Where:
- θ: Parameters (weights)
- α: Learning rate (step size)
- ∇f: Gradient (direction of steepest increase)
Simple Example
Minimize f(x) = x²:
import numpy as np def f(x): return x**2 def gradient(x): return 2*x # Gradient descent x = 5.0 # Start point learning_rate = 0.1 for i in range(20): grad = gradient(x) x = x - learning_rate * grad print(f"Step {i}: x={x:.3f}, f(x)={f(x):.3f}") # Converges to x=0, the minimum
Neural Network Training
import torch import torch.nn as nn # Model model = nn.Linear(10, 1) loss_fn = nn.MSELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.01) # Training loop for epoch in range(100): # Forward pass predictions = model(X) loss = loss_fn(predictions, y) # Backward pass optimizer.zero_grad() # Clear previous gradients loss.backward() # Compute gradients optimizer.step() # Update weights
Variants of Gradient Descent
1. Batch Gradient Descent
Uses entire dataset for each update:
# Compute gradient over all data gradient = (1/N) * sum([grad_fn(x_i, y_i) for i in range(N)]) weights = weights - learning_rate * gradient
2. Stochastic Gradient Descent (SGD)
Uses one sample at a time:
for x_i, y_i in dataset: gradient = grad_fn(x_i, y_i) weights = weights - learning_rate * gradient
3. Mini-batch Gradient Descent
Best of both worlds:
for batch in dataloader: gradient = (1/batch_size) * sum([grad_fn(x, y) for x, y in batch]) weights = weights - learning_rate * gradient
Learning Rate Matters
Too small → Slow convergence Too large → Oscillation or divergence
# Learning rate schedules def exponential_decay(initial_lr, epoch, decay_rate=0.96): return initial_lr * (decay_rate ** epoch) def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10): return initial_lr * (drop_rate ** (epoch // epochs_drop)) def cosine_annealing(initial_lr, epoch, total_epochs): return initial_lr * (1 + np.cos(np.pi * epoch / total_epochs)) / 2
Momentum Methods
Classical Momentum
Accumulates velocity:
velocity = 0 momentum = 0.9 for step in range(num_steps): gradient = compute_gradient(weights) velocity = momentum * velocity - learning_rate * gradient weights = weights + velocity
Adam Optimizer
Adaptive learning rates:
# Combines momentum and RMSprop m = beta1 * m + (1 - beta1) * gradient # Momentum v = beta2 * v + (1 - beta2) * gradient**2 # RMSprop m_hat = m / (1 - beta1**t) # Bias correction v_hat = v / (1 - beta2**t) weights = weights - lr * m_hat / (sqrt(v_hat) + epsilon)
Challenges & Solutions
1. Local Minima
- Problem: Stuck in suboptimal solutions
- Solution: Momentum, random restarts, simulated annealing
2. Saddle Points
- Problem: Zero gradient but not minimum
- Solution: Second-order methods (Newton, L-BFGS)
3. Vanishing/Exploding Gradients
- Problem: Gradients become too small/large
- Solution: Batch normalization, gradient clipping, careful initialization
Gradient Computation
Manual Calculation
# Forward pass x = 2 w = 3 b = 1 y = w * x + b # y = 7 # Backward pass (chain rule) dy_dw = x # = 2 dy_db = 1 # = 1 dy_dx = w # = 3
Automatic Differentiation
import torch x = torch.tensor(2.0, requires_grad=True) w = torch.tensor(3.0, requires_grad=True) b = torch.tensor(1.0, requires_grad=True) y = w * x + b y.backward() print(w.grad) # 2.0 print(b.grad) # 1.0 print(x.grad) # 3.0
Convergence Analysis
Convex Functions
- Unique global minimum
- Guaranteed convergence with proper learning rate
- Rate: O(1/t) for SGD, O(1/t²) for accelerated methods
Non-convex Functions (Neural Networks)
- Multiple local minima
- No convergence guarantees
- Empirically works well in practice
Practical Tips
- Normalize inputs: Zero mean, unit variance
- Initialize carefully: Xavier/He initialization
- Monitor loss: Check for divergence
- Use validation set: Detect overfitting
- Gradient clipping: Prevent explosions
- Batch normalization: Stabilize training
Advanced Topics
Second-Order Methods
Use Hessian (second derivatives):
# Newton's method gradient = compute_gradient(weights) hessian = compute_hessian(weights) weights = weights - np.linalg.inv(hessian) @ gradient
Natural Gradient
Uses Fisher information matrix:
# Better for probability distributions gradient = compute_gradient(weights) fisher = compute_fisher_matrix(weights) weights = weights - learning_rate * np.linalg.inv(fisher) @ gradient
Next Steps
- Explore Optimization Landscapes
- Learn about Chain Rule
- Understand Momentum Methods