Gradient Descent

Visualize gradient descent optimization - how neural networks learn by following gradients.

Gradient Descent Optimization

Gradient descent is the workhorse of deep learning - it's how neural networks learn. Watch how it finds the minimum of a loss function step by step.

[Interactive visualization: the loss function, the gradient descent path, and the optimal point]

The Core Idea

Gradient descent minimizes a function by taking steps proportional to the negative gradient:

\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)

Where:

  • θ: Parameters (weights)
  • α: Learning rate (step size)
  • ∇f: Gradient (direction of steepest increase)
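
For example, starting from θ = 5 on f(θ) = θ² with α = 0.1: the gradient is ∇f(5) = 10, so one update gives θ ← 5 - 0.1 × 10 = 4. Repeating the update walks θ toward the minimum at 0, as the code below traces out.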

Simple Example

Minimize f(x) = x²:

def f(x):
    return x**2

def gradient(x):
    return 2*x   # d/dx of x^2

# Gradient descent
x = 5.0              # Start point
learning_rate = 0.1

for i in range(20):
    grad = gradient(x)
    x = x - learning_rate * grad
    print(f"Step {i}: x={x:.3f}, f(x)={f(x):.3f}")

# Each step multiplies x by (1 - 2 * learning_rate) = 0.8,
# so x converges geometrically to 0, the minimum

Neural Network Training

import torch
import torch.nn as nn

# Model
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy data so the loop runs: 100 samples, 10 features
X = torch.randn(100, 10)
y = torch.randn(100, 1)

# Training loop
for epoch in range(100):
    # Forward pass
    predictions = model(X)
    loss = loss_fn(predictions, y)

    # Backward pass
    optimizer.zero_grad()   # Clear previous gradients
    loss.backward()         # Compute gradients
    optimizer.step()        # Update weights

Variants of Gradient Descent

1. Batch Gradient Descent

Uses the entire dataset for each update:

# Compute gradient averaged over all data
gradient = (1/N) * sum(grad_fn(x[i], y[i]) for i in range(N))
weights = weights - learning_rate * gradient

2. Stochastic Gradient Descent (SGD)

Uses one sample at a time:

for x_i, y_i in dataset:
    gradient = grad_fn(x_i, y_i)
    weights = weights - learning_rate * gradient

3. Mini-batch Gradient Descent

Balances the stability of batch updates with the speed of SGD:

for batch in dataloader:
    gradient = (1/batch_size) * sum(grad_fn(x, y) for x, y in batch)
    weights = weights - learning_rate * gradient
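
The dataloader above corresponds to, for example, a PyTorch DataLoader; a minimal sketch with hypothetical random data:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 hypothetical samples with 10 features, drawn in shuffled batches of 32
dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)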

Learning Rate Matters

  • Too small → Slow convergence
  • Too large → Oscillation or divergence

import numpy as np

# Learning rate schedules
def exponential_decay(initial_lr, epoch, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)

def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def cosine_annealing(initial_lr, epoch, total_epochs):
    return initial_lr * (1 + np.cos(np.pi * epoch / total_epochs)) / 2
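
PyTorch ships built-in equivalents of these schedules; a minimal sketch using StepLR, which mirrors step_decay above (CosineAnnealingLR similarly mirrors cosine_annealing):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(100):
    # ... forward pass, backward pass, optimizer.step() ...
    scheduler.step()   # Advance the schedule once per epoch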

Momentum Methods

Classical Momentum

Accumulates velocity:

velocity = 0
momentum = 0.9

for step in range(num_steps):
    gradient = compute_gradient(weights)
    velocity = momentum * velocity - learning_rate * gradient
    weights = weights + velocity
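
In PyTorch, momentum is a single argument to the SGD optimizer; a minimal sketch with a hypothetical model (note that PyTorch's exact momentum formulation differs slightly from the pseudocode above):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
# momentum=0.9 accumulates a velocity term; nesterov=True would
# switch to Nesterov momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)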

Adam Optimizer

Adaptive learning rates:

# Combines momentum and RMSprop
m = beta1 * m + (1 - beta1) * gradient        # Momentum
v = beta2 * v + (1 - beta2) * gradient**2     # RMSprop
m_hat = m / (1 - beta1**t)                    # Bias correction
v_hat = v / (1 - beta2**t)
weights = weights - lr * m_hat / (sqrt(v_hat) + epsilon)
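
The pseudocode above omits initialization and the step counter; a self-contained NumPy sketch of the full loop (grad_fn, the toy objective, and the learning rate here are assumptions for illustration):

import numpy as np

def adam(grad_fn, weights, lr=0.001, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_steps=1000):
    m = np.zeros_like(weights)   # First moment (momentum)
    v = np.zeros_like(weights)   # Second moment (RMSprop-style)
    for t in range(1, num_steps + 1):
        gradient = grad_fn(weights)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient**2
        m_hat = m / (1 - beta1**t)   # Bias correction
        v_hat = v / (1 - beta2**t)
        weights = weights - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return weights

# Minimize f(w) = ||w||^2, whose gradient is 2w
w = adam(lambda w: 2 * w, np.array([3.0, -4.0]), lr=0.1)
print(w)   # Near [0, 0]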

Challenges & Solutions

1. Local Minima

  • Problem: Stuck in suboptimal solutions
  • Solution: Momentum, random restarts, simulated annealing

2. Saddle Points

  • Problem: Zero gradient but not minimum
  • Solution: Second-order methods (Newton, L-BFGS)

3. Vanishing/Exploding Gradients

  • Problem: Gradients become too small/large
  • Solution: Batch normalization, gradient clipping (sketched below), careful initialization
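
Gradient clipping rescales gradients whose overall norm exceeds a threshold; a minimal PyTorch sketch with a hypothetical model and random data:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()            # Populate the .grad fields

# Rescale all gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)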

Gradient Computation

Manual Calculation

# Forward pass
x = 2
w = 3
b = 1
y = w * x + b   # y = 7

# Backward pass (chain rule)
dy_dw = x   # = 2
dy_db = 1   # = 1
dy_dx = w   # = 3

Automatic Differentiation

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
y.backward()

print(w.grad)  # 2.0
print(b.grad)  # 1.0
print(x.grad)  # 3.0

Convergence Analysis

Convex Functions

  • Unique global minimum
  • Guaranteed convergence with proper learning rate
  • Rate: O(1/√t) for SGD, O(1/t) for full-batch gradient descent, O(1/t²) for accelerated (Nesterov) methods

Non-convex Functions (Neural Networks)

  • Multiple local minima
  • No convergence guarantees
  • Empirically works well in practice

Practical Tips

  1. Normalize inputs: Zero mean, unit variance (tips 1-2 are sketched after this list)
  2. Initialize carefully: Xavier/He initialization
  3. Monitor loss: Check for divergence
  4. Use validation set: Detect overfitting
  5. Gradient clipping: Prevent explosions
  6. Batch normalization: Stabilize training
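
A minimal PyTorch sketch of the first two tips (the data and layer are hypothetical):

import torch
import torch.nn as nn

# Tip 1: standardize features to zero mean, unit variance
X = torch.randn(100, 10) * 5 + 3   # Hypothetical raw data
X = (X - X.mean(dim=0)) / X.std(dim=0)

# Tip 2: He (Kaiming) initialization, designed for ReLU networks
layer = nn.Linear(10, 64)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')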

Advanced Topics

Second-Order Methods

Uses the Hessian (second derivatives):

# Newton's method
gradient = compute_gradient(weights)
hessian = compute_hessian(weights)
weights = weights - np.linalg.inv(hessian) @ gradient
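
In practice the Hessian is rarely inverted explicitly; solving the linear system H · step = ∇f is cheaper and more numerically stable. A self-contained sketch on a quadratic, where a single Newton step lands exactly on the minimum:

import numpy as np

# f(w) = 0.5 * w^T A w - b^T w, so gradient = A @ w - b and Hessian = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
weights = np.zeros(2)

gradient = A @ weights - b
step = np.linalg.solve(A, gradient)   # Solve instead of inverting
weights = weights - step              # Exact minimum: A @ weights == b
print(weights)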

Natural Gradient

Uses the Fisher information matrix:

# Better for probability distributions
gradient = compute_gradient(weights)
fisher = compute_fisher_matrix(weights)
weights = weights - learning_rate * np.linalg.inv(fisher) @ gradient

Next Steps

If you found this explanation helpful, consider sharing it with others.
