Gradient Descent

Visualize gradient descent optimization - how neural networks learn by following gradients.

Gradient Descent Optimization

Gradient descent is the workhorse of deep learning - it's how neural networks learn. Watch how it finds the minimum of a loss function step by step.

[Interactive visualization: the loss function, the gradient descent path, and the optimal point]

The Core Idea

Gradient descent minimizes a function by taking steps proportional to the negative gradient:

\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)

Where:

  • θ: Parameters (weights)
  • α: Learning rate (step size)
  • ∇f: Gradient (direction of steepest increase)
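
For example, starting from θ = 5 on f(θ) = θ² with α = 0.1: the gradient is ∇f(5) = 10, so one update gives θ ← 5 - 0.1 × 10 = 4. Repeating the update walks θ toward the minimum at 0, as the code below traces out.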

Simple Example

Minimize f(x) = x²:

def f(x):
    return x**2

def gradient(x):
    return 2*x   # d/dx of x^2

# Gradient descent
x = 5.0              # Start point
learning_rate = 0.1

for i in range(20):
    grad = gradient(x)
    x = x - learning_rate * grad
    print(f"Step {i}: x={x:.3f}, f(x)={f(x):.3f}")

# Each step multiplies x by (1 - 2 * learning_rate) = 0.8,
# so x converges geometrically to 0, the minimum

Neural Network Training

import torch
import torch.nn as nn

# Model
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy data so the loop runs: 100 samples, 10 features
X = torch.randn(100, 10)
y = torch.randn(100, 1)

# Training loop
for epoch in range(100):
    # Forward pass
    predictions = model(X)
    loss = loss_fn(predictions, y)

    # Backward pass
    optimizer.zero_grad()   # Clear previous gradients
    loss.backward()         # Compute gradients
    optimizer.step()        # Update weights

Variants of Gradient Descent

1. Batch Gradient Descent

Uses the entire dataset for each update:

# Compute gradient averaged over all data
gradient = (1/N) * sum(grad_fn(x[i], y[i]) for i in range(N))
weights = weights - learning_rate * gradient

2. Stochastic Gradient Descent (SGD)

Uses one sample at a time:

for x_i, y_i in dataset:
    gradient = grad_fn(x_i, y_i)
    weights = weights - learning_rate * gradient

3. Mini-batch Gradient Descent

Balances the stability of batch updates with the speed of SGD:

for batch in dataloader:
    gradient = (1/batch_size) * sum(grad_fn(x, y) for x, y in batch)
    weights = weights - learning_rate * gradient
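
The dataloader above corresponds to, for example, a PyTorch DataLoader; a minimal sketch with hypothetical random data:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 hypothetical samples with 10 features, drawn in shuffled batches of 32
dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)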

Learning Rate Matters

  • Too small → Slow convergence
  • Too large → Oscillation or divergence

import numpy as np

# Learning rate schedules
def exponential_decay(initial_lr, epoch, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)

def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def cosine_annealing(initial_lr, epoch, total_epochs):
    return initial_lr * (1 + np.cos(np.pi * epoch / total_epochs)) / 2
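
PyTorch ships built-in equivalents of these schedules; a minimal sketch using StepLR, which mirrors step_decay above (CosineAnnealingLR similarly mirrors cosine_annealing):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(100):
    # ... forward pass, backward pass, optimizer.step() ...
    scheduler.step()   # Advance the schedule once per epoch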

Momentum Methods

Classical Momentum

Accumulates velocity:

velocity = 0
momentum = 0.9

for step in range(num_steps):
    gradient = compute_gradient(weights)
    velocity = momentum * velocity - learning_rate * gradient
    weights = weights + velocity
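
In PyTorch, momentum is a single argument to the SGD optimizer; a minimal sketch with a hypothetical model (note that PyTorch's exact momentum formulation differs slightly from the pseudocode above):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
# momentum=0.9 accumulates a velocity term; nesterov=True would
# switch to Nesterov momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)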

Adam Optimizer

Adaptive learning rates:

# Combines momentum and RMSprop
m = beta1 * m + (1 - beta1) * gradient        # Momentum
v = beta2 * v + (1 - beta2) * gradient**2     # RMSprop
m_hat = m / (1 - beta1**t)                    # Bias correction
v_hat = v / (1 - beta2**t)
weights = weights - lr * m_hat / (sqrt(v_hat) + epsilon)
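
The pseudocode above omits initialization and the step counter; a self-contained NumPy sketch of the full loop (grad_fn, the toy objective, and the learning rate here are assumptions for illustration):

import numpy as np

def adam(grad_fn, weights, lr=0.001, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_steps=1000):
    m = np.zeros_like(weights)   # First moment (momentum)
    v = np.zeros_like(weights)   # Second moment (RMSprop-style)
    for t in range(1, num_steps + 1):
        gradient = grad_fn(weights)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient**2
        m_hat = m / (1 - beta1**t)   # Bias correction
        v_hat = v / (1 - beta2**t)
        weights = weights - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return weights

# Minimize f(w) = ||w||^2, whose gradient is 2w
w = adam(lambda w: 2 * w, np.array([3.0, -4.0]), lr=0.1)
print(w)   # Near [0, 0]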

Challenges & Solutions

1. Local Minima

  • Problem: Stuck in suboptimal solutions
  • Solution: Momentum, random restarts, simulated annealing

2. Saddle Points

  • Problem: Zero gradient but not minimum
  • Solution: Second-order methods (Newton, L-BFGS)

3. Vanishing/Exploding Gradients

  • Problem: Gradients become too small/large
  • Solution: Batch normalization, gradient clipping (sketched below), careful initialization
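
Gradient clipping rescales gradients whose overall norm exceeds a threshold; a minimal PyTorch sketch with a hypothetical model and random data:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # Hypothetical model
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()            # Populate the .grad fields

# Rescale all gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)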

Gradient Computation

Manual Calculation

# Forward pass
x = 2
w = 3
b = 1
y = w * x + b   # y = 7

# Backward pass (chain rule)
dy_dw = x   # = 2
dy_db = 1   # = 1
dy_dx = w   # = 3

Automatic Differentiation

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
y.backward()

print(w.grad)  # 2.0
print(b.grad)  # 1.0
print(x.grad)  # 3.0

Convergence Analysis

Convex Functions

  • Unique global minimum
  • Guaranteed convergence with proper learning rate
  • Rate: O(1/√t) for SGD, O(1/t) for full-batch gradient descent, O(1/t²) for accelerated (Nesterov) methods

Non-convex Functions (Neural Networks)

  • Multiple local minima
  • No convergence guarantees
  • Empirically works well in practice

Practical Tips

  1. Normalize inputs: Zero mean, unit variance (tips 1-2 are sketched after this list)
  2. Initialize carefully: Xavier/He initialization
  3. Monitor loss: Check for divergence
  4. Use validation set: Detect overfitting
  5. Gradient clipping: Prevent explosions
  6. Batch normalization: Stabilize training
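
A minimal PyTorch sketch of the first two tips (the data and layer are hypothetical):

import torch
import torch.nn as nn

# Tip 1: standardize features to zero mean, unit variance
X = torch.randn(100, 10) * 5 + 3   # Hypothetical raw data
X = (X - X.mean(dim=0)) / X.std(dim=0)

# Tip 2: He (Kaiming) initialization, designed for ReLU networks
layer = nn.Linear(10, 64)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')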

Advanced Topics

Second-Order Methods

Uses the Hessian (second derivatives):

# Newton's method
gradient = compute_gradient(weights)
hessian = compute_hessian(weights)
weights = weights - np.linalg.inv(hessian) @ gradient
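
In practice the Hessian is rarely inverted explicitly; solving the linear system H · step = ∇f is cheaper and more numerically stable. A self-contained sketch on a quadratic, where a single Newton step lands exactly on the minimum:

import numpy as np

# f(w) = 0.5 * w^T A w - b^T w, so gradient = A @ w - b and Hessian = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
weights = np.zeros(2)

gradient = A @ weights - b
step = np.linalg.solve(A, gradient)   # Solve instead of inverting
weights = weights - step              # Exact minimum: A @ weights == b
print(weights)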

Natural Gradient

Uses the Fisher information matrix:

# Better for probability distributions
gradient = compute_gradient(weights)
fisher = compute_fisher_matrix(weights)
weights = weights - learning_rate * np.linalg.inv(fisher) @ gradient

Next Steps

If you found this explanation helpful, consider sharing it with others.
