Calculus for Machine Learning
Essential calculus concepts for understanding gradients, optimization, and backpropagation
Why Calculus in Machine Learning?
Calculus is the mathematics of change and is essential for:
- Optimization: Finding minima/maxima of loss functions
- Gradient Descent: Computing direction of steepest descent
- Backpropagation: Propagating errors through neural networks
- Understanding Dynamics: How models change during training
Interactive Visualization
[Interactive widget: plots a function with its tangent line. The derivative f'(x) gives the instantaneous rate of change; a flat tangent marks a critical point.]
Derivatives: The Foundation
What is a Derivative?
The derivative measures the rate of change of a function:
f'(x) = lim[h→0] (f(x+h) - f(x)) / h
Geometric Interpretation: Slope of the tangent line at a point
Physical Interpretation: Instantaneous rate of change
Common Derivatives
| Function | Derivative |
|---|---|
| c (constant) | 0 |
| x^n | nx^(n-1) |
| e^x | e^x |
| ln(x) | 1/x |
| sin(x) | cos(x) |
| cos(x) | -sin(x) |
| sigmoid(x) | sigmoid(x)(1 - sigmoid(x)) |
| tanh(x) | 1 - tanh²(x) |
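As a sanity check, the trickier rows can be verified symbolically. A minimal sketch using SymPy (an assumption here; the text does not prescribe a tool):

```python
import sympy as sp

x = sp.symbols('x')
sigmoid = 1 / (1 + sp.exp(-x))

# sigmoid'(x) = sigmoid(x)(1 - sigmoid(x))
assert sp.simplify(sp.diff(sigmoid, x) - sigmoid * (1 - sigmoid)) == 0
# tanh'(x) = 1 - tanh²(x)
assert sp.simplify(sp.diff(sp.tanh(x), x) - (1 - sp.tanh(x) ** 2)) == 0
```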
Rules of Differentiation
- Linearity: (af + bg)' = af' + bg'
- Product Rule: (fg)' = f'g + fg'
- Chain Rule: (f∘g)' = f'(g(x)) · g'(x) (checked numerically in the sketch after this list)
- Quotient Rule: (f/g)' = (f'g - fg')/g²
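A minimal numerical check of the chain rule using central differences (the composite sin(x²) is an illustrative choice):

```python
import math

def numeric_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / 2h
    return (f(x + h) - f(x - h)) / (2 * h)

# Chain rule on sin(x²): d/dx sin(x²) = cos(x²) · 2x
x = 1.3
analytic = math.cos(x ** 2) * 2 * x
numeric = numeric_derivative(lambda t: math.sin(t ** 2), x)
assert abs(analytic - numeric) < 1e-6
```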
Partial Derivatives
For functions of multiple variables f(x, y, z):
∂f/∂x = Rate of change with respect to x (y, z held constant)
Gradient Vector
The gradient combines all partial derivatives:
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z]
Properties:
- Points in direction of steepest increase
- Perpendicular to level curves/surfaces
- Magnitude indicates rate of change
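A minimal sketch approximating ∇f by central differences, one partial derivative at a time (the quadratic f is an illustrative choice):

```python
import numpy as np

def f(v):
    x, y = v
    return x ** 2 + 3 * y ** 2

def numeric_gradient(f, v, h=1e-6):
    # One central difference per coordinate: ∂f/∂x_i ≈ (f(v + h·e_i) - f(v - h·e_i)) / 2h
    grad = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

print(numeric_gradient(f, np.array([1.0, 2.0])))  # ≈ [2., 12.], i.e. [2x, 6y]
```

Stepping along +∇f increases f fastest; gradient descent therefore steps along −∇f.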
Chain Rule in Deep Learning
Forward Pass
Given nested functions:
z = f(g(h(x)))
Output is computed layer by layer:
a = h(x)
b = g(a)
z = f(b)
Backward Pass (Backpropagation)
Derivatives are computed in reverse:
dz/dx = dz/db · db/da · da/dx = f'(b) · g'(a) · h'(x)
Neural Network Example
```python
# Forward pass (W1, b1, W2, b2, x, y, relu, cross_entropy assumed defined)
z1 = W1 @ x + b1              # Linear
a1 = relu(z1)                 # Activation
z2 = W2 @ a1 + b2             # Linear
loss = cross_entropy(z2, y)   # Loss

# Backward pass (chain rule): each d_* holds ∂loss/∂*
d_z2 = d_cross_entropy(z2, y)   # ∂loss/∂z2 (d_cross_entropy: hypothetical helper)
d_W2 = d_z2 @ a1.T              # ∂loss/∂W2
d_a1 = W2.T @ d_z2              # ∂loss/∂a1
d_z1 = d_a1 * (z1 > 0)          # ∂loss/∂z1; relu'(z1) is 1 where z1 > 0, else 0
d_W1 = d_z1 @ x.T               # ∂loss/∂W1
```
Optimization Fundamentals
Critical Points
Where f'(x) = 0:
- Local minimum: f''(x) > 0
- Local maximum: f''(x) < 0
- Inconclusive: f''(x) = 0 (could be an inflection point; in higher dimensions, a Hessian with mixed-sign eigenvalues marks a saddle point)
Gradient Descent
Basic update rule:
θ(t+1) = θ(t) - α∇f(θ(t))
Where:
- θ: Parameters
- α: Learning rate
- ∇f: Gradient of loss function
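A minimal sketch of this update rule on a toy one-dimensional objective (the function, learning rate, and iteration count are illustrative choices):

```python
def grad_f(theta):
    # Gradient of the toy loss f(θ) = (θ - 3)², minimized at θ = 3
    return 2 * (theta - 3)

theta = 0.0   # initial parameter
alpha = 0.1   # learning rate
for _ in range(100):
    theta -= alpha * grad_f(theta)   # θ ← θ − α·∇f(θ)

print(theta)  # ≈ 3.0
```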
Variants of Gradient Descent
- Batch GD: Use entire dataset
- Stochastic GD: Use single sample
- Mini-batch GD: Use subset of data
- Momentum: Add velocity term (sketched after this list)
- Adam: Adaptive learning rates
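A sketch of the momentum variant on the same toy objective (the hyperparameters α and β are illustrative choices):

```python
# Momentum: keep a velocity v that accumulates past gradients,
# then step along the velocity instead of the raw gradient.
theta, v = 0.0, 0.0
alpha, beta = 0.1, 0.9
for _ in range(100):
    g = 2 * (theta - 3)    # same toy gradient as above
    v = beta * v + g       # exponentially-weighted sum of gradients
    theta -= alpha * v     # parameter update

print(theta)  # ≈ 3.0; momentum helps most on ill-conditioned or noisy objectives
```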
Common Loss Functions and Derivatives
Mean Squared Error (MSE)
L = (1/n) Σ (y_i - ŷ_i)²
∂L/∂ŷ_i = (2/n)(ŷ_i - y_i)
Cross-Entropy Loss
L = -Σ y_i log(ŷ_i)
∂L/∂ŷ_i = -y_i/ŷ_i
Binary Cross-Entropy
L = -[y log(ŷ) + (1-y)log(1-ŷ)]
∂L/∂ŷ = (ŷ - y)/(ŷ(1-ŷ))
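A minimal NumPy sketch evaluating these losses and their gradients exactly as written above (the target and prediction values are illustrative):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])        # targets
y_hat = np.array([0.9, 0.2, 0.7])    # predictions in (0, 1)

# MSE: L = (1/n) Σ (y_i - ŷ_i)²
n = len(y)
mse = np.mean((y - y_hat) ** 2)
d_mse = (2 / n) * (y_hat - y)        # ∂L/∂ŷ_i

# Binary cross-entropy, element-wise as in the formula above
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
d_bce = (y_hat - y) / (y_hat * (1 - y_hat))   # ∂L/∂ŷ
```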
Activation Functions and Derivatives
ReLU
f(x) = max(0, x)
f'(x) = {1 if x > 0, 0 if x ≤ 0}  (the derivative at x = 0 is undefined; taking 0 there is the usual convention)
Sigmoid
f(x) = 1/(1 + e^(-x))
f'(x) = f(x)(1 - f(x))
Tanh
f(x) = (e^x - e^(-x))/(e^x + e^(-x))
f'(x) = 1 - f(x)²
Softmax
f(x_i) = e^(x_i) / Σ_j e^(x_j)
∂f_i/∂x_j = f_i(δ_ij - f_j)
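These functions and their derivatives translate directly to NumPy. A minimal sketch (the max-shift in softmax is a standard stability trick, discussed further under Practical Tips below):

```python
import numpy as np

def relu(x):      return np.maximum(0.0, x)
def d_relu(x):    return (x > 0).astype(float)

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):    return 1.0 - np.tanh(x) ** 2

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max(x) for numerical stability
    return e / e.sum()
```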
Automatic Differentiation
Modern frameworks compute derivatives automatically:
Computational Graph
```python
import torch

# Forward pass builds the computational graph (PyTorch shown as one example)
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y        # z = 6
w = z + x        # w = 8
loss = w ** 2    # loss = 64

# Backward pass traverses the graph in reverse, computing gradients
loss.backward()
print(x.grad)    # 2w(y + 1) = 2*8*4 = 64
print(y.grad)    # 2wx = 2*8*2 = 32
```
Benefits
- No manual derivative calculation
- Handles complex architectures
- Efficient computation
- Exact derivatives (not numerical)
Multivariate Calculus
Jacobian Matrix
For vector function f: ℝⁿ → ℝᵐ:
J = [∂f_i/∂x_j] = [ ∂f₁/∂x₁ ⋯ ∂f₁/∂xₙ ]
                  [    ⋮     ⋱     ⋮   ]
                  [ ∂fₘ/∂x₁ ⋯ ∂fₘ/∂xₙ ]
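Frameworks can assemble the Jacobian automatically. A minimal sketch using PyTorch's torch.autograd.functional.jacobian (the function f is an illustrative choice):

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # f: R³ → R², so the Jacobian is 2×3
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

J = jacobian(f, torch.tensor([1.0, 2.0, 3.0]))
print(J)  # [[2., 1., 0.], [0., 1., 6.]]
```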
Hessian Matrix
Second-order derivatives:
H = [∂²f/∂x_i∂x_j]
Uses:
- Newton's method optimization
- Analyzing convexity
- Finding saddle points
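A minimal sketch computing a Hessian with PyTorch and reading saddle-point structure off its eigenvalues (the quadratic f is an illustrative choice):

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    # Scalar-valued: f(x) = x₀² + 3·x₀·x₁ - x₁², critical point at the origin
    return x[0] ** 2 + 3 * x[0] * x[1] - x[1] ** 2

H = hessian(f, torch.tensor([0.0, 0.0]))
print(H)                         # [[2., 3.], [3., -2.]]
print(torch.linalg.eigvalsh(H))  # ≈ [-3.61, 3.61]: mixed signs ⇒ saddle point
```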
Integration in ML
Probability Distributions
P(a ≤ X ≤ b) = ∫[a,b] f(x)dx
Expected Values
E[X] = ∫ x·f(x)dx
Marginalization
p(x) = ∫ p(x,y)dy
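In practice these integrals are usually approximated by sampling rather than computed in closed form. A minimal Monte Carlo sketch for the expectation (the distribution and sample size are illustrative choices):

```python
import numpy as np

# E[X] = ∫ x·f(x) dx, approximated by averaging draws from f
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)
print(samples.mean())  # ≈ 2.0, the analytic mean of N(2, 1)
```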
Optimization Techniques
Newton's Method
x(n+1) = x(n) - H⁻¹∇f
Converges faster near the optimum, but computing and inverting the Hessian is expensive in high dimensions.
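A minimal one-dimensional sketch, where the Hessian reduces to f''(x) (the objective is an illustrative choice):

```python
def newton_minimize(grad, hess, x, steps=20):
    for _ in range(steps):
        x -= grad(x) / hess(x)   # x ← x − f'(x)/f''(x)
    return x

# Minimize f(x) = x⁴ - 3x² + 2, which has minima at x = ±√(3/2)
x_star = newton_minimize(grad=lambda x: 4 * x ** 3 - 6 * x,
                         hess=lambda x: 12 * x ** 2 - 6,
                         x=2.0)
print(x_star)  # ≈ 1.2247 = √(3/2)
```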
Conjugate Gradient
Efficient for large-scale problems; avoids forming the Hessian explicitly.
L-BFGS
Approximates the inverse Hessian from a limited history of recent gradients.
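SciPy ships an implementation. A minimal sketch using scipy.optimize.minimize with method="L-BFGS-B" (the quadratic objective is an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize

def f(theta):
    return np.sum((theta - 3.0) ** 2)

def grad_f(theta):
    return 2.0 * (theta - 3.0)

result = minimize(f, x0=np.zeros(10), jac=grad_f, method="L-BFGS-B")
print(result.x)  # ≈ [3., 3., ..., 3.]
```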
Practical Tips
Numerical Stability
- Gradient clipping: Prevent exploding gradients
- Log-sum-exp trick: Avoid overflow in softmax (see the sketch after this list)
- Batch normalization: Stabilize intermediate values
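A minimal sketch of the log-sum-exp trick applied to log-softmax (the input values are chosen to overflow the naive computation):

```python
import numpy as np

def log_softmax(x):
    # Subtract max(x) so the largest exponent is 0, preventing overflow
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

x = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(x) overflows
print(np.exp(log_softmax(x)))            # ≈ [0.090, 0.245, 0.665]
```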
Debugging Gradients
- Gradient checking: Compare analytic gradients with numerical ones (see the sketch after this list):
  grad_numerical = (f(x+ε) - f(x-ε)) / (2ε)
- Visualize gradients: Plot histograms of gradient values
- Monitor gradient norms: Track ||∇θ|| during training
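A minimal gradient-checking sketch following the central-difference formula above (the test function is an illustrative choice):

```python
import numpy as np

def gradient_check(f, grad_f, x, eps=1e-5):
    # Compare the analytic gradient against central differences
    numeric = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)
    analytic = grad_f(x)
    # Relative error; roughly 1e-7 or smaller means the gradient is likely correct
    return np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric))

f = lambda x: np.sum(x ** 3)
grad_f = lambda x: 3 * x ** 2
print(gradient_check(f, grad_f, np.array([1.0, -2.0, 0.5])))  # tiny, ≈ 1e-10
```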
Common Pitfalls
- Vanishing gradients: Deep networks with saturating activations (e.g., sigmoid)
- Exploding gradients: Large learning rates, RNNs
- Saddle points: Common in high dimensions
- Local minima: Non-convex optimization
- Numerical errors: Accumulation in long chains
Advanced Topics
Stochastic Calculus
For understanding:
- Stochastic gradient descent dynamics
- Diffusion models
- Brownian motion in optimization
Variational Calculus
For:
- Variational autoencoders
- Optimal control
- Physics-informed neural networks
Differential Geometry
For:
- Natural gradients
- Information geometry
- Manifold learning
Summary
Calculus provides the mathematical machinery for:
- Computing gradients for optimization
- Understanding how changes propagate
- Analyzing convergence and stability
- Developing new algorithms
Master these concepts to understand not just how to use ML algorithms, but why they work and how to improve them.