Calculus for Machine Learning

Essential calculus concepts for understanding gradients, optimization, and backpropagation

Why Calculus in Machine Learning?

Calculus is the mathematics of change and is essential for:

  • Optimization: Finding minima/maxima of loss functions
  • Gradient Descent: Computing direction of steepest descent
  • Backpropagation: Propagating errors through neural networks
  • Understanding Dynamics: How models change during training

Interactive Visualization

[Interactive plot: f(x) = x² - 2x + 2 with its tangent line. At x = 1.00, f(x) = 1.000 and f'(x) = 0.000.]

Interpretation: The derivative f'(x) gives the instantaneous rate of change, which is the slope of the tangent line. At x = 1 the tangent is flat, so the function has a critical point there.

Derivatives: The Foundation

What is a Derivative?

The derivative measures the rate of change of a function:

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Geometric Interpretation: Slope of the tangent line at a point

Physical Interpretation: Instantaneous rate of change
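
The limit definition can be checked numerically. The sketch below assumes an example function f(x) = x² - 2x + 2 and a hypothetical helper named forward_difference; it shrinks h and compares the difference quotient with the exact derivative 2x - 2.

```python
def f(x):
    """Example function f(x) = x^2 - 2x + 2 (chosen for illustration)."""
    return x**2 - 2 * x + 2


def forward_difference(func, x, h):
    """Difference quotient (f(x+h) - f(x)) / h from the limit definition."""
    return (func(x + h) - func(x)) / h


x0 = 3.0
exact = 2 * x0 - 2  # f'(x) = 2x - 2

errors = []
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = forward_difference(f, x0, h)
    errors.append(abs(approx - exact))
    print(f"h={h:<7} approx={approx:.6f} error={errors[-1]:.2e}")
```

For this quadratic the error shrinks in proportion to h, which is the limit in action.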

Common Derivatives

Function        Derivative
c (constant)    0
x^n             n·x^(n-1)
e^x             e^x
ln(x)           1/x
sin(x)          cos(x)
cos(x)          -sin(x)
sigmoid(x)      sigmoid(x)(1 - sigmoid(x))
tanh(x)         1 - tanh²(x)

Rules of Differentiation

  1. Linearity: (af + bg)' = af' + bg'
  2. Product Rule: (fg)' = f'g + fg'
  3. Chain Rule: (f∘g)'(x) = f'(g(x)) · g'(x)
  4. Quotient Rule: (f/g)' = (f'g - fg')/g²
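
As a worked example of these rules, the sigmoid derivative can be obtained from the chain rule. The derivation below writes sigmoid as σ (a notational assumption) and treats it as the composition of u ↦ u⁻¹ with u = 1 + e^(-x).

```latex
\begin{aligned}
\sigma(x) &= \left(1 + e^{-x}\right)^{-1} \\
\sigma'(x) &= -\left(1 + e^{-x}\right)^{-2} \cdot \left(-e^{-x}\right)
            = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
           &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
            = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\end{aligned}
```

The last step uses 1 - σ(x) = e^(-x) / (1 + e^(-x)).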

Partial Derivatives

For functions of multiple variables f(x, y, z):

∂f/∂x = Rate of change with respect to x (y, z held constant)

Gradient Vector

The gradient combines all partial derivatives:

∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z]

Properties:

  • Points in direction of steepest increase
  • Perpendicular to level curves/surfaces
  • Magnitude equals the rate of change in that direction
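
The first and third properties can be checked numerically. This sketch assumes an example function f(x, y) = x² + 3y² and scans unit directions in 1° steps; the function and helper names are illustrative.

```python
import math


def f(x, y):
    """Example scalar function f(x, y) = x^2 + 3y^2 (an assumption)."""
    return x**2 + 3 * y**2


def grad_f(x, y):
    """Analytic gradient [df/dx, df/dy] = [2x, 6y]."""
    return (2 * x, 6 * y)


def directional_derivative(x, y, ux, uy, eps=1e-6):
    """Central-difference rate of change of f along the unit vector (ux, uy)."""
    return (f(x + eps * ux, y + eps * uy) - f(x - eps * ux, y - eps * uy)) / (2 * eps)


x0, y0 = 1.0, 1.0
gx, gy = grad_f(x0, y0)
grad_norm = math.hypot(gx, gy)
grad_angle = math.atan2(gy, gx)

# Scan unit directions and keep the one with the largest rate of increase.
best_rate, best_angle = max(
    (directional_derivative(x0, y0, math.cos(t), math.sin(t)), t)
    for t in (2 * math.pi * k / 360 for k in range(360))
)

print(f"gradient angle: {math.degrees(grad_angle):.2f} deg, norm: {grad_norm:.4f}")
print(f"best scanned:   {math.degrees(best_angle):.2f} deg, rate: {best_rate:.4f}")
```

The best scanned direction lies within the 1° grid spacing of the gradient direction, and its rate of change is close to the gradient's magnitude.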

Chain Rule in Deep Learning

Forward Pass

Given nested functions:

z = f(g(h(x)))

Output is computed layer by layer:

a = h(x)
b = g(a)
z = f(b)

Backward Pass (Backpropagation)

Derivatives are computed in reverse:

dz/dx = dz/db · db/da · da/dx = f'(b) · g'(a) · h'(x)
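
A minimal sketch of the two passes, assuming three concrete scalar functions picked only for illustration (h(x) = 3x, g(a) = sin(a), f(b) = b²). The chain-rule result is compared against a central-difference estimate.

```python
import math


def h(x):
    return 3.0 * x          # h'(x) = 3


def g(a):
    return math.sin(a)      # g'(a) = cos(a)


def f(b):
    return b ** 2           # f'(b) = 2b


def forward_backward(x):
    # Forward pass: compute and keep the intermediate values.
    a = h(x)
    b = g(a)
    z = f(b)
    # Backward pass: multiply local derivatives from output back to input.
    dz_db = 2.0 * b
    db_da = math.cos(a)
    da_dx = 3.0
    return z, dz_db * db_da * da_dx


x0 = 0.5
z, dz_dx = forward_backward(x0)

eps = 1e-6
numeric = (f(g(h(x0 + eps))) - f(g(h(x0 - eps)))) / (2 * eps)
print(f"chain rule: {dz_dx:.6f}  numerical: {numeric:.6f}")
```

The backward pass reuses a and b from the forward pass, which is why frameworks cache intermediate activations.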

Neural Network Example

    # Forward pass
    z1 = W1 @ x + b1                  # Linear
    a1 = relu(z1)                     # Activation
    z2 = W2 @ a1 + b2                 # Linear
    loss = cross_entropy(z2, y)       # Loss

    # Backward pass (chain rule); d_v denotes ∂loss/∂v.
    # cross_entropy_grad and relu_grad stand for the derivatives of
    # cross_entropy and relu.
    d_z2 = cross_entropy_grad(z2, y)  # ∂cross_entropy/∂z2
    d_W2 = d_z2 @ a1.T
    d_a1 = W2.T @ d_z2
    d_z1 = d_a1 * relu_grad(z1)       # elementwise product with relu'(z1)
    d_W1 = d_z1 @ x.T
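
The same steps can be made runnable with NumPy. This is a sketch under stated assumptions: x is a single column vector, y is one-hot, cross_entropy applies softmax to the logits z2 (so ∂loss/∂z2 = softmax(z2) - y), and the layer sizes, seed, and helper names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(0.0, z)


def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # shift for stability
    return e / e.sum(axis=0, keepdims=True)


def loss_fn(params, x, y):
    """Softmax cross-entropy of a two-layer network on one column vector."""
    W1, b1, W2, b2 = params
    z2 = W2 @ relu(W1 @ x + b1) + b2
    return float(-(y * np.log(softmax(z2))).sum())


def backward(params, x, y):
    W1, b1, W2, b2 = params
    # Forward pass, keeping intermediates.
    z1 = W1 @ x + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    # Backward pass (chain rule).
    d_z2 = softmax(z2) - y       # assumes softmax + cross-entropy, one-hot y
    d_W2 = d_z2 @ a1.T
    d_a1 = W2.T @ d_z2
    d_z1 = d_a1 * (z1 > 0)       # relu'(z1)
    d_W1 = d_z1 @ x.T
    return d_W1, d_z1, d_W2, d_z2  # bias gradients equal d_z1 and d_z2


def numeric_grad(params, idx, x, y, eps=1e-6):
    """Central-difference gradient of the loss w.r.t. params[idx]."""
    P = params[idx]
    G = np.zeros_like(P)
    for i in np.ndindex(P.shape):
        old = P[i]
        P[i] = old + eps
        loss_plus = loss_fn(params, x, y)
        P[i] = old - eps
        loss_minus = loss_fn(params, x, y)
        P[i] = old
        G[i] = (loss_plus - loss_minus) / (2 * eps)
    return G


# Shapes: x (3,1), W1 (4,3), b1 (4,1), W2 (2,4), b2 (2,1), y (2,1).
x = rng.normal(size=(3, 1))
y = np.array([[1.0], [0.0]])
params = [rng.normal(size=(4, 3)), rng.normal(size=(4, 1)),
          rng.normal(size=(2, 4)), rng.normal(size=(2, 1))]

d_W1, d_b1, d_W2, d_b2 = backward(params, x, y)
print("max |analytic - numeric| for W1:",
      np.abs(d_W1 - numeric_grad(params, 0, x, y)).max())
```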

Optimization Fundamentals

Critical Points

Where f'(x) = 0:

  • Local minimum: f''(x) > 0
  • Local maximum: f''(x) < 0
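  • Inconclusive: f''(x) = 0 (the point may be a minimum, a maximum, or an inflection point such as x = 0 for f(x) = x³)
  • Saddle point (multivariable case): the Hessian has both positive and negative eigenvalues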
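
A small sketch of the second-derivative test in one variable, assuming the example f(x) = x³ - 3x, whose critical points are x = ±1 and whose second derivative is 6x. The helper classify is hypothetical; when f'' is zero the test alone does not decide, so it reports that case as inconclusive.

```python
def classify(second_derivative_value, tol=1e-12):
    """Classify a critical point from the sign of f'' at that point."""
    if second_derivative_value > tol:
        return "local minimum"
    if second_derivative_value < -tol:
        return "local maximum"
    return "inconclusive"


def f_second(x):
    """Second derivative of the example f(x) = x^3 - 3x."""
    return 6.0 * x


for critical_point in (-1.0, 1.0):
    print(critical_point, classify(f_second(critical_point)))

# g(x) = x^3 has g'(0) = 0 and g''(0) = 0: the test cannot decide.
print(0.0, classify(0.0))
```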

Gradient Descent

Basic update rule:

θ(t+1) = θ(t) - α∇f(θ(t))

Where:

  • θ: Parameters
  • α: Learning rate
  • ∇f: Gradient of loss function
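
The update rule can be sketched in a few lines. The example assumes a one-parameter loss f(θ) = θ² - 2θ + 2 with minimum at θ = 1; the starting point, learning rate, and step count are arbitrary choices.

```python
def grad_f(theta):
    """Gradient of the example loss f(theta) = theta^2 - 2*theta + 2."""
    return 2.0 * theta - 2.0


def gradient_descent(theta0, learning_rate, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad_f(theta)  # θ(t+1) = θ(t) - α∇f(θ(t))
    return theta


theta = gradient_descent(theta0=5.0, learning_rate=0.1, steps=50)
print(f"theta after 50 steps: {theta:.6f}")
```

With α = 0.1 each step multiplies the distance to the minimum by 0.8, so the iterate approaches θ = 1 geometrically.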

Variants of Gradient Descent

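  1. Batch GD: Uses the entire dataset for each update
  2. Stochastic GD: Uses a single sample per update
  3. Mini-batch GD: Uses a small subset of the data per update
  4. Momentum: Adds a velocity term that accumulates past gradients
  5. Adam: Adapts the learning rate per parameter using running estimates of the gradient's first and second moments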
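
The first four variants differ only in batch size and in whether a velocity term is kept. The sketch below assumes noiseless synthetic data from y = 2x + 1 and a hypothetical function minibatch_sgd; setting batch_size to the dataset size gives batch GD, 1 gives stochastic GD, and beta > 0 adds momentum. Adam is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0  # noiseless synthetic targets (an assumption)


def minibatch_sgd(X, y, batch_size, lr=0.1, beta=0.0, epochs=50):
    """Fit y ≈ w*x + b by minimizing MSE with (momentum) mini-batch SGD."""
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0  # velocity terms, unused when beta = 0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]
            grad_w = 2.0 * np.mean(err * X[idx])
            grad_b = 2.0 * np.mean(err)
            vw = beta * vw - lr * grad_w
            vb = beta * vb - lr * grad_b
            w, b = w + vw, b + vb
    return w, b


w_plain, b_plain = minibatch_sgd(X, y, batch_size=32)
w_mom, b_mom = minibatch_sgd(X, y, batch_size=32, beta=0.9)
print(f"mini-batch:            w={w_plain:.4f}, b={b_plain:.4f}")
print(f"mini-batch + momentum: w={w_mom:.4f}, b={b_mom:.4f}")
```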

Common Loss Functions and Derivatives

Mean Squared Error (MSE)

L = (1/n)Σ(y_i - ŷ_i)²
∂L/∂ŷ_i = (2/n)(ŷ_i - y_i)
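
A short NumPy sketch of the MSE loss and its gradient, with made-up targets and predictions, checked against a central-difference estimate for one component.

```python
import numpy as np


def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


def mse_grad(y_true, y_pred):
    """∂L/∂ŷ_i = (2/n)(ŷ_i - y_i)."""
    return 2.0 * (y_pred - y_true) / y_true.size


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])

loss = mse(y_true, y_pred)
grad = mse_grad(y_true, y_pred)

# Central-difference check on the first prediction.
eps = 1e-6
bump = np.array([eps, 0.0, 0.0])
numeric_first = (mse(y_true, y_pred + bump) - mse(y_true, y_pred - bump)) / (2 * eps)
print(loss, grad, numeric_first)
```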

Cross-Entropy Loss

L = -Σ y_i log(ŷ_i)
∂L/∂ŷ_i = -y_i/ŷ_i

Binary Cross-Entropy

L = -[y log(ŷ) + (1-y)log(1-ŷ)]
∂L/∂ŷ = (ŷ - y)/(ŷ(1-ŷ))
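
When the prediction comes from a sigmoid, ŷ = σ(z) for a logit z (an assumption about the model, common in binary classification), the chain rule cancels the denominator and leaves a simple gradient with respect to z:

```latex
\begin{aligned}
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
   = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z}
  &= \hat{y}\,(1 - \hat{y}) \\
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
   = \hat{y} - y
\end{aligned}
```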

Activation Functions and Derivatives

ReLU

f(x) = max(0, x)
f'(x) = {1 if x > 0, 0 if x ≤ 0} (the derivative is undefined at x = 0; 0 is a common convention)

Sigmoid

f(x) = 1/(1 + e^(-x))
f'(x) = f(x)(1 - f(x))

Tanh

f(x) = (e^x - e^(-x))/(e^x + e^(-x))
f'(x) = 1 - f(x)²

Softmax

f(x_i) = e^(x_i) / Σ_j e^(x_j)
∂f_i/∂x_j = f_i(δ_ij - f_j), where δ_ij = 1 if i = j and 0 otherwise
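
The softmax Jacobian formula can be written as diag(f) - f fᵀ. The NumPy sketch below, using an arbitrary input vector, builds it that way and compares it with a central-difference Jacobian.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max to avoid overflow
    return e / e.sum()


def softmax_jacobian(x):
    """J[i, j] = f_i (δ_ij - f_j), i.e. diag(f) - f fᵀ."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)


x = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(x)

# Column j of the numerical Jacobian holds ∂f/∂x_j.
eps = 1e-6
J_num = np.column_stack([
    (softmax(x + eps * e_j) - softmax(x - eps * e_j)) / (2 * eps)
    for e_j in np.eye(len(x))
])
print(np.abs(J - J_num).max())
```

Each column of J sums to zero because the softmax outputs always sum to 1.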

Automatic Differentiation

Modern frameworks compute derivatives automatically:

Computational Graph

    # Forward pass builds graph
    x = Variable(2.0)
    y = Variable(3.0)
    z = x * y        # z = 6
    w = z + x        # w = 8
    loss = w ** 2    # loss = 64

    # Backward pass computes gradients
    loss.backward()
    # x.grad = 2w(y + 1) = 2*8*4 = 64
    # y.grad = 2wx = 2*8*2 = 32
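
To show what such a graph does internally, here is a minimal scalar reverse-mode sketch in plain Python. The Variable class below is a toy written for this example, not the API of any particular framework; it supports only +, *, and ** with a constant exponent.

```python
class Variable:
    """Toy scalar autodiff node (illustrative, not a framework API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # tuples of (parent node, local derivative)

    def __add__(self, other):
        return Variable(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Variable(self.value * other.value,
                        ((self, other.value), (other, self.value)))

    def __pow__(self, exponent):
        local = exponent * self.value ** (exponent - 1)
        return Variable(self.value ** exponent, ((self, local),))

    def backward(self):
        # Order nodes so that each gradient is complete before it is
        # propagated to the node's parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad  # chain rule, summed over paths


x = Variable(2.0)
y = Variable(3.0)
z = x * y
w = z + x
loss = w ** 2
loss.backward()
print(loss.value, x.grad, y.grad)  # 64.0 64.0 32.0
```

x receives gradient along two paths (through z and directly through w), which is why contributions are accumulated with +=.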

Benefits

  • No manual derivative calculation
  • Handles complex architectures
  • Efficient computation
  • Exact derivatives up to floating-point precision (no finite-difference truncation error)

Multivariate Calculus

Jacobian Matrix

For vector function f: ℝⁿ → ℝᵐ:

J = [∂f_i/∂x_j] =

    [ ∂f₁/∂x₁  ...  ∂f₁/∂xₙ ]
    [    ⋮      ⋱      ⋮    ]
    [ ∂fₘ/∂x₁  ...  ∂fₘ/∂xₙ ]

Hessian Matrix

Matrix of second-order partial derivatives of a scalar function f: ℝⁿ → ℝ:

H = [∂²f/∂x_i∂x_j]

Uses:

  • Newton's method optimization
  • Analyzing convexity
  • Finding saddle points
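
For the convexity and saddle-point uses, the signs of the Hessian's eigenvalues are what matter. The sketch below assumes two example functions, x² + y² and x² - y², whose Hessians are constant; classify_hessian is a hypothetical helper.

```python
import numpy as np


def classify_hessian(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "inconclusive"


H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # Hessian of x^2 + y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of x^2 - y^2

print(classify_hessian(H_bowl))
print(classify_hessian(H_saddle))
```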

Integration in ML

Probability Distributions

P(a ≤ X ≤ b) = ∫[a,b] f(x)dx

Expected Values

E[X] = ∫ x·f(x)dx

Marginalization

p(x) = ∫ p(x,y)dy
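
These integrals rarely have closed forms in practice, so they are often approximated numerically. A sketch using the trapezoidal rule, assuming a normal density as the example f(x); the helper names are illustrative. The same quadrature applies to marginalization by integrating over y on a grid.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def trapezoid(func, a, b, n=10_000):
    """Approximate the integral of func over [a, b] with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for k in range(1, n):
        total += func(a + k * h)
    return total * h


# P(-1 <= X <= 1) for a standard normal X.
prob = trapezoid(normal_pdf, -1.0, 1.0)

# E[X] for a normal with mean 2, integrated over a range wide enough
# that the truncated tails are negligible.
mean = trapezoid(lambda x: x * normal_pdf(x, mu=2.0), -8.0, 12.0)

print(f"P(-1 <= X <= 1) ≈ {prob:.4f}, E[X] ≈ {mean:.4f}")
```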

Optimization Techniques

Newton's Method

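x(n+1) = x(n) - H⁻¹∇f(x(n)), with the Hessian H evaluated at x(n)

Converges quadratically near a minimum, but for d parameters storing the Hessian takes O(d²) memory and solving with it takes O(d³) time.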
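
A minimal sketch of one Newton step, assuming a quadratic objective f(x) = ½ xᵀAx - bᵀx with a hand-picked positive definite A. Its gradient is Ax - b and its Hessian is A, so a single step lands on the minimizer from any starting point.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])


def grad(x):
    return A @ x - b


def hessian(x):
    return A  # constant for a quadratic


def newton_step(x):
    # Solve H p = ∇f rather than forming H⁻¹ explicitly.
    return x - np.linalg.solve(hessian(x), grad(x))


x0 = np.array([10.0, -7.0])
x1 = newton_step(x0)
print(x1, np.linalg.norm(grad(x1)))
```

For non-quadratic functions the step is repeated until the gradient norm is small.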

Conjugate Gradient

Efficient for large-scale problems: it needs only gradients (or Hessian-vector products), never the full Hessian.

L-BFGS

Approximates the inverse Hessian from a short history of recent gradient and parameter changes.

Practical Tips

Numerical Stability

  1. Gradient clipping: Prevent exploding gradients
  2. Log-sum-exp trick: Avoid overflow in softmax
  3. Batch normalization: Stabilize intermediate values
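
The first two items can be sketched in NumPy. The helper names logsumexp and clip_by_norm are chosen for this example, and the inputs are deliberately extreme to show the effect.

```python
import numpy as np


def logsumexp(x):
    """log(sum(exp(x))) computed after subtracting the max, so exp cannot overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad


# exp(1000.0) overflows to inf in float64; the shifted version stays finite.
logits = np.array([1000.0, 1001.0, 1002.0])
log_probs = logits - logsumexp(logits)  # log of the softmax probabilities

clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
print(np.exp(log_probs), clipped)
```

Clipping by norm keeps the gradient's direction and only shrinks its length.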

Debugging Gradients

  1. Gradient checking: Compare with numerical gradients
     grad_numerical = (f(x+ε) - f(x-ε))/(2ε)
  2. Visualize gradients: Plot histogram of gradient values
  3. Monitor gradient norms: Track ||∇θ|| during training
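
A gradient check might look like the sketch below. It assumes an example function f(x) = Σ x_i³ and includes a deliberately wrong gradient to show what a failed check looks like; the function names and the relative-error measure are choices made for this example.

```python
import numpy as np


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


def relative_error(a, b):
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)


def f(x):
    return np.sum(x ** 3)


def correct_grad(x):
    return 3 * x ** 2


def buggy_grad(x):
    return 2 * x ** 2  # deliberately wrong coefficient


x = np.array([0.5, -1.0, 2.0])
num = numerical_gradient(f, x)
err_ok = relative_error(num, correct_grad(x))
err_bad = relative_error(num, buggy_grad(x))
print(f"correct gradient: {err_ok:.2e}, buggy gradient: {err_bad:.2e}")
```

A correct gradient gives a relative error many orders of magnitude smaller than a buggy one, so the two cases are easy to tell apart.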

Common Pitfalls

  1. Vanishing gradients: Deep networks, saturating activations (sigmoid, tanh)
  2. Exploding gradients: Large learning rates, RNNs
  3. Saddle points: Common in high dimensions
  4. Local minima: Non-convex optimization
  5. Numerical errors: Accumulation in long chains

Advanced Topics

Stochastic Calculus

For understanding:

  • Stochastic gradient descent dynamics
  • Diffusion models
  • Brownian motion in optimization

Variational Calculus

For:

  • Variational autoencoders
  • Optimal control
  • Physics-informed neural networks

Differential Geometry

For:

  • Natural gradients
  • Information geometry
  • Manifold learning

Summary

Calculus provides the mathematical machinery for:

  • Computing gradients for optimization
  • Understanding how changes propagate
  • Analyzing convergence and stability
  • Developing new algorithms

Master these concepts to understand not just how to use ML algorithms, but why they work and how to improve them.
