Calculus for Machine Learning

Why Calculus in Machine Learning?

Calculus is the mathematics of change and is essential for:

Optimization: Finding minima/maxima of loss functions
Gradient Descent: Computing direction of steepest descent
Backpropagation: Propagating errors through neural networks
Understanding Dynamics: How models change during training

Interactive Visualization

[Interactive plot: f(x) = x² - 2x + 2 with its tangent line at x = 1.00, where f(1.00) = 1.000 and f'(1.00) = 0.000. The derivative f'(x) gives the instantaneous rate of change; at x = 1 the tangent is flat, so the function has a critical point. Legend: function, point, tangent line, derivative vector.]

Derivatives: The Foundation

What is a Derivative?

The derivative measures the rate of change of a function:

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Geometric Interpretation: Slope of the tangent line at a point

Physical Interpretation: Instantaneous rate of change
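
The limit definition can be sketched numerically by evaluating the difference quotient for shrinking h. The function f(x) = x² - 2x + 2 and the helper name difference_quotient are illustrative choices, not part of any library.

```python
def f(x):
    return x**2 - 2*x + 2

def difference_quotient(func, x, h):
    """Forward difference (f(x+h) - f(x)) / h from the limit definition."""
    return (func(x + h) - func(x)) / h

# Exact derivative for comparison: f'(x) = 2x - 2, so f'(3) = 4 and f'(1) = 0
for h in (1e-1, 1e-3, 1e-5):
    print(h, difference_quotient(f, 3.0, h))  # approaches 4 as h shrinks
```

For this quadratic the quotient equals f'(x) + h exactly, which is why the printed error shrinks in step with h.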

Common Derivatives

Function	Derivative
c (constant)	0
x^n	nx^(n-1)
e^x	e^x
ln(x)	1/x
sin(x)	cos(x)
cos(x)	-sin(x)
sigmoid(x)	sigmoid(x)(1 - sigmoid(x))
tanh(x)	1 - tanh²(x)

Rules of Differentiation

Linearity: (af + bg)' = af' + bg'
Product Rule: (fg)' = f'g + fg'
Chain Rule: (f∘g)'(x) = f'(g(x)) · g'(x)
Quotient Rule: (f/g)' = (f'g - fg')/g², where g ≠ 0

Partial Derivatives

For functions of multiple variables f(x, y, z):

∂f/∂x = Rate of change with respect to x (y, z held constant)

Gradient Vector

The gradient combines all partial derivatives:

∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z]

Properties:

Points in direction of steepest increase
Perpendicular to level curves/surfaces
Magnitude equals the rate of change in the direction of steepest increase
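
These properties can be checked numerically. The sketch below assumes an illustrative function f(x, y) = x² + 3y² and a hypothetical helper directional_derivative; it compares the slope along the gradient with the gradient's norm, and the slope along a perpendicular direction (tangent to the level curve) with zero.

```python
import math

def f(p):
    x, y = p
    return x**2 + 3*y**2

def grad_f(p):
    x, y = p
    return [2*x, 6*y]  # analytic gradient [∂f/∂x, ∂f/∂y]

def directional_derivative(func, p, direction, eps=1e-6):
    """Central-difference slope of func at p along the normalized direction."""
    norm = math.sqrt(sum(d*d for d in direction))
    u = [d / norm for d in direction]
    plus = [pi + eps*ui for pi, ui in zip(p, u)]
    minus = [pi - eps*ui for pi, ui in zip(p, u)]
    return (func(plus) - func(minus)) / (2*eps)

p = [1.0, 2.0]
g = grad_f(p)
g_norm = math.sqrt(sum(gi*gi for gi in g))

along_gradient = directional_derivative(f, p, g)                 # close to g_norm
along_level_curve = directional_derivative(f, p, [-g[1], g[0]])  # close to 0
print(g_norm, along_gradient, along_level_curve)
```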

Chain Rule in Deep Learning

Forward Pass

Given nested functions:

z = f(g(h(x)))

Output is computed layer by layer:

a = h(x)
b = g(a)
z = f(b)

Backward Pass (Backpropagation)

Derivatives are computed in reverse:

dz/dx = dz/db · db/da · da/dx
      = f'(b) · g'(a) · h'(x)
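
As a concrete sketch, take h(x) = x², g(a) = sin(a) and f(b) = e^b. These three functions are arbitrary choices made for illustration. The forward pass stores the intermediate values, and the backward pass multiplies the local derivatives in reverse order.

```python
import math

def forward(x):
    a = x**2          # a = h(x)
    b = math.sin(a)   # b = g(a)
    z = math.exp(b)   # z = f(b)
    return a, b, z

def backward(x, a, b):
    dz_db = math.exp(b)   # f'(b)
    db_da = math.cos(a)   # g'(a)
    da_dx = 2*x           # h'(x)
    return dz_db * db_da * da_dx

x = 0.8
a, b, z = forward(x)
dz_dx = backward(x, a, b)

# Sanity check against a central finite difference
eps = 1e-6
numeric = (forward(x + eps)[2] - forward(x - eps)[2]) / (2*eps)
print(dz_dx, numeric)
```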

Neural Network Example

# Forward pass (x, y and all intermediate values are column vectors)
z1 = W1 @ x + b1             # Linear layer 1
a1 = relu(z1)                # Activation
z2 = W2 @ a1 + b2            # Linear layer 2 (logits)
loss = cross_entropy(z2, y)  # Loss

# Backward pass (chain rule); dv denotes ∂loss/∂v
dz2 = softmax(z2) - y        # assumes cross_entropy applies softmax to z2 and y is one-hot
dW2 = dz2 @ a1.T
db2 = dz2
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)         # relu'(z1)
dW1 = dz1 @ x.T
db1 = dz1
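
A runnable sketch of the same two-layer forward and backward pass, written in pure Python so it has no dependencies. It assumes that cross_entropy applies softmax to the logits and that y is one-hot; the weights, input and helper names (matvec, forward, backward) are made up for illustration.

```python
import math

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = [s + b for s, b in zip(matvec(W1, x), b1)]   # linear
    a1 = [max(0.0, v) for v in z1]                    # ReLU
    z2 = [s + b for s, b in zip(matvec(W2, a1), b2)]  # linear (logits)
    p = softmax(z2)
    loss = -sum(yi * math.log(pi) for yi, pi in zip(y, p))
    return loss, (z1, a1, p)

def backward(params, x, y, cache):
    W1, b1, W2, b2 = params
    z1, a1, p = cache
    dz2 = [pi - yi for pi, yi in zip(p, y)]           # softmax + cross-entropy
    dW2 = [[d * a for a in a1] for d in dz2]          # outer product dz2 a1^T
    da1 = [sum(W2[k][j] * dz2[k] for k in range(len(dz2)))
           for j in range(len(a1))]                   # W2^T dz2
    dz1 = [d * (1.0 if z > 0 else 0.0) for d, z in zip(da1, z1)]
    dW1 = [[d * xi for xi in x] for d in dz1]         # outer product dz1 x^T
    return dW1, dz1, dW2, dz2                         # db1 = dz1, db2 = dz2

# Hypothetical tiny network: 2 inputs, 3 hidden units, 2 classes
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.2, -0.5, 0.3], [-0.4, 0.1, 0.6]]
b2 = [0.05, -0.05]
x, y = [1.0, 2.0], [1.0, 0.0]

params = (W1, b1, W2, b2)
loss, cache = forward(params, x, y)
dW1, db1, dW2, db2 = backward(params, x, y, cache)
print(loss, dW1, dW2)
```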

Optimization Fundamentals

Critical Points

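At a critical point, where f'(x) = 0, the second derivative classifies the point:

Local minimum: f''(x) > 0
Local maximum: f''(x) < 0
Inconclusive: f''(x) = 0 (the point may be a minimum, a maximum, or an inflection point; check the sign of f'(x) on either side)

For functions of several variables, a critical point where the Hessian has both positive and negative eigenvalues is a saddle point.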

Gradient Descent

Basic update rule:

θ(t+1) = θ(t) - α∇f(θ(t))

Where:

θ: Parameters
α: Learning rate
∇f: Gradient of loss function
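
A minimal sketch of the update rule on a one-dimensional problem. The function name gradient_descent, the objective f(θ) = θ² - 2θ + 2 and the hyperparameter values are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# f(theta) = theta^2 - 2*theta + 2 has gradient 2*theta - 2 and its minimum at theta = 1
theta_star = gradient_descent(lambda t: 2*t - 2, theta0=5.0)
print(theta_star)  # close to 1
```

With these settings each step shrinks the distance to the minimum by a factor of 1 - 2α = 0.8, so a larger α converges faster only while |1 - 2α| stays below 1.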

Variants of Gradient Descent

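Batch GD: Compute the gradient over the entire dataset for each update
Stochastic GD: Use a single sample per update
Mini-batch GD: Use a small subset of the data per update
Momentum: Add a velocity term that accumulates past gradients
Adam: Adapt the learning rate per parameter using running estimates of the gradient's first and second moments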
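
The momentum variant can be sketched as follows. This is one common formulation (a velocity that decays by a factor momentum and accumulates the negative gradient); the function name and hyperparameter values are illustrative assumptions, and frameworks differ in details such as how the learning rate enters the velocity.

```python
def sgd_momentum(grad, theta0, lr=0.1, momentum=0.9, steps=300):
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        # Velocity keeps a decaying sum of past gradient steps
        velocity = momentum * velocity - lr * grad(theta)
        theta = theta + velocity
    return theta

# Same illustrative objective as plain gradient descent: minimum at theta = 1
theta_star = sgd_momentum(lambda t: 2*t - 2, theta0=5.0)
print(theta_star)  # close to 1
```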

Common Loss Functions and Derivatives

Mean Squared Error (MSE)

L = (1/n)Σ(y_i - ŷ_i)²
∂L/∂ŷ_i = (2/n)(ŷ_i - y_i)
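
A small sketch of the loss and its gradient, with made-up targets and predictions; the function names mse and mse_grad are illustrative.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p)**2 for y, p in zip(y_true, y_pred)) / n

def mse_grad(y_true, y_pred):
    """Gradient with respect to each prediction: (2/n)(ŷ_i - y_i)."""
    n = len(y_true)
    return [2.0 / n * (p - y) for y, p in zip(y_true, y_pred)]

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 1.5, 3.0]
print(mse(y_true, y_pred))       # (0.25 + 0.25 + 0) / 3
print(mse_grad(y_true, y_pred))  # positive where the prediction is too high
```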

Cross-Entropy Loss

L = -Σ y_i log(ŷ_i)
∂L/∂ŷ_i = -y_i/ŷ_i

Binary Cross-Entropy

L = -[y log(ŷ) + (1-y)log(1-ŷ)]
∂L/∂ŷ = (ŷ - y)/(ŷ(1-ŷ))
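
When ŷ is produced by a sigmoid, ŷ = sigmoid(z), the chain rule cancels the denominator and leaves ∂L/∂z = ŷ - y. The sketch below checks this numerically; the helper names and the values of z and y are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad(y, y_hat):
    """∂L/∂ŷ, valid for 0 < ŷ < 1."""
    return (y_hat - y) / (y_hat * (1 - y_hat))

z, y = 0.3, 1.0
y_hat = sigmoid(z)

# Chain rule: ∂L/∂z = ∂L/∂ŷ · ∂ŷ/∂z, with ∂ŷ/∂z = ŷ(1 - ŷ)
dL_dz = bce_grad(y, y_hat) * y_hat * (1 - y_hat)
print(dL_dz, y_hat - y)  # the two values agree
```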

Activation Functions and Derivatives

ReLU

f(x) = max(0, x)
f'(x) = {1 if x > 0, 0 if x < 0}; undefined at x = 0, where 0 is the usual convention

Sigmoid

f(x) = 1/(1 + e^(-x))
f'(x) = f(x)(1 - f(x))

Tanh

f(x) = (e^x - e^(-x))/(e^x + e^(-x))
f'(x) = 1 - f(x)²

Softmax

f_i(x) = e^(x_i) / Σ_j e^(x_j)
∂f_i/∂x_j = f_i(δ_ij - f_j)
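
The Jacobian formula can be sketched directly. The function names and the input vector are illustrative; the max subtraction inside softmax is a standard stability step that does not change the result.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_jacobian(x):
    """J[i][j] = f_i (δ_ij - f_j)."""
    f = softmax(x)
    n = len(f)
    return [[f[i] * ((1.0 if i == j else 0.0) - f[j]) for j in range(n)]
            for i in range(n)]

J = softmax_jacobian([1.0, 2.0, 3.0])
for row in J:
    print(row)
```

Each row sums to zero because the outputs always sum to 1, so shifting probability toward one class must remove it from the others.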

Automatic Differentiation

Modern frameworks compute derivatives automatically:

Computational Graph

# Forward pass builds graph
x = Variable(2.0)
y = Variable(3.0)
z = x * y        # z = 6
w = z + x        # w = 8
loss = w ** 2    # loss = 64

# Backward pass computes gradients
loss.backward()
# x.grad = 2w(y + 1) = 2*8*4 = 64
# y.grad = 2wx = 2*8*2 = 32
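
To show what happens behind backward(), here is a minimal reverse-mode sketch. This Variable class is a hypothetical toy supporting only +, * and a constant power; it is not the API of any real framework, which would add many more operations, tensors and memory management.

```python
class Variable:
    """Scalar node in a computational graph (toy implementation)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs of (parent Variable, local derivative)

    def __add__(self, other):
        return Variable(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Variable(self.value * other.value,
                        ((self, other.value), (other, self.value)))

    def __pow__(self, exponent):
        return Variable(self.value ** exponent,
                        ((self, exponent * self.value ** (exponent - 1)),))

    def backward(self):
        # Order nodes topologically so a node's gradient is complete
        # before it is propagated to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule, summed over paths

x = Variable(2.0)
y = Variable(3.0)
z = x * y
w = z + x
loss = w ** 2
loss.backward()
print(loss.value, x.grad, y.grad)  # 64.0 64.0 32.0
```

Note that x receives gradient along two paths (through z and directly through w), which is why gradients are accumulated with += rather than assigned.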

Benefits

No manual derivative calculation
Handles complex architectures
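Efficient computation: reverse mode obtains the gradient of a scalar loss with respect to all parameters in one backward pass
Exact derivatives up to floating-point rounding (no finite-difference approximation error)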

Multivariate Calculus

Jacobian Matrix

For vector function f: ℝⁿ → ℝᵐ:

J = [∂f_i/∂x_j] = 
[∂f₁/∂x₁  ...  ∂f₁/∂xₙ]
[   ⋮      ⋱      ⋮    ]
[∂fₘ/∂x₁  ...  ∂fₘ/∂xₙ]

Hessian Matrix

Matrix of second-order partial derivatives of a scalar function f: ℝⁿ → ℝ:

H = [∂²f/∂x_i∂x_j]

Uses:

Newton's method optimization
Analyzing convexity
Identifying saddle points (Hessian eigenvalues of mixed sign)
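
A sketch of how the Hessian classifies a critical point in two variables. The finite-difference helper hessian_2d, the closed-form eigenvalue helper and the two test functions are illustrative assumptions; in practice the Hessian would usually come from automatic differentiation.

```python
import math

def hessian_2d(func, x, y, eps=1e-4):
    """Central-difference Hessian of a function of two variables."""
    fxx = (func(x + eps, y) - 2*func(x, y) + func(x - eps, y)) / eps**2
    fyy = (func(x, y + eps) - 2*func(x, y) + func(x, y - eps)) / eps**2
    fxy = (func(x + eps, y + eps) - func(x + eps, y - eps)
           - func(x - eps, y + eps) + func(x - eps, y - eps)) / (4 * eps**2)
    return [[fxx, fxy], [fxy, fyy]]

def eigenvalues_sym_2x2(H):
    (a, b), (_, d) = H
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2)**2 + b**2)
    return mean - radius, mean + radius

def classify(H, tol=1e-6):
    low, high = eigenvalues_sym_2x2(H)
    if low > tol:
        return "local minimum"
    if high < -tol:
        return "local maximum"
    if low < -tol and high > tol:
        return "saddle point"
    return "inconclusive"

bowl = lambda x, y: x**2 + y**2     # critical point at the origin
saddle = lambda x, y: x**2 - y**2   # critical point at the origin

print(classify(hessian_2d(bowl, 0.0, 0.0)))
print(classify(hessian_2d(saddle, 0.0, 0.0)))
```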

Integration in ML

Probability Distributions

P(a ≤ X ≤ b) = ∫[a,b] f(x)dx, where f is the probability density function of X

Expected Values

E[X] = ∫ x·f(x)dx

Marginalization

p(x) = ∫ p(x,y)dy
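
These integrals rarely have closed forms in practice, so they are often approximated numerically. The sketch below uses a simple midpoint rule on a normal density; the helper names, the integration limits and the choice of distribution are assumptions for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * math.sqrt(2 * math.pi))

def integrate(func, a, b, n=10_000):
    """Midpoint rule with n equal subintervals."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# P(-1 <= X <= 1) for a standard normal variable
prob = integrate(normal_pdf, -1.0, 1.0)

# E[X] for a normal variable with mean 2, truncating the far tails
mean = integrate(lambda x: x * normal_pdf(x, mu=2.0), -8.0, 12.0)

print(prob, math.erf(1 / math.sqrt(2)))  # numerical value vs. closed form
print(mean)                              # close to 2
```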

Optimization Techniques

Newton's Method

x(n+1) = x(n) - H(x(n))⁻¹ ∇f(x(n))

Converges quadratically near a minimum, but forming and inverting the Hessian costs O(d²) memory and O(d³) time for d parameters.
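
In one dimension the Hessian is just f''(x), so the update reduces to x - f'(x)/f''(x). A minimal sketch, assuming the illustrative objective f(x) = e^x - 2x, whose minimum is at x = ln 2:

```python
import math

def newton_minimize(grad, hess, x0, steps=8):
    """1-D Newton iteration: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x

# f(x) = e^x - 2x, so f'(x) = e^x - 2 and f''(x) = e^x
x_min = newton_minimize(lambda x: math.exp(x) - 2,
                        lambda x: math.exp(x),
                        x0=0.0)
print(x_min, math.log(2))
```

A handful of iterations is enough here, whereas plain gradient descent would need a tuned learning rate and many more steps.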

Conjugate Gradient

Efficient for large-scale problems: it needs only gradients or Hessian-vector products, never the full Hessian.

L-BFGS

Approximates the inverse Hessian from a short history of recent gradient and parameter differences.

Practical Tips

Numerical Stability

Gradient clipping: Prevent exploding gradients
Log-sum-exp trick: Avoid overflow in softmax
Batch normalization: Stabilize intermediate values
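
The first two tips can be sketched in a few lines. The function names are illustrative, and clipping by global L2 norm is only one of several clipping schemes in use.

```python
import math

def logsumexp(x):
    """log(sum(exp(x))) computed after subtracting the maximum."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def stable_softmax(x):
    lse = logsumexp(x)
    return [math.exp(v - lse) for v in x]

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

logits = [1000.0, 1001.0, 1002.0]   # math.exp(1000.0) alone raises OverflowError
probs = stable_softmax(logits)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5 rescaled to norm 1

print(probs)
print(clipped)
```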

Debugging Gradients

Gradient checking: Compare with numerical gradients

grad_numerical = (f(x+ε) - f(x-ε))/(2ε)

Visualize gradients: Plot histogram of gradient values
Monitor gradient norms: Track ||∇θ|| during training
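
A sketch of gradient checking with the central-difference formula. The loss function, the deliberately wrong gradient buggy_grad and the helper names are made up for illustration; a relative error far above rounding level usually points to a bug in the analytic gradient.

```python
def numerical_gradient(func, x, eps=1e-6):
    """Central difference (f(x+ε) - f(x-ε)) / (2ε), one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += eps
        minus = list(x)
        minus[i] -= eps
        grad.append((func(plus) - func(minus)) / (2 * eps))
    return grad

def relative_error(a, b):
    num = sum((ai - bi)**2 for ai, bi in zip(a, b)) ** 0.5
    den = sum(ai**2 for ai in a) ** 0.5 + sum(bi**2 for bi in b) ** 0.5
    return num / den if den > 0 else 0.0

def loss(theta):
    return theta[0]**2 * theta[1] + theta[1]**3

def loss_grad(theta):
    return [2 * theta[0] * theta[1], theta[0]**2 + 3 * theta[1]**2]

def buggy_grad(theta):
    return [2 * theta[0] * theta[1], theta[0]**2]  # forgets the 3*theta[1]**2 term

theta = [1.5, -0.5]
numeric = numerical_gradient(loss, theta)
print(relative_error(numeric, loss_grad(theta)))   # tiny: gradients agree
print(relative_error(numeric, buggy_grad(theta)))  # large: bug detected
```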

Common Pitfalls

Vanishing gradients: Deep networks with saturating activations (sigmoid, tanh)
Exploding gradients: Large learning rates, RNNs
Saddle points: Common in high dimensions
Local minima: Non-convex optimization
Numerical errors: Accumulation in long chains

Advanced Topics

Stochastic Calculus

For understanding:

Stochastic gradient descent dynamics
Diffusion models
Brownian motion in optimization

Variational Calculus

For:

Variational autoencoders
Optimal control
Physics-informed neural networks

Differential Geometry

For:

Natural gradients
Information geometry
Manifold learning

Summary

Calculus provides the mathematical machinery for:

Computing gradients for optimization
Understanding how changes propagate
Analyzing convergence and stability
Developing new algorithms

Master these concepts to understand not just how to use ML algorithms, but why they work and how to improve them.
