Linear Algebra Fundamentals
Essential linear algebra concepts for machine learning with interactive visualizations
What is Linear Algebra?
Linear algebra is the branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces. It's fundamental to machine learning because:
- Data representation: Features are vectors, datasets are matrices
- Transformations: Neural networks perform linear transformations
- Optimization: Gradient descent operates in vector spaces
- Dimensionality reduction: PCA, SVD rely on linear algebra
Interactive Visualization
[Interactive Linear Algebra visualization (best viewed on desktop)]
Core Concepts
1. Scalars
A scalar is a single number. In ML contexts:
- Learning rate (α = 0.01)
- Regularization parameter (λ = 0.1)
- Individual predictions
2. Vectors
A vector is an ordered array of numbers:
# Column vector (most common in ML)
x = [x₁]
    [x₂]
    [x₃]

# Row vector
x = [x₁, x₂, x₃]
Properties:
- Dimension: Number of elements
- Magnitude: ||x|| = √(x₁² + x₂² + ... + xₙ²)
- Direction: Orientation in space
Operations:
- Addition: Element-wise addition
- Scalar multiplication: Multiply each element
- Dot product: x·y = Σ(xᵢ × yᵢ)
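As a quick sanity check, here is how these operations look in NumPy; the vectors and values below are purely illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

x + y              # element-wise addition -> [5., 7., 9.]
2.0 * x            # scalar multiplication -> [2., 4., 6.]
np.dot(x, y)       # dot product -> 32.0
np.linalg.norm(x)  # magnitude ||x|| -> √14 ≈ 3.742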
3. Matrices
A matrix is a 2D array of numbers:
A = [a₁₁ a₁₂ a₁₃]
    [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]
Properties:
- Shape: (rows, columns)
- Rank: Number of linearly independent rows/columns
- Determinant: Scalar that describes transformation scaling
Operations:
- Addition: Element-wise (same shape required)
- Multiplication: Non-commutative in general (AB ≠ BA)
- Transpose: Flip rows and columns
- Inverse: A⁻¹ such that AA⁻¹ = A⁻¹A = I (defined only for square, non-singular matrices)
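A minimal NumPy sketch of these matrix properties and operations, using small illustrative matrices:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])

A + B                      # element-wise addition (same shape required)
A @ B                      # matrix product; note A @ B != B @ A in general
A.T                        # transpose
np.linalg.inv(A)           # inverse (square, non-singular matrices only)
np.linalg.det(A)           # determinant -> -2.0
np.linalg.matrix_rank(A)   # rank -> 2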
4. Tensors
Generalization to n-dimensional arrays:
- Scalar: 0D tensor
- Vector: 1D tensor
- Matrix: 2D tensor
- 3D+ tensor: Used in deep learning (batch × height × width × channels)
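For example, NumPy arrays report their tensor order through ndim; the shapes below are illustrative, with the last one mimicking a typical image batch:

import numpy as np

scalar = np.array(3.14)               # 0D tensor, shape ()
vector = np.zeros(8)                  # 1D tensor, shape (8,)
matrix = np.zeros((8, 4))             # 2D tensor, shape (8, 4)
batch  = np.zeros((32, 224, 224, 3))  # 4D tensor: batch × height × width × channels
print(scalar.ndim, vector.ndim, matrix.ndim, batch.ndim)  # 0 1 2 4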
Key Operations for ML
Matrix Multiplication
Essential for neural network forward pass:
# Weight matrix × Input vector
y = Wx + b
# Where:
# W: weight matrix (m × n)
# x: input vector (n × 1)
# b: bias vector (m × 1)
# y: output vector (m × 1)
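A concrete NumPy version of this single-layer computation, with made-up shapes and randomly initialized weights:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix, m=3 outputs, n=4 inputs
x = rng.normal(size=4)        # input vector
b = np.zeros(3)               # bias vector

y = W @ x + b                 # output vector, shape (3,)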
Dot Product
Measures similarity between vectors:
similarity = x · y = ||x|| ||y|| cos(θ)

# Applications:
# - Cosine similarity
# - Attention mechanisms
# - Feature matching
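A small NumPy sketch of cosine similarity built from the dot product and norms; the helper name and example vectors are illustrative:

import numpy as np

def cosine_similarity(x, y):
    # cos(θ) = (x · y) / (||x|| ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

cosine_similarity(np.array([1., 0.]), np.array([1., 1.]))  # ≈ 0.7071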
Eigendecomposition
For symmetric matrix A:
A = QΛQ^T
Where:
- Q: Matrix of eigenvectors
- Λ: Diagonal matrix of eigenvalues
Applications:
- PCA (Principal Component Analysis)
- Spectral clustering
- Network analysis
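A short NumPy check of the decomposition on a small, illustrative symmetric matrix; np.linalg.eigh is the routine intended for symmetric/Hermitian inputs:

import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])            # symmetric matrix

eigenvalues, Q = np.linalg.eigh(A)  # eigenvalues and orthonormal eigenvectors
Lambda = np.diag(eigenvalues)

# Reconstruct A = Q Λ Qᵀ
np.allclose(Q @ Lambda @ Q.T, A)    # True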
Linear Transformations
Common Transformations
- Scaling: Stretch/shrink along axes
- Rotation: Rotate around origin
- Reflection: Mirror across line
- Shearing: Slant parallel to axis
- Projection: Map to lower dimension
Transformation Matrix Examples
# Scaling by factor of 2
S = [[2, 0],
     [0, 2]]

# Rotation by θ
R = [[cos(θ), -sin(θ)],
     [sin(θ),  cos(θ)]]

# Reflection across x-axis
F = [[1, 0],
     [0, -1]]
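Applying these transformations in NumPy is just a matrix-vector product; the angle and test vector below are arbitrary:

import numpy as np

theta = np.pi / 4                                 # 45° rotation
S = np.array([[2., 0.], [0., 2.]])                # scaling
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation
F = np.array([[1., 0.], [0., -1.]])               # reflection across x-axis

v = np.array([1., 0.])
R @ v                  # rotate v by 45° -> [0.7071, 0.7071]
S @ (R @ v)            # compose transformations: rotate, then scale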
Vector Spaces
Basis Vectors
A set of linearly independent vectors that span the space:
# Standard basis in R²
e₁ = [1, 0]
e₂ = [0, 1]

# Any vector can be expressed as:
v = a·e₁ + b·e₂
Subspaces
Important subspaces in ML:
- Column space: Range of possible outputs
- Null space: Inputs that map to zero
- Row space: Space of possible weights
Norms and Distances
Common Norms
# L1 norm (Manhattan distance)
||x||₁ = Σ|xᵢ|

# L2 norm (Euclidean distance)
||x||₂ = √(Σxᵢ²)

# L∞ norm (Maximum norm)
||x||∞ = max|xᵢ|
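All three norms are available through np.linalg.norm; the example vector is illustrative:

import numpy as np

x = np.array([3., -4.])

np.linalg.norm(x, 1)       # L1 norm -> 7.0
np.linalg.norm(x)          # L2 norm -> 5.0 (the default)
np.linalg.norm(x, np.inf)  # L∞ norm -> 4.0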
Applications in ML
- L1 regularization: Promotes sparsity (Lasso)
- L2 regularization: Prevents large weights (Ridge)
- Distance metrics: k-NN, clustering
Matrix Decompositions
Singular Value Decomposition (SVD)
A = UΣV^T
Applications:
- Dimensionality reduction
- Recommender systems
- Image compression
- Natural language processing
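A minimal NumPy sketch of SVD-based rank-k approximation, the idea underlying the compression and dimensionality-reduction applications above; the matrix and k are illustrative:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values -> best rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
np.linalg.matrix_rank(A_k)   # 2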
LU Decomposition
A = LU
Where L is lower triangular, U is upper triangular.
Applications:
- Solving linear systems
- Computing determinants
- Matrix inversion
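NumPy does not ship a standalone LU routine, so this sketch assumes SciPy is available and uses scipy.linalg.lu; the matrix is illustrative:

import numpy as np
from scipy.linalg import lu

A = np.array([[4., 3.],
              [6., 3.]])

P, L, U = lu(A)               # P is a permutation matrix (row pivoting)
np.allclose(P @ L @ U, A)     # True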
QR Decomposition
A = QR
Where Q is orthogonal, R is upper triangular.
Applications:
- Least squares problems
- Eigenvalue algorithms
- Gram-Schmidt process
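A small NumPy sketch of solving a least-squares problem via QR; the random overdetermined system is illustrative:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))          # tall matrix -> overdetermined system
b = rng.normal(size=10)

Q, R = np.linalg.qr(A)                # Q orthonormal (10×3), R upper triangular (3×3)
x = np.linalg.solve(R, Q.T @ b)       # least-squares solution of Ax ≈ b

np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])  # True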
Applications in Machine Learning
1. Neural Networks
# Forward propagation
z¹ = W¹x + b¹
a¹ = σ(z¹)
z² = W²a¹ + b²
y  = σ(z²)
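The same two-layer forward pass written with NumPy, using a sigmoid activation and randomly initialized, illustrative weight shapes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)    # layer 1 parameters
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)    # layer 2 parameters

z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y  = sigmoid(z2)                                 # output, shape (2,)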
2. Principal Component Analysis (PCA)
- Center the data: X - μ
- Compute covariance: C = (1/n)X^T X
- Find eigenvectors of C
- Project: X_reduced = XW_k
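A compact NumPy sketch of these four steps on synthetic data; the array sizes and the choice of k are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 samples, 5 features
k = 2                                          # components to keep

X_centered = X - X.mean(axis=0)                # 1. center the data
C = (X_centered.T @ X_centered) / len(X)       # 2. covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # 3. eigenvectors of C (symmetric)
W_k = eigenvectors[:, ::-1][:, :k]             #    top-k eigenvectors (largest eigenvalues)
X_reduced = X_centered @ W_k                   # 4. project, shape (100, 2)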
3. Gradient Descent
# Parameter update
θ = θ - α∇J(θ)
# Where:
# ∇J(θ) is the gradient (vector of partial derivatives)
# α is the learning rate (scalar)
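A toy NumPy example of this update rule on a simple quadratic objective; the objective, step count, and learning rate are made up for illustration:

import numpy as np

# Minimize J(θ) = ||θ - target||² with plain gradient descent
target = np.array([3.0, -1.0])
grad_J = lambda theta: 2.0 * (theta - target)   # ∇J(θ)

theta = np.zeros(2)
alpha = 0.1                                     # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)

theta   # ≈ [3., -1.]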
4. Attention Mechanisms
# Scaled dot-product attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
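A minimal NumPy sketch of scaled dot-product attention for a single head, with illustrative sequence lengths and dimensions:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k)
    return softmax(scores, axis=-1) @ V  # (seq_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
attention(Q, K, V).shape                 # (3, 8)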
Computational Considerations
Time Complexity
- Vector addition: O(n)
- Dot product: O(n)
- Matrix multiplication: O(n³) naive, ~O(n^2.81) with Strassen's algorithm
- Matrix inversion: O(n³)
- SVD: O(min(m²n, mn²))
Numerical Stability
- Condition number: Measure of sensitivity to input changes
- Ill-conditioned matrices: Small changes cause large effects
- Regularization: Add small values to diagonal for stability
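A short NumPy illustration of these points, using a deliberately ill-conditioned matrix and a small diagonal (ridge-style) regularizer:

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])   # nearly singular -> ill-conditioned

np.linalg.cond(A)               # ≈ 4e4; a well-conditioned matrix is close to 1

lam = 1e-3
A_reg = A + lam * np.eye(2)     # add a small value to the diagonal
np.linalg.cond(A_reg)           # orders of magnitude smaller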
Python Implementation
import numpy as np

# Vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

# Dot product
dot = np.dot(v1, v2)  # 32

# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B  # or np.matmul(A, B)

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)

# SVD
U, S, Vt = np.linalg.svd(A)

# Solve linear system Ax = b
b = np.array([1, 2])
x = np.linalg.solve(A, b)
Common Pitfalls
- Broadcasting errors: Shape mismatches in operations
- Singular matrices: No inverse exists
- Numerical precision: Floating-point errors accumulate
- Memory issues: Large matrices exhaust RAM
- Non-conformable dimensions: Invalid multiplication
Summary
Linear algebra provides the mathematical foundation for:
- Data representation and manipulation
- Model operations (forward/backward pass)
- Optimization algorithms
- Dimensionality reduction techniques
- Understanding model behavior
Master these concepts to build intuition about how machine learning algorithms work at a fundamental level.