Probability Distributions
Interactive visualizations of probability distributions used in machine learning.
Probability Distributions in ML
Probability distributions are everywhere in machine learning, from modeling data to defining loss functions. Explore how different distributions behave and where they're used.
[Interactive demo: distribution settings (sample count, mean, standard deviation) alongside a visualization of the sample histogram and the PDF]
Common Distributions
Gaussian (Normal) Distribution
The most important distribution in ML:
import numpy as np
from scipy.stats import norm

# Parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Generate samples
samples = np.random.normal(mu, sigma, 1000)

# PDF and CDF
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)
Applications:
- Weight initialization (see the sketch after this list)
- Gaussian noise for regularization
- VAE latent space
- Gaussian processes
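As a concrete example of the first application, weights can be drawn from a Gaussian with torch.nn.init; this is a minimal sketch, and the layer sizes and standard deviation are arbitrary choices for illustration:

import torch.nn as nn

# Illustrative linear layer; sizes chosen arbitrarily
layer = nn.Linear(256, 128)

# Draw weights from N(0, 0.02^2); biases are usually just zeroed
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
nn.init.zeros_(layer.bias)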
Bernoulli Distribution
Binary outcomes (success/failure):
import torch
from scipy.stats import bernoulli

# Coin flip
p = 0.7  # Probability of heads
samples = bernoulli.rvs(p, size=1000)

# Binary classification: sigmoid output is the Bernoulli parameter
logits = model(x)
probs = torch.sigmoid(logits)
Applications:
- Binary classification
- Dropout regularization (see the sketch after this list)
- Binary masks in attention
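Dropout, listed above, is essentially an element-wise Bernoulli mask followed by rescaling; a minimal sketch, with the keep probability and activation shape made up for illustration:

import torch

keep_prob = 0.8                     # Probability of keeping each unit (illustrative)
activations = torch.randn(32, 128)  # Made-up batch of activations

# Sample a Bernoulli(keep_prob) mask and rescale (inverted dropout)
mask = torch.bernoulli(torch.full_like(activations, keep_prob))
dropped = activations * mask / keep_prob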
Categorical Distribution
Multiple discrete outcomes:
import torch
import torch.nn.functional as F

# Softmax turns logits into a categorical distribution
logits = model(x)  # [batch_size, num_classes]
probs = F.softmax(logits, dim=-1)

# Sampling
samples = torch.multinomial(probs, num_samples=1)

# Cross-entropy loss (targets as one-hot vectors)
loss = -torch.sum(targets * torch.log(probs))
Applications:
- Multi-class classification
- Language modeling (next token; see the sketch after this list)
- Policy networks in RL
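For next-token prediction, the same categorical sampling is usually combined with a temperature that sharpens or flattens the distribution; a minimal sketch, with the vocabulary size, temperature, and random logits standing in for real model output:

import torch
import torch.nn.functional as F

vocab_size = 50_000   # Illustrative vocabulary size
temperature = 0.8     # < 1 sharpens, > 1 flattens the distribution

logits = torch.randn(1, vocab_size)              # Stand-in for model output
probs = F.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)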
Uniform Distribution
Equal probability over interval:
import numpy as np

# Uniform sampling
uniform_samples = np.random.uniform(0, 1, 1000)

# Xavier (Glorot) uniform initialization
fan_out, fan_in = layer.weight.shape  # PyTorch stores weights as (out_features, in_features)
limit = np.sqrt(6 / (fan_in + fan_out))
weights = np.random.uniform(-limit, limit, size=layer.weight.shape)
Applications:
- Random initialization
- Exploration in RL (see the sketch after this list)
- Data augmentation
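Exploration in RL, mentioned above, often reduces to a single uniform draw, as in epsilon-greedy action selection; a minimal sketch with made-up Q-values:

import numpy as np

epsilon = 0.1                          # Exploration rate (illustrative)
q_values = np.array([0.2, 0.5, 0.1])   # Made-up action-value estimates

# With probability epsilon pick a random action, otherwise act greedily
if np.random.uniform(0, 1) < epsilon:
    action = np.random.randint(len(q_values))
else:
    action = int(np.argmax(q_values))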
Multivariate Distributions
Multivariate Gaussian
import numpy as np

# 2D Gaussian with correlated dimensions
mean = np.array([0, 0])
cov = np.array([[1, 0.5],
                [0.5, 1]])
samples = np.random.multivariate_normal(mean, cov, 1000)

# Diagonal covariance (independent features)
cov_diag = np.diag([1, 2])
Applications:
- Gaussian mixture models
- Variational autoencoders
- Kalman filters
Sampling Methods
Inverse Transform Sampling
import numpy as np

# Sample from a custom distribution via its inverse CDF
def inverse_transform_sample(cdf_inverse, n_samples):
    u = np.random.uniform(0, 1, n_samples)
    return cdf_inverse(u)

# Example: exponential distribution (inverse CDF written out directly)
lambda_param = 2
samples = -np.log(1 - np.random.uniform(0, 1, 1000)) / lambda_param
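As a quick sanity check, samples generated this way should have a mean close to 1/lambda and agree with scipy's built-in exponential distribution:

import numpy as np
from scipy.stats import expon

lambda_param = 2
u = np.random.uniform(0, 1, 100_000)
exp_samples = -np.log(1 - u) / lambda_param

print(exp_samples.mean())                    # ~0.5, i.e. 1 / lambda_param
print(expon(scale=1 / lambda_param).mean())  # 0.5, for comparison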
Reparameterization Trick
import torch

# Backpropagate through sampling (as in a VAE encoder)
def sample_gaussian(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)  # Standard normal noise
    return mu + eps * std        # Reparameterized sample
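A quick toy check that gradients really flow back to the distribution parameters through the sample:

import torch

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)

z = sample_gaussian(mu, log_var)  # Function defined above
z.sum().backward()

print(mu.grad)       # All ones: dz/dmu = 1
print(log_var.grad)  # Gradient also reaches the noise scale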
KL Divergence
Measures how one distribution differs from another (note that KL divergence is not symmetric):
import numpy as np
import torch

# Closed-form KL divergence between two univariate Gaussians: KL(N1 || N2)
def gaussian_kl(mu1, sigma1, mu2, sigma2):
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# KL term in the VAE loss (encoder posterior vs. standard normal prior)
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
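As a sanity check, the closed form can be compared against torch.distributions, which provides KL divergence for many distribution pairs; a small illustrative comparison using the gaussian_kl function above:

import torch
from torch.distributions import Normal, kl_divergence

p = Normal(loc=0.0, scale=1.0)
q = Normal(loc=1.0, scale=2.0)

kl_torch = kl_divergence(p, q)               # KL(p || q) computed by torch
kl_manual = gaussian_kl(0.0, 1.0, 1.0, 2.0)  # Closed form defined above
print(kl_torch.item(), kl_manual)            # Both ~0.443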
Maximum Likelihood Estimation
Find parameters that maximize data likelihood:
import numpy as np
import torch

# MLE for a Gaussian
data = np.array([1.2, 2.3, 1.8, 2.1, 1.9])
mu_mle = np.mean(data)
sigma_mle = np.std(data, ddof=0)  # Population std (ddof=0) is the MLE

# Gaussian negative log-likelihood loss
def nll_loss(predictions, targets, variance):
    return 0.5 * (torch.log(2 * np.pi * variance) + (predictions - targets)**2 / variance)
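The same estimates can be recovered numerically by minimizing the negative log-likelihood, which is the route taken when no closed form exists; a small sketch using scipy.optimize (the log-sigma parameterization is one way to keep the standard deviation positive):

import numpy as np
from scipy.optimize import minimize

data = np.array([1.2, 2.3, 1.8, 2.1, 1.9])

def neg_log_likelihood(params):
    mu, log_sigma = params        # Optimize log-sigma so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (data - mu)**2 / sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # Should approach np.mean(data) and np.std(data)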
Mixture Models
Combining multiple distributions:
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Gaussian Mixture Model
gmm = GaussianMixture(n_components=3)
gmm.fit(X)

# Mixture density: weighted sum of component densities
def mixture_pdf(x, weights, means, stds):
    pdf = 0
    for w, m, s in zip(weights, means, stds):
        pdf += w * norm.pdf(x, m, s)
    return pdf
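Once fitted, the scikit-learn model exposes cluster responsibilities and a generative sampler; a brief usage sketch continuing from the gmm and X above:

# Hard assignments and per-component responsibilities
labels = gmm.predict(X)
responsibilities = gmm.predict_proba(X)

# Draw new points from the fitted mixture
X_new, component_ids = gmm.sample(100)

# Average log-likelihood of the data under the model
print(gmm.score(X))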
Common Pitfalls
- Numerical instability: Use log probabilities
import numpy as np

# Bad: product of small probabilities
prob = np.prod(probs)  # Underflow!

# Good: sum of log probabilities stays in a safe numerical range
log_prob = np.sum(np.log(probs))
- Sampling bias: Seed and manage random state so results are reproducible and samples are not skewed
- Parameter constraints: Variance must be positive, so parameterize it as a log-variance or through softplus (see the sketch after this list)
- Dimensionality: A full covariance matrix grows as O(d²) in the number of features
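For the positivity constraint above, a common pattern is to let the network output an unconstrained value and map it through exp or softplus; a minimal sketch with a made-up raw output:

import torch
import torch.nn.functional as F

raw = torch.randn(4)  # Unconstrained network output (illustrative)

# Option 1: treat the raw value as log-variance
variance = torch.exp(raw)

# Option 2: softplus stays positive and grows only linearly for large inputs
std = F.softplus(raw) + 1e-6  # Small epsilon avoids exactly zero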
Next Steps
- Explore Information Theory
- Learn about Bayes Theorem
- Study Maximum Likelihood