Probability Distributions

7 min

Interactive visualizations of probability distributions used in machine learning.

Best viewed on desktop for optimal interactive experience

Probability Distributions in ML

Probability distributions are everywhere in machine learning - from modeling data to defining loss functions. Explore how different distributions behave and where they're used.

[Interactive demo: adjust the number of samples, the mean, and the standard deviation to compare the sample histogram against the probability density function.]

Common Distributions

Gaussian (Normal) Distribution

The most important distribution in ML:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

import numpy as np
from scipy.stats import norm

# Parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Generate samples
samples = np.random.normal(mu, sigma, 1000)

# PDF and CDF
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

Applications:

  • Weight initialization (see the sketch after this list)
  • Gaussian noise for regularization
  • VAE latent space
  • Gaussian processes
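
As a concrete example of the weight-initialization use case, here is a minimal sketch of Gaussian initialization for a PyTorch linear layer; the layer sizes and the 0.02 standard deviation are illustrative choices, not values from the text above.

import torch
import torch.nn as nn

# Hypothetical layer sizes chosen for illustration
layer = nn.Linear(256, 128)

# Draw weights from N(0, 0.02^2) and zero the biases
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
nn.init.zeros_(layer.bias)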

Bernoulli Distribution

Binary outcomes (success/failure):

P(X=1) = p, \quad P(X=0) = 1-p

import torch
from scipy.stats import bernoulli

# Coin flip
p = 0.7  # Probability of heads
samples = bernoulli.rvs(p, size=1000)

# Binary classification: the sigmoid output is the Bernoulli parameter
logits = model(x)              # model and x defined elsewhere
probs = torch.sigmoid(logits)  # Bernoulli parameter p(y = 1 | x)

Applications:

  • Binary classification
  • Dropout regularization (see the sketch after this list)
  • Binary masks in attention
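
To make the dropout connection explicit, here is a minimal sketch of inverted dropout built from a Bernoulli mask; the keep probability and activation shape are illustrative assumptions, not values from the text.

import torch

def bernoulli_dropout(x, keep_prob=0.8):
    # Sample a binary mask: each unit is kept with probability keep_prob
    mask = torch.bernoulli(torch.full_like(x, keep_prob))
    # Scale by 1/keep_prob so the expected activation is unchanged
    return x * mask / keep_prob

activations = torch.randn(32, 128)  # Hypothetical batch of activations
dropped = bernoulli_dropout(activations)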

Categorical Distribution

Multiple discrete outcomes:

P(X=k) = p_k, \quad \sum_{k=1}^K p_k = 1

import torch
import torch.nn.functional as F

# Softmax turns logits into a categorical distribution
logits = model(x)  # [batch_size, num_classes]; model and x defined elsewhere
probs = F.softmax(logits, dim=-1)

# Sampling
samples = torch.multinomial(probs, num_samples=1)

# Cross-entropy loss (targets as one-hot vectors)
loss = -torch.sum(targets * torch.log(probs))

Applications:

  • Multi-class classification
  • Language modeling (next token; see the sketch after this list)
  • Policy networks in RL
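
For the language-modeling case, choosing the next token is just sampling from a categorical distribution over the vocabulary. The sketch below assumes a hypothetical next_token_logits tensor and adds an illustrative temperature parameter, which is not discussed in the text above.

import torch
import torch.nn.functional as F

def sample_next_token(next_token_logits, temperature=1.0):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = F.softmax(next_token_logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

next_token_logits = torch.randn(1, 50000)  # Hypothetical 50k-token vocabulary
token_id = sample_next_token(next_token_logits, temperature=0.8)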

Uniform Distribution

Equal probability over an interval:

p(x) = \frac{1}{b-a} \text{ for } x \in [a, b]

import numpy as np

# Uniform sampling
uniform_samples = np.random.uniform(0, 1, 1000)

# Xavier (Glorot) uniform initialization
fan_in, fan_out = layer.weight.shape  # layer defined elsewhere; the sum is symmetric in the two fans
limit = np.sqrt(6 / (fan_in + fan_out))
weights = np.random.uniform(-limit, limit, size=layer.weight.shape)

Applications:

  • Random initialization
  • Exploration in RL (see the sketch after this list)
  • Data augmentation
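
As an example of uniform sampling for exploration, epsilon-greedy action selection draws a uniform random number to decide between exploring and exploiting; q_values and epsilon below are illustrative names, not an API from the text above.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, pick a uniformly random action (explore)
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(len(q_values))
    # Otherwise pick the action with the highest estimated value (exploit)
    return int(np.argmax(q_values))

action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)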

Multivariate Distributions

Multivariate Gaussian

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)

import numpy as np

# 2D Gaussian with correlated components
mean = np.array([0, 0])
cov = np.array([[1, 0.5],
                [0.5, 1]])
samples = np.random.multivariate_normal(mean, cov, 1000)

# Diagonal covariance (independent features)
cov_diag = np.diag([1, 2])
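
To evaluate this density rather than just sample from it, one option (a minimal sketch reusing the mean and covariance from the snippet above) is scipy's multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0, 0])
cov = np.array([[1, 0.5],
                [0.5, 1]])

# Density and log-density at a query point
point = np.array([0.5, -0.2])
density = multivariate_normal.pdf(point, mean=mean, cov=cov)
log_density = multivariate_normal.logpdf(point, mean=mean, cov=cov)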

Applications:

  • Gaussian mixture models
  • Variational autoencoders
  • Kalman filters

Sampling Methods

Inverse Transform Sampling

If U is uniform on [0, 1] and F is an invertible CDF, then F⁻¹(U) has CDF F, so any distribution whose inverse CDF you can compute can be sampled this way:

import numpy as np

# Sample from a custom distribution via its inverse CDF
def inverse_transform_sample(cdf_inverse, n_samples):
    u = np.random.uniform(0, 1, n_samples)
    return cdf_inverse(u)

# Example: Exponential distribution, with inverse CDF -ln(1 - u) / lambda
lambda_param = 2
samples = -np.log(1 - np.random.uniform(0, 1, 1000)) / lambda_param

Reparameterization Trick

Writing a Gaussian sample as mu + std * eps, with eps drawn from a standard normal, moves the randomness into eps so gradients can flow through mu and std:

import torch

# Backpropagation through sampling (used in VAEs)
def sample_gaussian(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)  # Standard normal noise
    return mu + eps * std        # Reparameterized sample

KL Divergence

Measures the difference between two distributions:

D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
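
For two univariate Gaussians, the case handled by gaussian_kl in the code below, the divergence has a closed form:

D_{KL}\left(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}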

import numpy as np
import torch

# KL divergence between two univariate Gaussians
def gaussian_kl(mu1, sigma1, mu2, sigma2):
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# In the VAE loss: KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

Maximum Likelihood Estimation

Find the parameters that maximize the likelihood of the data:
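
For a Gaussian, maximizing the log-likelihood gives closed-form estimates, which is what the code below computes:

\hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^N x_i, \quad \hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^N (x_i - \hat{\mu})^2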

import numpy as np
import torch

# MLE for a Gaussian
data = np.array([1.2, 2.3, 1.8, 2.1, 1.9])
mu_mle = np.mean(data)
sigma_mle = np.std(data, ddof=0)  # Population std (MLE divides by N, not N-1)

# Negative log-likelihood loss for a Gaussian with predicted mean
def nll_loss(predictions, targets, variance):
    return 0.5 * (torch.log(2 * np.pi * variance) + (predictions - targets)**2 / variance)

Mixture Models

Combining multiple distributions:
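
With mixture weights \pi_k that sum to one (the weights argument of mixture_pdf below), the mixture density is:

p(x) = \sum_{k=1}^K \pi_k \, p_k(x), \quad \sum_{k=1}^K \pi_k = 1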

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Gaussian Mixture Model
gmm = GaussianMixture(n_components=3)
gmm.fit(X)  # X: data matrix, defined elsewhere

# Mixture density as a weighted sum of component densities
def mixture_pdf(x, weights, means, stds):
    pdf = 0
    for w, m, s in zip(weights, means, stds):
        pdf += w * norm.pdf(x, m, s)
    return pdf

Common Pitfalls

  1. Numerical instability: Use log probabilities

import numpy as np

# Bad: product of small probabilities underflows to zero
prob = np.prod(probs)  # probs defined elsewhere

# Good: sum of log probabilities stays in a representable range
log_prob = np.sum(np.log(probs))

  2. Sampling bias: Ensure proper randomness
  3. Parameter constraints: Variance must be positive (see the sketch after this list)
  4. Dimensionality: The covariance matrix grows as O(d²)
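
For the parameter-constraint pitfall, a common workaround (a minimal sketch, not from the text above) is to predict the log-variance and exponentiate, or to pass a raw output through softplus so the variance stays positive:

import torch
import torch.nn.functional as F

raw = torch.randn(8)  # Hypothetical unconstrained network output

# Option 1: interpret the output as log-variance and exponentiate
variance = torch.exp(raw)

# Option 2: softplus also keeps the value positive
variance_sp = F.softplus(raw) + 1e-6  # Small epsilon avoids exactly-zero variance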
