MSE and MAE Loss Functions
Understand Mean Squared Error (MSE) and Mean Absolute Error (MAE), the fundamental loss functions for regression tasks with different sensitivity to outliers.
MSE and MAE: Fundamental Regression Losses
Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the two most fundamental loss functions for regression tasks, each with distinct properties that make them suitable for different scenarios.
Properties Comparison
MSE Properties
- Differentiable everywhere
- Heavily penalizes large errors
- Sensitive to outliers
- Unique minimum solution
- Smooth optimization landscape
MAE Properties
- Non-differentiable at zero
- Linear penalty for all errors
- Robust to outliers
- May have multiple minima
- Median-unbiased estimator
Mean Squared Error (MSE)
Definition
MSE measures the average squared difference between predictions and targets:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
- $y_i$ is the actual value
- $\hat{y}_i$ is the predicted value
- $n$ is the number of samples
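As a worked instance of this definition, the sketch below computes the mean of squared differences term by term in plain Python. The helper name `mse_by_hand` is hypothetical; the values are the ones used in the PyTorch example later in this article.

```python
# Hypothetical helper; plain Python so each step of the arithmetic is visible.
def mse_by_hand(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    n = len(y_true)
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / n


y_true = [3.0, -0.5, 2.0, 8.0]
y_pred = [2.5, 0.0, 2.1, 7.8]

# Squared errors are 0.25, 0.25, 0.01, 0.04, which sum to 0.55.
print(round(mse_by_hand(y_true, y_pred), 4))  # 0.1375
```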
Properties
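- Quadratic Penalty: Errors are squared, so doubling an error quadruples its contribution to the loss
- Differentiable: The loss is smooth, with a gradient defined everywhere
- Unique Minimum: The loss is strictly convex in the predictions, so the minimizer is unique
- Mean-Unbiased: The constant prediction that minimizes expected squared error is the mean of the targets
- Outlier Sensitive: A few large errors can dominate the loss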
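To make the outlier sensitivity concrete, the following sketch measures what fraction of the total loss comes from the single largest error under a squared penalty and under an absolute penalty. The helper `outlier_share` and the error values are illustrative assumptions, not part of any library.

```python
def outlier_share(errors, penalty):
    """Fraction of the total loss contributed by the most costly error."""
    costs = [penalty(e) for e in errors]
    return max(costs) / sum(costs)


errors = [1.0, -1.0, 1.0, -1.0, 10.0]  # illustrative values with one large error

share_sq = outlier_share(errors, lambda e: e ** 2)  # squared penalty (MSE)
share_abs = outlier_share(errors, abs)              # absolute penalty (MAE)

print(f"{share_sq:.3f}")   # 0.962
print(f"{share_abs:.3f}")  # 0.714
```

With the squared penalty, the one large error accounts for about 96% of the loss; with the absolute penalty it accounts for about 71%.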
Gradient
The gradient with respect to predictions:

$$\frac{\partial\, \mathrm{MSE}}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i)$$
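One way to sanity-check the MSE gradient is to compare the analytic form, (2 / n) times (prediction minus target), against central finite differences. The sketch below does this in plain Python; all function names are hypothetical.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mse_grad(y_true, y_pred):
    """Analytic gradient: (2 / n) * (prediction - target) for each element."""
    n = len(y_true)
    return [2.0 * (p - y) / n for y, p in zip(y_true, y_pred)]


def numeric_grad(loss, y_true, y_pred, h=1e-6):
    """Central finite differences, perturbing one prediction at a time."""
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += h
        down[i] -= h
        grads.append((loss(y_true, up) - loss(y_true, down)) / (2 * h))
    return grads


y_true = [3.0, -0.5, 2.0, 8.0]
y_pred = [2.5, 0.0, 2.1, 7.8]

analytic = mse_grad(y_true, y_pred)
numeric = numeric_grad(mse, y_true, y_pred)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap < 1e-6)  # True
```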
Mean Absolute Error (MAE)
Definition
MAE measures the average absolute difference between predictions and targets:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Properties
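- Linear Penalty: Each error contributes in proportion to its magnitude
- Non-Differentiable: The gradient is undefined where an error is exactly zero
- Multiple Minima: The loss is convex but not strictly convex, so the minimizer may not be unique
- Median-Unbiased: The constant prediction that minimizes expected absolute error is the median of the targets
- Outlier Robust: Extreme values affect the loss less than under MSE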
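The non-unique minimum is easy to see with a constant prediction and an even number of targets: every constant between the two middle values gives the same MAE. A minimal sketch, with a hypothetical helper `mae_constant` and illustrative data:

```python
from statistics import median


def mae_constant(values, c):
    """MAE when every target in `values` is predicted by the constant c."""
    return sum(abs(v - c) for v in values) / len(values)


values = [1.0, 2.0, 4.0, 10.0]  # even count: the two middle values are 2 and 4

print(median(values))             # 3.0
print(mae_constant(values, 2.0))  # 2.75
print(mae_constant(values, 3.0))  # 2.75
print(mae_constant(values, 4.0))  # 2.75
print(mae_constant(values, 5.0))  # 3.25
```

The median (3.0) is one minimizer, but so is any other constant in the interval from 2 to 4.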
Gradient
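The gradient (subgradient at zero):

$$\frac{\partial\, \mathrm{MAE}}{\partial \hat{y}_i} = \frac{1}{n} \operatorname{sign}(\hat{y}_i - y_i)$$

When $\hat{y}_i = y_i$, any value in $[-1/n, 1/n]$ is a valid subgradient; implementations commonly use 0.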
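A small sketch of the sign-based subgradient, assuming the common convention of returning 0 where the error is exactly zero. The function name `mae_subgrad` and the data are hypothetical.

```python
def mae_subgrad(y_true, y_pred):
    """(1 / n) * sign(prediction - target); 0 is chosen where the error is zero."""
    n = len(y_true)

    def sign(x):
        return (x > 0) - (x < 0)

    return [sign(p - y) / n for y, p in zip(y_true, y_pred)]


y_true = [3.0, -0.5, 2.0, 8.0]
y_pred = [2.5, 0.0, 2.0, 9.5]  # errors: -0.5, 0.5, 0.0, 1.5

print(mae_subgrad(y_true, y_pred))  # [-0.25, 0.25, 0.0, 0.25]
```

Note that the errors 0.5 and 1.5 produce the same gradient magnitude: the size of a nonzero error does not affect the size of its gradient.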
Implementation
PyTorch Implementation
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class MSELoss(nn.Module):
        """Mean Squared Error Loss"""

        def __init__(self, reduction='mean'):
            super().__init__()
            self.reduction = reduction

        def forward(self, predictions, targets):
            squared_errors = (predictions - targets) ** 2
            if self.reduction == 'mean':
                return squared_errors.mean()
            elif self.reduction == 'sum':
                return squared_errors.sum()
            else:  # 'none'
                return squared_errors


    class MAELoss(nn.Module):
        """Mean Absolute Error Loss (L1 Loss)"""

        def __init__(self, reduction='mean'):
            super().__init__()
            self.reduction = reduction

        def forward(self, predictions, targets):
            absolute_errors = torch.abs(predictions - targets)
            if self.reduction == 'mean':
                return absolute_errors.mean()
            elif self.reduction == 'sum':
                return absolute_errors.sum()
            else:  # 'none'
                return absolute_errors


    # Using PyTorch's built-in losses
    mse_loss = nn.MSELoss()
    mae_loss = nn.L1Loss()

    # Example usage
    predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
    targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

    mse = mse_loss(predictions, targets)
    mae = mae_loss(predictions, targets)

    print(f"MSE: {mse:.4f}")  # MSE: 0.1375
    print(f"MAE: {mae:.4f}")  # MAE: 0.3250
TensorFlow/Keras Implementation
    import tensorflow as tf
    from tensorflow import keras

    # Built-in losses
    mse = keras.losses.MeanSquaredError()
    mae = keras.losses.MeanAbsoluteError()


    # Custom implementations
    def custom_mse(y_true, y_pred):
        """Custom MSE implementation"""
        return tf.reduce_mean(tf.square(y_true - y_pred))


    def custom_mae(y_true, y_pred):
        """Custom MAE implementation"""
        return tf.reduce_mean(tf.abs(y_true - y_pred))


    # In model compilation (`model` is assumed to be an existing keras.Model)
    model.compile(
        optimizer='adam',
        loss='mse',  # or 'mae', or a custom function such as custom_mse
        metrics=['mae', 'mse'],
    )
NumPy Implementation
    import numpy as np


    def mse(y_true, y_pred):
        """Mean Squared Error"""
        return np.mean((y_true - y_pred) ** 2)


    def mae(y_true, y_pred):
        """Mean Absolute Error"""
        return np.mean(np.abs(y_true - y_pred))


    def rmse(y_true, y_pred):
        """Root Mean Squared Error"""
        return np.sqrt(mse(y_true, y_pred))


    def mape(y_true, y_pred):
        """Mean Absolute Percentage Error"""
        epsilon = 1e-10  # Avoid division by zero
        # Clamp |y_true| from below so negative or zero targets are handled too
        denominator = np.maximum(np.abs(y_true), epsilon)
        return np.mean(np.abs(y_true - y_pred) / denominator) * 100


    # Example with outliers
    y_true = np.array([1, 2, 3, 4, 100])  # 100 is an outlier
    y_pred = np.array([1.1, 2.1, 2.9, 3.9, 5])

    print(f"MSE: {mse(y_true, y_pred):.2f}")  # MSE: 1805.01 (dominated by outlier)
    print(f"MAE: {mae(y_true, y_pred):.2f}")  # MAE: 19.08 (less affected)
Variants and Extensions
Root Mean Squared Error (RMSE)
RMSE provides error in the same units as the target:
    import torch
    import torch.nn.functional as F


    def rmse_loss(predictions, targets):
        mse = F.mse_loss(predictions, targets)
        return torch.sqrt(mse)
Huber Loss (Smooth L1)
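Huber loss is quadratic for errors smaller than `delta` and linear beyond it, combining the smoothness of MSE near zero with the outlier robustness of MAE:

    import torch
    import torch.nn as nn


    class HuberLoss(nn.Module):
        """Huber loss: quadratic for small errors, linear for large ones"""

        def __init__(self, delta=1.0):
            super().__init__()
            self.delta = delta

        def forward(self, predictions, targets):
            abs_errors = torch.abs(predictions - targets)
            # Split |error| into the part up to delta (penalized quadratically)
            # and the excess beyond delta (penalized linearly)
            quadratic = torch.clamp(abs_errors, max=self.delta)
            linear = abs_errors - quadratic
            return torch.mean(0.5 * quadratic ** 2 + self.delta * linear)


    # Using PyTorch's built-ins
    huber_loss = nn.HuberLoss(delta=1.0)
    # nn.SmoothL1Loss(beta=1.0) gives the same values when beta == delta == 1
    smooth_l1 = nn.SmoothL1Loss(beta=1.0)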
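The forward pass above splits each absolute error into a capped part and an excess part. The sketch below checks, in plain Python and for single errors, that this split gives the same value as the usual piecewise definition of the Huber penalty. Both function names are hypothetical.

```python
def huber(error, delta=1.0):
    """Piecewise Huber penalty for a single error."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


def huber_split(error, delta=1.0):
    """Same penalty, written as a capped part plus the excess beyond delta."""
    a = abs(error)
    quadratic = min(a, delta)
    linear = a - quadratic
    return 0.5 * quadratic ** 2 + delta * linear


for e in [0.5, 1.0, 3.0]:
    print(e, huber(e), huber_split(e))
# 0.5 0.125 0.125
# 1.0 0.5 0.5
# 3.0 2.5 2.5
```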
Log-Cosh Loss
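A smooth approximation of MAE that behaves like half the squared error for small errors and like the absolute error (minus a constant) for large ones:

    import math

    import torch
    import torch.nn.functional as F


    def log_cosh_loss(predictions, targets):
        """Log-cosh loss: smooth and robust"""
        errors = predictions - targets
        # log(cosh(x)) = x + softplus(-2x) - log(2); this form avoids the
        # overflow of torch.cosh for large |x|
        return torch.mean(errors + F.softplus(-2.0 * errors) - math.log(2.0))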
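Evaluating log(cosh(x)) directly can overflow for large |x|, because cosh grows exponentially. The sketch below, in plain Python with hypothetical function names, compares the direct form with an algebraically equivalent one that stays finite.

```python
import math


def log_cosh_naive(x):
    return math.log(math.cosh(x))


def log_cosh_stable(x):
    """log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


# Both forms agree where the naive one is safe to evaluate.
print(abs(log_cosh_naive(2.0) - log_cosh_stable(2.0)) < 1e-12)  # True

try:
    log_cosh_naive(1000.0)
except OverflowError:
    print("naive form overflows")  # math.cosh(1000.0) exceeds the float range

# For large |x| the loss is close to |x| - log(2), the MAE-like regime.
print(round(log_cosh_stable(1000.0), 4))  # 999.3069
```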
When to Use Each Loss
Use MSE When:
- Large errors are particularly bad: Financial predictions, safety-critical systems
- Smooth optimization needed: The gradient shrinks as the error shrinks, which suits gradient-based optimizers
- Gaussian noise assumed: Minimizing MSE is maximum likelihood estimation when errors follow a normal distribution
- Mean estimation desired: Want to predict expected value
Use MAE When:
- Outliers are present: Noisy real-world data
- All errors equally important: Cost should grow in proportion to error size, not faster
- Median estimation preferred: Robust central tendency
- Interpretability matters: Error in original units
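The mean-versus-median distinction can be illustrated with the simplest possible model, a single constant prediction. In the sketch below (hypothetical helper names, illustrative data with one outlier), the constant that minimizes MSE is the mean and the one that minimizes MAE is the median.

```python
from statistics import mean, median


def mse_constant(values, c):
    return sum((v - c) ** 2 for v in values) / len(values)


def mae_constant(values, c):
    return sum(abs(v - c) for v in values) / len(values)


targets = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100 is the outlier

best_for_mse = mean(targets)    # pulled far toward the outlier
best_for_mae = median(targets)  # stays with the bulk of the data

print(best_for_mse)  # 22.0
print(best_for_mae)  # 3.0
```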
Practical Considerations
Outlier Handling
    import torch
    import torch.nn as nn


    def robust_training(model, data_loader, epochs=100):
        """Train with per-batch outlier detection and an adaptive loss"""
        mse_criterion = nn.MSELoss(reduction='none')
        mae_criterion = nn.L1Loss(reduction='none')
        optimizer = torch.optim.Adam(model.parameters())

        for epoch in range(epochs):
            for inputs, targets in data_loader:
                predictions = model(inputs)

                # Calculate per-sample losses (no reduction)
                mse_losses = mse_criterion(predictions, targets)
                mae_losses = mae_criterion(predictions, targets)

                # Detect outliers (top 10% of squared errors in this batch);
                # detach() keeps the threshold out of the autograd graph
                threshold = torch.quantile(mse_losses.detach(), 0.9)
                is_outlier = mse_losses > threshold

                # Use MAE for outliers, MSE for normal samples
                loss = torch.where(is_outlier, mae_losses, mse_losses).mean()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
Multi-Scale Losses
    import torch.nn as nn


    class MultiScaleLoss(nn.Module):
        """Combine MSE and MAE at different scales"""

        def __init__(self, mse_weight=0.7, mae_weight=0.3, local_weight=0.1):
            super().__init__()
            self.mse_weight = mse_weight
            self.mae_weight = mae_weight
            self.local_weight = local_weight
            self.mse = nn.MSELoss()
            self.mae = nn.L1Loss()

        def forward(self, predictions, targets):
            # Global loss
            global_mse = self.mse(predictions, targets)
            global_mae = self.mae(predictions, targets)

            # Local loss: differences between consecutive entries along
            # dimension 0, which assumes the samples are ordered
            pred_diff = predictions[1:] - predictions[:-1]
            target_diff = targets[1:] - targets[:-1]
            local_mse = self.mse(pred_diff, target_diff)

            return (self.mse_weight * global_mse
                    + self.mae_weight * global_mae
                    + self.local_weight * local_mse)
Weighted Losses
    import torch


    def weighted_mse_loss(predictions, targets, weights):
        """MSE with sample weights (averaged over the number of samples)"""
        squared_errors = (predictions - targets) ** 2
        weighted_errors = squared_errors * weights
        return weighted_errors.mean()


    def importance_weighted_mae(predictions, targets, importance):
        """MAE with importance weighting (normalized by the total weight)"""
        absolute_errors = torch.abs(predictions - targets)
        weighted_errors = absolute_errors * importance
        return weighted_errors.sum() / importance.sum()
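The two functions above normalize differently: one divides by the number of samples, the other by the total weight. The sketch below, in plain Python with hypothetical names and illustrative numbers, shows the practical consequence: rescaling all weights changes the first quantity but not the second.

```python
def weighted_mse_mean(errors, weights):
    """Sum of weight * squared error, divided by the number of samples."""
    return sum(w * e ** 2 for e, w in zip(errors, weights)) / len(errors)


def weighted_mse_normalized(errors, weights):
    """Sum of weight * squared error, divided by the total weight."""
    return sum(w * e ** 2 for e, w in zip(errors, weights)) / sum(weights)


errors = [1.0, 2.0, 3.0]
weights = [1.0, 1.0, 2.0]
scaled = [10.0 * w for w in weights]  # same relative importance, 10x larger

# Dividing by the sample count: the loss scales with the weights.
ratio = weighted_mse_mean(errors, scaled) / weighted_mse_mean(errors, weights)
print(round(ratio, 6))  # 10.0

# Dividing by the total weight: the loss is unchanged.
print(weighted_mse_normalized(errors, weights))  # 5.75
print(weighted_mse_normalized(errors, scaled))   # 5.75
```

Which normalization is preferable depends on whether the overall weight scale is meant to act like a learning-rate multiplier.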
Gradient Behavior
MSE and MAE Gradient Properties

    import torch


    def analyze_mse_gradient(error_range=(-5, 5)):
        """Analyze MSE gradient behavior"""
        errors = torch.linspace(error_range[0], error_range[1], 100)

        # Per-sample MSE gradient: 2 * error
        mse_gradients = 2 * errors

        # Gradient magnitude increases linearly with error
        return errors, mse_gradients


    def analyze_mae_gradient(error_range=(-5, 5)):
        """Analyze MAE gradient behavior"""
        errors = torch.linspace(error_range[0], error_range[1], 100)

        # Per-sample MAE gradient: sign(error)
        mae_gradients = torch.sign(errors)

        # Constant gradient magnitude (except at 0)
        return errors, mae_gradients
Optimization Landscape
Convergence Comparison
    import torch
    import torch.nn.functional as F


    def compare_convergence(X, y, learning_rates=(0.01, 0.1, 1.0), steps=100):
        """Compare MSE vs MAE convergence for a linear model"""

        def run(loss_fn, lr):
            # Plain gradient descent from a random start
            w = torch.randn(X.shape[1], requires_grad=True)
            losses = []
            for _ in range(steps):
                loss = loss_fn(X @ w, y)
                loss.backward()
                with torch.no_grad():
                    w -= lr * w.grad
                    w.grad.zero_()
                losses.append(loss.item())
            return losses

        return {
            lr: {'mse': run(F.mse_loss, lr), 'mae': run(F.l1_loss, lr)}
            for lr in learning_rates
        }
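The difference in convergence behavior already shows up with a single scalar prediction. In the sketch below (plain Python, hypothetical names, a fixed learning rate chosen for illustration), the MSE gradient shrinks with the error and the iterate converges, while the MAE gradient has constant magnitude and the iterate ends up bouncing around the target.

```python
def descend(grad, start, lr, steps):
    """Plain gradient descent on a single scalar prediction."""
    p = start
    for _ in range(steps):
        p -= lr * grad(p)
    return p


target = 0.0


def mse_grad(p):
    return 2.0 * (p - target)  # shrinks as the error shrinks


def mae_grad(p):
    return (p > target) - (p < target)  # always -1, 0, or +1


p_mse = descend(mse_grad, start=1.05, lr=0.1, steps=100)
p_mae = descend(mae_grad, start=1.05, lr=0.1, steps=100)

print(abs(p_mse) < 1e-6)              # True: converges toward the target
print(abs(abs(p_mae) - 0.05) < 1e-9)  # True: oscillates about 0.05 from it
```

With a fixed step size, MAE cannot get closer than a fraction of one step; a decaying learning rate is a common remedy.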
Common Pitfalls
1. Scale Sensitivity
    import torch
    from sklearn.preprocessing import StandardScaler

    # Problem: Features on very different scales affect MSE disproportionately
    features = torch.tensor([[1.0, 1000.0], [2.0, 2000.0]])
    targets = torch.tensor([1.5, 2.5])

    # Solution: Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    normalized = scaler.fit_transform(features.numpy())  # returns a NumPy array
    normalized_features = torch.from_numpy(normalized).float()
2. Gradient Explosion with MSE
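    # Problem: Large errors cause gradient explosion
    large_error = 1000.0
    mse_gradient = 2 * large_error  # 2000!

    # Solution: Gradient clipping, called after loss.backward() and before
    # optimizer.step(); `model` is assumed to be an existing nn.Module
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)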
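To show what norm-based clipping does to a gradient, here is a simplified sketch in plain Python. It illustrates the idea only and is not PyTorch's implementation; the function name `clip_by_global_norm` and the gradient values are assumptions for the example.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


grads = [2000.0, -1500.0]  # illustrative values; the L2 norm is 2500
clipped = clip_by_global_norm(grads, max_norm=1.0)

print([round(g, 6) for g in clipped])  # [0.8, -0.6]
```

The direction of the gradient is preserved; only its length is reduced.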
3. MAE Non-Smoothness
    import torch


    # Problem: MAE gradient undefined at zero
    # Solution: Use a quadratic segment near zero
    def smooth_l1_loss(predictions, targets, beta=1.0):
        """Smooth approximation of L1 loss"""
        diff = predictions - targets
        abs_diff = torch.abs(diff)
        return torch.where(
            abs_diff < beta,
            0.5 * diff ** 2 / beta,
            abs_diff - 0.5 * beta,
        ).mean()
Evaluation Metrics
Beyond Training Loss
    import torch
    import torch.nn.functional as F


    def comprehensive_evaluation(predictions, targets):
        """Calculate multiple regression metrics"""
        mse = F.mse_loss(predictions, targets)
        mae = F.l1_loss(predictions, targets)
        rmse = torch.sqrt(mse)

        # R-squared
        ss_res = torch.sum((targets - predictions) ** 2)
        ss_tot = torch.sum((targets - targets.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot)

        # Mean Absolute Percentage Error; clamp |targets| from below so that
        # zero or negative targets do not cause a division by zero
        denominator = targets.abs().clamp_min(1e-10)
        mape = torch.mean(torch.abs(targets - predictions) / denominator) * 100

        return {
            'MSE': mse.item(),
            'MAE': mae.item(),
            'RMSE': rmse.item(),
            'R²': r2.item(),
            'MAPE': mape.item(),
        }
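The R-squared computation above is easier to interpret with a few reference points. The sketch below (plain Python, hypothetical function name, illustrative data) evaluates it for perfect predictions, for always predicting the target mean, and for predictions that are worse than the mean.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, following the formula used above."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, y_true))                # 1.0  (perfect predictions)
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always the mean)
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```

R-squared is undefined when all targets are equal, since the denominator is then zero.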
Related Concepts
- Focal Loss - For imbalanced classification
- Contrastive Loss - For representation learning
- KL Divergence - For distribution matching
- Cross-Entropy Loss - For classification
- Gradient Descent - Optimization with these losses