MSE and MAE Loss Functions

Understand Mean Squared Error (MSE) and Mean Absolute Error (MAE), the two fundamental loss functions for regression, and how they differ in their sensitivity to outliers.


MSE and MAE: Fundamental Regression Losses

Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the two most fundamental loss functions for regression tasks, each with distinct properties that make them suitable for different scenarios.

Interactive Loss Comparison

Explore how MSE and MAE respond differently to errors and outliers:

[Interactive widget: a per-point table of actual vs. predicted values with their squared and absolute errors, plus running MSE (penalizes large errors), MAE (robust to outliers), and RMSE (same units as the target).]
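As a static stand-in for the interactive demo, the following minimal NumPy sketch (the sample values are made up for illustration) prints the same per-point breakdown of squared and absolute errors together with MSE, MAE, and RMSE:

import numpy as np

# Hypothetical sample points, chosen only to illustrate the per-point breakdown
actual    = np.array([3.0, -0.5, 2.0, 8.0])
predicted = np.array([2.5,  0.0, 2.1, 7.8])

errors = actual - predicted
print("point | actual | predicted | error | squared | absolute")
for i, (a, p, e) in enumerate(zip(actual, predicted, errors), start=1):
    print(f"{i:5d} | {a:6.1f} | {p:9.1f} | {e:5.1f} | {e**2:7.2f} | {abs(e):8.2f}")

mse = np.mean(errors ** 2)      # penalizes large errors
mae = np.mean(np.abs(errors))   # robust to outliers
rmse = np.sqrt(mse)             # same units as the target
print(f"MSE={mse:.4f}  MAE={mae:.4f}  RMSE={rmse:.4f}")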

Properties Comparison

MSE Properties

  • Differentiable everywhere
  • Heavily penalizes large errors
  • Sensitive to outliers
  • Unique minimum solution
  • Smooth optimization landscape

MAE Properties

  • Non-differentiable at zero
  • Linear penalty for all errors
  • Robust to outliers
  • May have multiple minima
  • Median-unbiased estimator

Mean Squared Error (MSE)

Definition

MSE measures the average squared difference between predictions and targets:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Where:

  • yᵢ is the actual value
  • ŷᵢ is the predicted value
  • n is the number of samples

Properties

  1. Quadratic Penalty: Errors are squared, heavily penalizing large deviations
  2. Differentiable: Smooth gradient everywhere
  3. Unique Minimum: Convex optimization landscape
  4. Mean-Seeking: Minimizing expected squared error yields the (conditional) mean
  5. Outlier Sensitive: Large errors dominate the loss

Gradient

The gradient with respect to predictions:

\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)
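As a quick sanity check, not part of the original derivation, the analytical gradient can be compared against PyTorch autograd on a few made-up values:

import torch

# Verify that autograd reproduces the analytical MSE gradient (2/n) * (ŷ - y)
y_hat = torch.tensor([2.5, 0.0, 2.1, 7.8], requires_grad=True)
y = torch.tensor([3.0, -0.5, 2.0, 8.0])

mse = torch.mean((y - y_hat) ** 2)
mse.backward()

analytical = (2.0 / y.numel()) * (y_hat.detach() - y)
print(torch.allclose(y_hat.grad, analytical))  # True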

Mean Absolute Error (MAE)

Definition

MAE measures the average absolute difference between predictions and targets:

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Properties

  1. Linear Penalty: All errors weighted equally
  2. Non-Differentiable: Gradient undefined at zero
  3. Multiple Minima: May have non-unique solutions
  4. Median-Seeking: Minimizing expected absolute error yields the (conditional) median
  5. Outlier Robust: Less affected by extreme values

Gradient

The gradient (subgradient at zero):

\frac{\partial\,\text{MAE}}{\partial \hat{y}_i} = \frac{1}{n} \cdot \text{sign}(\hat{y}_i - y_i)
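The same kind of check works for the MAE subgradient (again a sketch, not from the original text); since every error in this example is nonzero, autograd returns exactly sign(ŷ - y)/n:

import torch

# Autograd on MAE yields the subgradient sign(ŷ - y) / n when no error is exactly zero
y_hat = torch.tensor([2.5, 0.0, 2.1, 7.8], requires_grad=True)
y = torch.tensor([3.0, -0.5, 2.0, 8.0])

mae = torch.mean(torch.abs(y_hat - y))
mae.backward()

subgradient = torch.sign(y_hat.detach() - y) / y.numel()
print(torch.allclose(y_hat.grad, subgradient))  # True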

Implementation

PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


class MSELoss(nn.Module):
    """Mean Squared Error Loss"""

    def __init__(self, reduction='mean'):
        super().__init__()
        self.reduction = reduction

    def forward(self, predictions, targets):
        squared_errors = (predictions - targets) ** 2
        if self.reduction == 'mean':
            return squared_errors.mean()
        elif self.reduction == 'sum':
            return squared_errors.sum()
        else:  # 'none'
            return squared_errors


class MAELoss(nn.Module):
    """Mean Absolute Error Loss (L1 Loss)"""

    def __init__(self, reduction='mean'):
        super().__init__()
        self.reduction = reduction

    def forward(self, predictions, targets):
        absolute_errors = torch.abs(predictions - targets)
        if self.reduction == 'mean':
            return absolute_errors.mean()
        elif self.reduction == 'sum':
            return absolute_errors.sum()
        else:  # 'none'
            return absolute_errors


# Using PyTorch's built-in losses
mse_loss = nn.MSELoss()
mae_loss = nn.L1Loss()

# Example usage
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

mse = mse_loss(predictions, targets)
mae = mae_loss(predictions, targets)

print(f"MSE: {mse:.4f}")  # MSE: 0.1375
print(f"MAE: {mae:.4f}")  # MAE: 0.3250

TensorFlow/Keras Implementation

import tensorflow as tf
from tensorflow import keras

# Built-in losses
mse = keras.losses.MeanSquaredError()
mae = keras.losses.MeanAbsoluteError()

# Custom implementations
def custom_mse(y_true, y_pred):
    """Custom MSE implementation"""
    return tf.reduce_mean(tf.square(y_true - y_pred))

def custom_mae(y_true, y_pred):
    """Custom MAE implementation"""
    return tf.reduce_mean(tf.abs(y_true - y_pred))

# In model compilation
model.compile(
    optimizer='adam',
    loss='mse',  # or 'mae', or a custom function
    metrics=['mae', 'mse']
)

NumPy Implementation

import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error"""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error"""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error"""
    return np.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error"""
    epsilon = 1e-10  # Avoid division by zero
    return np.mean(np.abs((y_true - y_pred) / (y_true + epsilon))) * 100

# Example with outliers
y_true = np.array([1, 2, 3, 4, 100])  # 100 is an outlier
y_pred = np.array([1.1, 2.1, 2.9, 3.9, 5])

print(f"MSE: {mse(y_true, y_pred):.2f}")  # MSE: 1805.01 (dominated by the outlier)
print(f"MAE: {mae(y_true, y_pred):.2f}")  # MAE: 19.08 (much less affected)

Variants and Extensions

Root Mean Squared Error (RMSE)

RMSE provides error in the same units as the target:

\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
def rmse_loss(predictions, targets):
    mse = F.mse_loss(predictions, targets)
    return torch.sqrt(mse)

Huber Loss (Smooth L1)

Combines the benefits of MSE and MAE: quadratic for small errors, linear for large ones:

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}
class HuberLoss(nn.Module):
    """Huber loss - robust to outliers while remaining differentiable"""

    def __init__(self, delta=1.0):
        super().__init__()
        self.delta = delta

    def forward(self, predictions, targets):
        errors = predictions - targets
        abs_errors = torch.abs(errors)
        quadratic = torch.min(abs_errors, torch.tensor(self.delta))
        linear = abs_errors - quadratic
        return torch.mean(0.5 * quadratic ** 2 + self.delta * linear)

# Using PyTorch's built-in (SmoothL1Loss with beta=1.0 matches Huber with delta=1.0)
huber_loss = nn.SmoothL1Loss()

Log-Cosh Loss

A smooth, MAE-like loss: log(cosh(x)) behaves like x²/2 for small errors and like |x| − log 2 for large ones:

L(y, \hat{y}) = \sum_{i=1}^{n} \log\big(\cosh(\hat{y}_i - y_i)\big)
def log_cosh_loss(predictions, targets):
    """Log-cosh loss - smooth and robust"""
    errors = predictions - targets
    return torch.mean(torch.log(torch.cosh(errors)))

When to Use Each Loss

Use MSE When:

  1. Large errors are particularly bad: Financial predictions, safety-critical systems
  2. Smooth optimization needed: Gradient-based optimization
  3. Gaussian noise assumed: Errors follow normal distribution
  4. Mean estimation desired: Want to predict expected value

Use MAE When:

  1. Outliers are present: Noisy real-world data
  2. All errors equally important: No preference for error magnitude
  3. Median estimation preferred: Robust central tendency
  4. Interpretability matters: Error in original units
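To make the mean-versus-median distinction in the two lists above concrete, here is a minimal sketch on synthetic, skewed data: the constant prediction that minimizes MSE lands near the sample mean, while the one that minimizes MAE lands near the sample median.

import numpy as np

# Skewed synthetic data: two large values pull the mean well above the median
y = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 10.0, 12.0])

candidates = np.linspace(0, 15, 3001)  # constant predictions to try
mse_best = candidates[np.argmin([np.mean((y - c) ** 2) for c in candidates])]
mae_best = candidates[np.argmin([np.mean(np.abs(y - c)) for c in candidates])]

print(f"mean={y.mean():.2f}, median={np.median(y):.2f}")   # mean=3.89, median=1.10
print(f"MSE-optimal constant ≈ {mse_best:.2f}")            # ≈ the mean
print(f"MAE-optimal constant ≈ {mae_best:.2f}")            # ≈ the median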

Practical Considerations

Outlier Handling

def robust_training(model, data_loader, epochs=100):
    """Train with outlier detection and an adaptive loss"""
    mse_criterion = nn.MSELoss(reduction='none')
    mae_criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters())

    for epoch in range(epochs):
        for inputs, targets in data_loader:
            predictions = model(inputs)

            # Calculate per-sample MSE
            mse_losses = mse_criterion(predictions, targets)

            # Detect outliers (top 10% of errors)
            threshold = torch.quantile(mse_losses, 0.9)
            is_outlier = mse_losses > threshold

            # Use MAE for outliers, MSE for normal samples
            loss = torch.where(
                is_outlier,
                torch.abs(predictions - targets),
                mse_losses
            ).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Multi-Scale Losses

class MultiScaleLoss(nn.Module):
    """Combine MSE and MAE at different scales"""

    def __init__(self, mse_weight=0.7, mae_weight=0.3):
        super().__init__()
        self.mse_weight = mse_weight
        self.mae_weight = mae_weight
        self.mse = nn.MSELoss()
        self.mae = nn.L1Loss()

    def forward(self, predictions, targets):
        # Global loss
        global_mse = self.mse(predictions, targets)
        global_mae = self.mae(predictions, targets)

        # Local loss (first differences)
        pred_diff = predictions[1:] - predictions[:-1]
        target_diff = targets[1:] - targets[:-1]
        local_mse = self.mse(pred_diff, target_diff)

        return (self.mse_weight * global_mse +
                self.mae_weight * global_mae +
                0.1 * local_mse)

Weighted Losses

def weighted_mse_loss(predictions, targets, weights):
    """MSE with sample weights"""
    squared_errors = (predictions - targets) ** 2
    weighted_errors = squared_errors * weights
    return weighted_errors.mean()

def importance_weighted_mae(predictions, targets, importance):
    """MAE with importance weighting"""
    absolute_errors = torch.abs(predictions - targets)
    weighted_errors = absolute_errors * importance
    return weighted_errors.sum() / importance.sum()

Gradient Behavior

MSE Gradient Properties

def analyze_mse_gradient(error_range=(-5, 5)):
    """Analyze MSE gradient behavior"""
    errors = torch.linspace(error_range[0], error_range[1], 100)
    # MSE gradient: 2 * error
    mse_gradients = 2 * errors
    # Gradient magnitude increases linearly with the error
    return errors, mse_gradients

def analyze_mae_gradient(error_range=(-5, 5)):
    """Analyze MAE gradient behavior"""
    errors = torch.linspace(error_range[0], error_range[1], 100)
    # MAE gradient: sign(error)
    mae_gradients = torch.sign(errors)
    # Constant gradient magnitude (except at 0)
    return errors, mae_gradients

Optimization Landscape

Convergence Comparison

def compare_convergence(X, y, learning_rates=[0.01, 0.1, 1.0]):
    """Compare MSE vs MAE convergence"""
    results = {}

    for lr in learning_rates:
        # MSE optimization
        w_mse = torch.randn(X.shape[1], requires_grad=True)
        mse_losses = []
        for _ in range(100):
            pred = X @ w_mse
            loss = F.mse_loss(pred, y)
            loss.backward()
            with torch.no_grad():
                w_mse -= lr * w_mse.grad
                w_mse.grad.zero_()
            mse_losses.append(loss.item())

        # MAE optimization
        w_mae = torch.randn(X.shape[1], requires_grad=True)
        mae_losses = []
        for _ in range(100):
            pred = X @ w_mae
            loss = F.l1_loss(pred, y)
            loss.backward()
            with torch.no_grad():
                w_mae -= lr * w_mae.grad
                w_mae.grad.zero_()
            mae_losses.append(loss.item())

        results[lr] = {
            'mse': mse_losses,
            'mae': mae_losses
        }

    return results

Common Pitfalls

1. Scale Sensitivity

# Problem: Different feature scales affect MSE disproportionately
features = torch.tensor([[1.0, 1000.0], [2.0, 2000.0]])
targets = torch.tensor([1.5, 2.5])

# Solution: Normalize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normalized_features = scaler.fit_transform(features.numpy())

2. Gradient Explosion with MSE

# Problem: Large errors cause gradient explosion
large_error = 1000.0
mse_gradient = 2 * large_error  # 2000!

# Solution: Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

3. MAE Non-Smoothness

# Problem: The MAE gradient is undefined at zero
def smooth_l1_loss(predictions, targets, beta=1.0):
    """Smooth approximation of the L1 loss"""
    diff = predictions - targets
    abs_diff = torch.abs(diff)
    return torch.where(
        abs_diff < beta,
        0.5 * diff ** 2 / beta,
        abs_diff - 0.5 * beta
    ).mean()

Evaluation Metrics

Beyond Training Loss

def comprehensive_evaluation(predictions, targets):
    """Calculate multiple regression metrics"""
    mse = F.mse_loss(predictions, targets)
    mae = F.l1_loss(predictions, targets)
    rmse = torch.sqrt(mse)

    # R-squared
    ss_res = torch.sum((targets - predictions) ** 2)
    ss_tot = torch.sum((targets - targets.mean()) ** 2)
    r2 = 1 - (ss_res / ss_tot)

    # Mean Absolute Percentage Error
    mape = torch.mean(torch.abs((targets - predictions) / (targets + 1e-10))) * 100

    return {
        'MSE': mse.item(),
        'MAE': mae.item(),
        'RMSE': rmse.item(),
        'R²': r2.item(),
        'MAPE': mape.item()
    }
