Cross-Entropy Loss: The Foundation of Classification

Understand cross-entropy loss through interactive visualizations of probability distributions, gradient flow, and its connection to maximum likelihood estimation.


Understanding Cross-Entropy Loss

Cross-entropy loss is the cornerstone of modern classification models, measuring the mismatch between a predicted probability distribution and the true labels. Rooted in information theory, it quantifies the "surprise" incurred when events drawn from the true distribution are encoded using a model that differs from it.

This loss function's elegant mathematical properties—smooth gradients, probabilistic interpretation, and connection to maximum likelihood—make it the default choice for training neural networks on classification tasks.

Interactive Cross-Entropy Explorer

Visualize how cross-entropy loss behaves with different probability distributions and class predictions:

The explorer links four panels for a three-class example (Cat, Dog, Bird), with Cat as the true class:

  • Logits (Raw Scores): adjustable raw scores z for each class
  • Probability Distribution: the softmax of the logits; here p(Cat) ≈ 0.629
  • Loss Calculation: cross-entropy loss = -log(p_Cat) = -log(0.629) ≈ 0.4644; only the true class contributes (Cat: 0.4644, Dog: 0.0000, Bird: 0.0000)
  • Gradients & Updates: ∂L/∂z = p - y, where y = 1 for the true class and 0 otherwise, giving Cat: -0.3715, Dog: +0.2312, Bird: +0.1402

Temperature Scaling

  • T < 1: Sharper distribution (more confident)
  • T = 1: Standard softmax
  • T > 1: Softer distribution (less confident)
  • Used in knowledge distillation

Label Smoothing

  • Prevents overconfidence
  • Improves generalization
  • True class: 1 - ε + ε/K
  • Other classes: ε/K

Mathematical Formulation

Softmax Function:
p_i = exp(z_i/T) / Σ_j exp(z_j/T)
Cross-Entropy Loss:
L = -Σ_i y_i * log(p_i)
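
For reference, the numbers in the interactive explorer above correspond to logits of roughly [2.0, 1.0, 0.5] with Cat as the true class. A minimal sketch, assuming those logits and T = 1, reproduces them:

import numpy as np

def softmax_ce(logits, true_idx, T=1.0):
    # Temperature-scaled softmax followed by cross-entropy on the true class
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p, -np.log(p[true_idx])

p, loss = softmax_ce([2.0, 1.0, 0.5], true_idx=0)
print(p)     # ≈ [0.6285, 0.2312, 0.1402]
print(loss)  # ≈ 0.4644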

Information Theory Foundation

Entropy

Entropy measures the average information content or uncertainty in a distribution:

H(p) = -Σ_i p_i log(p_i)

Where:

  • p_i is the probability of event i
  • Higher entropy = more uncertainty
  • Uniform distribution has maximum entropy
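
A minimal numeric sketch (using natural logarithms, consistent with the NumPy code later in the article) makes these properties concrete:

import numpy as np

def entropy(p):
    # H(p) = -Σ_i p_i log(p_i); zero-probability entries contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy([1/3, 1/3, 1/3]))    # ≈ 1.0986 = log(3), the maximum for 3 outcomes
print(entropy([0.9, 0.05, 0.05]))  # ≈ 0.3944, a more certain (lower-entropy) distribution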

Cross-Entropy

Cross-entropy measures the average number of bits needed to encode events from distribution p using a coding scheme optimized for distribution q:

H(p, q) = -Σ_i p_i log(q_i)

Key insight: H(p, q) ≥ H(p) with equality when p = q

KL Divergence

The Kullback-Leibler divergence measures the "extra bits" needed:

D_KL(p ∥ q) = H(p, q) - H(p) = Σ_i p_i log(p_i / q_i)
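
The three quantities are easy to compute side by side. The distributions below are arbitrary illustrative values, not taken from the explorer:

import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

H_p  = -np.sum(p * np.log(p))   # entropy H(p)
H_pq = -np.sum(p * np.log(q))   # cross-entropy H(p, q)
kl   = H_pq - H_p               # D_KL(p || q)

print(H_p, H_pq, kl)  # ≈ 0.8018, 0.8869, 0.0851; note H(p, q) ≥ H(p)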

Cross-Entropy Loss in Machine Learning

Binary Cross-Entropy

For binary classification with y ∈ {0, 1} and predicted probability ŷ ∈ (0, 1):

ℒ = -y log(ŷ) - (1-y) log(1-ŷ)

Python implementation:

import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Calculate loss
    loss = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    return np.mean(loss)

# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.6, 0.3])
loss = binary_cross_entropy(y_true, y_pred)
print(f"Binary CE Loss: {loss:.4f}")

Categorical Cross-Entropy

For multi-class classification with one-hot encoded labels:

ℒ = -Σ_i y_i log(ŷ_i)

Where:

  • C is the number of classes
  • y_i is 1 for the true class, 0 otherwise
  • ŷ_i is the predicted probability for class i
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # y_true: one-hot encoded labels [batch_size, num_classes]
    # y_pred: predicted probabilities [batch_size, num_classes]
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Calculate loss for each sample
    loss = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(loss)

# Example
y_true = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])  # Classes: 1, 0, 2
y_pred = np.array([[0.2, 0.7, 0.1], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]])
loss = categorical_cross_entropy(y_true, y_pred)

Sparse Categorical Cross-Entropy

When labels are integers instead of one-hot vectors:

def sparse_categorical_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # y_true: integer labels [batch_size]
    # y_pred: predicted probabilities [batch_size, num_classes]
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Extract probability of true class for each sample
    batch_size = y_true.shape[0]
    loss = -np.log(y_pred[np.arange(batch_size), y_true])
    return np.mean(loss)

Softmax and Cross-Entropy

The Softmax Function

Converts raw logits to probabilities:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Properties:

  • Output sums to 1
  • Preserves order (monotonic)
  • Differentiable everywhere
def softmax(logits):
    # Subtract max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

# Stable log-softmax computation
def log_softmax(logits):
    max_logits = np.max(logits, axis=-1, keepdims=True)
    return logits - max_logits - np.log(np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True))

Combined Softmax + Cross-Entropy

For numerical stability, combine softmax and CE:

ℒ = -log(exp(z_y) / Σ_j exp(z_j)) = -z_y + log Σ_j exp(z_j)
def softmax_cross_entropy(logits, labels):
    # Compute log-sum-exp term
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(logits - max_logits), axis=-1))
    log_sum_exp += max_logits.squeeze()
    # Extract logits for true classes
    batch_size = labels.shape[0]
    true_class_logits = logits[np.arange(batch_size), labels]
    # Compute loss
    loss = -true_class_logits + log_sum_exp
    return np.mean(loss)

Gradients and Backpropagation

Gradient of Softmax Cross-Entropy

The gradient with respect to logits has an elegant form:

∂ℒ/∂z_i = p_i - y_i

Where:

  • p_i = softmax(z_i) is the predicted probability
  • y_i is the true label (one-hot)

This beautiful result shows the gradient is simply the difference between prediction and target!

def softmax_cross_entropy_backward(logits, labels):
    # Forward pass
    probs = softmax(logits)
    # Backward pass - gradient w.r.t. logits
    grad = probs.copy()
    # Subtract 1 from true class probabilities
    batch_size = labels.shape[0]
    if len(labels.shape) == 1:
        # Sparse labels
        grad[np.arange(batch_size), labels] -= 1
    else:
        # One-hot labels
        grad -= labels
    # Average over batch
    grad /= batch_size
    return grad

Binary Cross-Entropy Gradient

For binary classification with sigmoid activation:

∂ℒ/∂z = σ(z) - y
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy_backward(logits, labels):
    probs = sigmoid(logits)
    return (probs - labels) / labels.shape[0]

Label Smoothing

Regularization technique that prevents overconfidence:

y'_i = (1 - ε) · y_i + ε/C

Where ε is the smoothing parameter (typically 0.1).

def label_smoothing(labels, epsilon=0.1, num_classes=None):
    if len(labels.shape) == 1:
        # Convert sparse to one-hot
        num_classes = num_classes or labels.max() + 1
        one_hot = np.eye(num_classes)[labels]
    else:
        one_hot = labels
        num_classes = labels.shape[1]
    # Apply smoothing
    smooth_labels = one_hot * (1 - epsilon) + epsilon / num_classes
    return smooth_labels

# Example
labels = np.array([0, 1, 2])  # Sparse labels
smooth = label_smoothing(labels, epsilon=0.1, num_classes=3)
print(smooth)
# [[0.9333 0.0333 0.0333]
#  [0.0333 0.9333 0.0333]
#  [0.0333 0.0333 0.9333]]  (true class: 1 - ε + ε/K ≈ 0.9333)

Focal Loss

Addresses class imbalance by down-weighting easy examples:

ℒ_focal = -α_t (1 - p_t)^γ log(p_t)

Where:

  • p_t is the model's estimated probability for the true class
  • γ is the focusing parameter (typically 2)
  • α_t is the weighting factor
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # For binary classification
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Calculate focal loss
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    loss = -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
    return np.mean(loss)

Connection to Maximum Likelihood

Cross-entropy loss is equivalent to negative log-likelihood under the categorical distribution:

ℒ = -log P(y | x; θ) = -log Π_i p_i^(y_i) = -Σ_i y_i log(p_i)

This connects to maximum likelihood estimation (MLE):

def negative_log_likelihood(predictions, labels):
    # Same as categorical cross-entropy!
    return categorical_cross_entropy(labels, predictions)

# MLE seeks to minimize NLL, which is equivalent to minimizing CE

Temperature Scaling

Control the "sharpness" of the softmax distribution:

softmax(z_i / T) = exp(z_i / T) / Σ_j exp(z_j / T)
def softmax_with_temperature(logits, temperature=1.0):
    return softmax(logits / temperature)

# Examples
logits = np.array([2.0, 1.0, 0.1])

# High temperature (T=5) → more uniform
soft_probs = softmax_with_temperature(logits, 5.0)
print(f"T=5.0: {soft_probs}")  # ≈ [0.400, 0.327, 0.273]

# Low temperature (T=0.1) → more peaked
sharp_probs = softmax_with_temperature(logits, 0.1)
print(f"T=0.1: {sharp_probs}")  # ≈ [1.000, 0.000, 0.000]

Multi-Label Classification

When samples can belong to multiple classes:

ℒ = -(1/C) Σ_i [y_i log(σ(z_i)) + (1 - y_i) log(1 - σ(z_i))]
def multi_label_cross_entropy(logits, labels):
    # Apply sigmoid to each logit independently
    probs = sigmoid(logits)
    # Binary cross-entropy for each class
    loss = -(labels * np.log(probs + 1e-7) + (1 - labels) * np.log(1 - probs + 1e-7))
    return np.mean(loss)

Numerical Stability Considerations

Log-Sum-Exp Trick

Prevent overflow in softmax computation:

def stable_log_sum_exp(x):
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Use in cross-entropy computation
def stable_cross_entropy(logits, labels):
    log_sum_exp = stable_log_sum_exp(logits)
    log_probs = logits - log_sum_exp
    # Extract log probability of true class
    true_class_log_prob = log_probs[labels]
    return -true_class_log_prob

Gradient Clipping

Prevent exploding gradients:

def clip_gradients(gradients, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

# Binary cross-entropy
criterion = nn.BCEWithLogitsLoss()  # Includes sigmoid
loss = criterion(logits, targets)

# Multi-class cross-entropy
criterion = nn.CrossEntropyLoss()  # Includes softmax
loss = criterion(logits, targets)  # targets are class indices

# With label smoothing
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)
        # Standard cross-entropy
        loss = F.nll_loss(log_probs, targets, reduction='none')
        # Label smoothing: add uniform distribution
        smooth_loss = -log_probs.mean(dim=-1)
        loss = (1 - self.epsilon) * loss + self.epsilon * smooth_loss
        return loss.mean()
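
As a sanity check, the custom class above should agree with PyTorch's built-in label smoothing (the label_smoothing argument of nn.CrossEntropyLoss, available in PyTorch 1.10+). A minimal comparison, continuing the snippet above with randomly generated logits and integer targets:

torch.manual_seed(0)
logits = torch.randn(4, 5)            # batch of 4 samples, 5 classes
targets = torch.tensor([1, 0, 3, 2])  # integer class indices

custom = LabelSmoothingCrossEntropy(epsilon=0.1)(logits, targets)
builtin = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
print(custom.item(), builtin.item())  # the two losses should match up to floating-point error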

Visualization Tips

Loss Landscape

def visualize_loss_landscape(true_label=1, n_classes=3):
    import matplotlib.pyplot as plt

    probs = np.linspace(0.001, 0.999, 100)
    plt.figure(figsize=(10, 6))

    # Binary CE
    if n_classes == 2:
        loss_true = -np.log(probs)
        loss_false = -np.log(1 - probs)
        plt.plot(probs, loss_true, label='True class (y=1)')
        plt.plot(probs, loss_false, label='False class (y=0)')
    else:
        # Multi-class CE
        for i in range(n_classes):
            if i == true_label:
                loss = -np.log(probs)
            else:
                # Assuming uniform distribution over other classes
                loss = -np.log((1 - probs) / (n_classes - 1))
            plt.plot(probs, loss, label=f'Class {i}')

    plt.xlabel('Predicted Probability')
    plt.ylabel('Loss')
    plt.title('Cross-Entropy Loss Landscape')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 5)
    plt.show()

Common Pitfalls

  1. Numerical Instability: Always use combined softmax + CE implementations (see the sketch after this list)
  2. Wrong Reduction: Ensure proper averaging over batch and classes
  3. Label Format: Match loss function to label format (one-hot vs sparse)
  4. Class Imbalance: Consider weighted CE or focal loss
  5. Probability Clipping: Prevent log(0) errors
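
A small sketch of pitfall 1: the naive formulation overflows for large logits, while the fused log-sum-exp form from the numerical stability section stays finite (the logit values here are arbitrary):

import numpy as np

logits = np.array([1000.0, 10.0, -5.0])
true_class = 0

# Naive: exp(1000) overflows to inf, so the loss becomes nan
naive = -np.log(np.exp(logits[true_class]) / np.sum(np.exp(logits)))
print(naive)  # nan (with overflow warnings)

# Stable: subtract the max logit before exponentiating, then use -z_y + logsumexp(z)
m = logits.max()
stable = m + np.log(np.sum(np.exp(logits - m))) - logits[true_class]
print(stable)  # ≈ 0.0, the correct loss (the true class dominates)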

Understanding cross-entropy loss connects to:

  • Softmax & Attention: Softmax normalization
  • Gradient Flow: Backpropagation through CE
  • Information Theory: Entropy, mutual information
  • Optimization: Loss landscape navigation
  • Calibration: Confidence vs accuracy alignment

Conclusion

Cross-entropy loss elegantly bridges information theory and machine learning, providing a principled way to train classification models. Its smooth gradients, probabilistic interpretation, and numerical stability make it the foundation of modern deep learning. Understanding its mathematical properties and implementation details is crucial for building and debugging classification systems.

If you found this explanation helpful, consider sharing it with others.
