Cross-Entropy Loss: The Foundation of Classification
Understand cross-entropy loss through interactive visualizations of probability distributions, gradient flow, and its connection to maximum likelihood estimation.
Understanding Cross-Entropy Loss
Cross-entropy loss is the cornerstone of modern classification models: it measures the mismatch between a model's predicted probability distribution and the true label distribution. Rooted in information theory, it quantifies the average "surprise" incurred when events drawn from the true distribution are scored by a model distribution that differs from it.
This loss function's elegant mathematical properties—smooth gradients, probabilistic interpretation, and connection to maximum likelihood—make it the default choice for training neural networks on classification tasks.
Interactive Cross-Entropy Explorer
Visualize how cross-entropy loss behaves with different probability distributions and class predictions:
The explorer steps through four linked views: Logits (Raw Scores) → Probability Distribution → Loss Calculation → Gradients & Updates.
Temperature Scaling
- T < 1: sharper distribution (more confident)
- T = 1: standard softmax
- T > 1: softer distribution (less confident)
- Used in knowledge distillation
Label Smoothing
- Prevents overconfidence
- Improves generalization
- True class target: 1 - ε + ε/K
- Other classes: ε/K
Mathematical Formulation
Softmax Function:
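$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$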
Cross-Entropy Loss:
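$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i), \quad \hat{y}_i = \text{softmax}(z_i)$$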
Information Theory Foundation
Entropy
Entropy measures the average information content or uncertainty in a distribution:
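$$H(p) = -\sum_{i} p_i \log p_i$$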
Where:
- p_i is the probability of event i
- Higher entropy = more uncertainty
- Uniform distribution has maximum entropy
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode events from distribution p using a coding scheme optimized for distribution q:
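$$H(p, q) = -\sum_{x} p(x) \log q(x)$$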
Key insight: H(p, q) ≥ H(p) with equality when p = q
KL Divergence
The Kullback-Leibler divergence measures the "extra bits" needed:
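$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

As a quick numerical check of the identity H(p, q) = H(p) + D_KL(p ‖ q), the sketch below computes all three quantities in nats (natural logarithm, the convention used by ML loss functions); the distributions p and q are arbitrary example values:

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i), in nats
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (example values)
q = np.array([0.5, 0.3, 0.2])  # model distribution (example values)

print(entropy(p))                        # ~0.8018
print(cross_entropy(p, q))               # ~0.8869
print(entropy(p) + kl_divergence(p, q))  # ~0.8869, matches H(p, q)
```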
Cross-Entropy Loss in Machine Learning
Binary Cross-Entropy
For binary classification with y ∈ {0, 1} and predicted probability ŷ:
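$$\mathcal{L}_{BCE} = -\big[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\big]$$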
Python implementation:
```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Calculate loss
    loss = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    return np.mean(loss)

# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.6, 0.3])
loss = binary_cross_entropy(y_true, y_pred)
print(f"Binary CE Loss: {loss:.4f}")
```
Categorical Cross-Entropy
For multi-class classification with one-hot encoded labels:
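$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$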
Where:
- C is the number of classes
- y_i is 1 for the true class, 0 otherwise
- ŷ_i is the predicted probability for class i
```python
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # y_true: one-hot encoded labels [batch_size, num_classes]
    # y_pred: predicted probabilities [batch_size, num_classes]
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Calculate loss for each sample
    loss = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(loss)

# Example
y_true = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])  # Classes: 1, 0, 2
y_pred = np.array([[0.2, 0.7, 0.1], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]])
loss = categorical_cross_entropy(y_true, y_pred)
```
Sparse Categorical Cross-Entropy
When labels are integers instead of one-hot vectors:
```python
def sparse_categorical_cross_entropy(y_true, y_pred, epsilon=1e-7):
    # y_true: integer labels [batch_size]
    # y_pred: predicted probabilities [batch_size, num_classes]
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Extract probability of true class for each sample
    batch_size = y_true.shape[0]
    loss = -np.log(y_pred[np.arange(batch_size), y_true])
    return np.mean(loss)
```
Softmax and Cross-Entropy
The Softmax Function
Converts raw logits to probabilities:
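$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$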
Properties:
- Output sums to 1
- Preserves order (monotonic)
- Differentiable everywhere
```python
def softmax(logits):
    # Subtract max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

# Stable log-softmax computation
def log_softmax(logits):
    max_logits = np.max(logits, axis=-1, keepdims=True)
    return logits - max_logits - np.log(np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True))
```
Combined Softmax + Cross-Entropy
For numerical stability, combine softmax and CE:
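Substituting the softmax into the loss and simplifying gives a log-sum-exp form that never materializes the probabilities:

$$\mathcal{L} = -z_y + \log \sum_{j=1}^{C} e^{z_j}$$

where z_y is the logit of the true class.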
```python
def softmax_cross_entropy(logits, labels):
    # Compute log-sum-exp term
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(logits - max_logits), axis=-1))
    log_sum_exp += max_logits.squeeze()
    # Extract logits for true classes
    batch_size = labels.shape[0]
    true_class_logits = logits[np.arange(batch_size), labels]
    # Compute loss
    loss = -true_class_logits + log_sum_exp
    return np.mean(loss)
```
Gradients and Backpropagation
Gradient of Softmax Cross-Entropy
The gradient with respect to logits has an elegant form:
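$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i$$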
Where:
- p_i = softmax(z_i) is the predicted probability
- y_i is the true label (one-hot)
This beautiful result shows the gradient is simply the difference between prediction and target!
```python
def softmax_cross_entropy_backward(logits, labels):
    # Forward pass
    probs = softmax(logits)

    # Backward pass - gradient w.r.t. logits
    grad = probs.copy()

    # Subtract 1 from true class probabilities
    batch_size = labels.shape[0]
    if len(labels.shape) == 1:
        # Sparse labels
        grad[np.arange(batch_size), labels] -= 1
    else:
        # One-hot labels
        grad -= labels

    # Average over batch
    grad /= batch_size
    return grad
```
Binary Cross-Entropy Gradient
For binary classification with sigmoid activation:
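$$\frac{\partial \mathcal{L}}{\partial z} = \sigma(z) - y$$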
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy_backward(logits, labels):
    probs = sigmoid(logits)
    return (probs - labels) / labels.shape[0]
```
Label Smoothing
Regularization technique that prevents overconfidence:
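$$\tilde{y}_i = (1 - \epsilon)\, y_i + \frac{\epsilon}{K}$$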
Where ε is the smoothing parameter (typically 0.1) and K is the number of classes.
```python
def label_smoothing(labels, epsilon=0.1, num_classes=None):
    if len(labels.shape) == 1:
        # Convert sparse to one-hot
        num_classes = num_classes or labels.max() + 1
        one_hot = np.eye(num_classes)[labels]
    else:
        one_hot = labels
        num_classes = labels.shape[1]

    # Apply smoothing
    smooth_labels = one_hot * (1 - epsilon) + epsilon / num_classes
    return smooth_labels

# Example
labels = np.array([0, 1, 2])  # Sparse labels
smooth = label_smoothing(labels, epsilon=0.1, num_classes=3)
print(smooth)
# [[0.93333333 0.03333333 0.03333333]
#  [0.03333333 0.93333333 0.03333333]
#  [0.03333333 0.03333333 0.93333333]]
```
Focal Loss
Addresses class imbalance by down-weighting easy examples:
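$$\text{FL}(p_t) = -\alpha_t\, (1 - p_t)^{\gamma} \log(p_t)$$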
Where:
- p_t is the model's estimated probability for the true class
- γ is the focusing parameter (typically 2)
- α_t is the weighting factor
```python
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # For binary classification
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Calculate focal loss
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    loss = -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
    return np.mean(loss)
```
Connection to Maximum Likelihood
Cross-entropy loss is equivalent to negative log-likelihood under the categorical distribution:
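For a dataset of N labeled examples (x_n, y_n):

$$\mathcal{L}_{CE}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n \mid x_n)$$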
This connects to maximum likelihood estimation (MLE):
```python
def negative_log_likelihood(predictions, labels):
    # Same as categorical cross-entropy!
    return categorical_cross_entropy(labels, predictions)

# MLE maximizes the likelihood, i.e. minimizes the NLL, which is exactly the CE loss
```
Temperature Scaling
Control the "sharpness" of the softmax distribution:
```python
def softmax_with_temperature(logits, temperature=1.0):
    return softmax(logits / temperature)

# Examples
logits = np.array([2.0, 1.0, 0.1])

# High temperature (T=5) → more uniform
soft_probs = softmax_with_temperature(logits, 5.0)
print(f"T=5.0: {soft_probs}")  # ~[0.400, 0.327, 0.273]

# Low temperature (T=0.1) → more peaked
sharp_probs = softmax_with_temperature(logits, 0.1)
print(f"T=0.1: {sharp_probs}")  # ~[1.000, 0.000, 0.000]
```
Multi-Label Classification
When samples can belong to multiple classes:
```python
def multi_label_cross_entropy(logits, labels):
    # Apply sigmoid to each logit independently
    probs = sigmoid(logits)
    # Binary cross-entropy for each class
    loss = -(labels * np.log(probs + 1e-7) + (1 - labels) * np.log(1 - probs + 1e-7))
    return np.mean(loss)
```
Numerical Stability Considerations
Log-Sum-Exp Trick
Prevent overflow in softmax computation:
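$$\log \sum_{i} e^{x_i} = m + \log \sum_{i} e^{x_i - m}, \quad m = \max_i x_i$$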
```python
def stable_log_sum_exp(x):
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Use in cross-entropy computation (single example: 1-D logits, integer label)
def stable_cross_entropy(logits, labels):
    log_sum_exp = stable_log_sum_exp(logits)
    log_probs = logits - log_sum_exp
    # Extract log probability of true class
    true_class_log_prob = log_probs[labels]
    return -true_class_log_prob
```
Gradient Clipping
Prevent exploding gradients:
```python
def clip_gradients(gradients, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients
```
PyTorch Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Binary cross-entropy
criterion = nn.BCEWithLogitsLoss()  # Includes sigmoid
loss = criterion(logits, targets)

# Multi-class cross-entropy
criterion = nn.CrossEntropyLoss()  # Includes softmax
loss = criterion(logits, targets)  # targets are class indices

# With label smoothing
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)

        # Standard cross-entropy
        loss = F.nll_loss(log_probs, targets, reduction='none')

        # Label smoothing: add uniform distribution
        smooth_loss = -log_probs.mean(dim=-1)
        loss = (1 - self.epsilon) * loss + self.epsilon * smooth_loss
        return loss.mean()
```
Visualization Tips
Loss Landscape
```python
def visualize_loss_landscape(true_label=1, n_classes=3):
    import matplotlib.pyplot as plt

    probs = np.linspace(0.001, 0.999, 100)
    plt.figure(figsize=(10, 6))

    if n_classes == 2:
        # Binary CE
        loss_true = -np.log(probs)
        loss_false = -np.log(1 - probs)
        plt.plot(probs, loss_true, label='True class (y=1)')
        plt.plot(probs, loss_false, label='False class (y=0)')
    else:
        # Multi-class CE
        for i in range(n_classes):
            if i == true_label:
                loss = -np.log(probs)
            else:
                # Assuming uniform distribution over other classes
                loss = -np.log((1 - probs) / (n_classes - 1))
            plt.plot(probs, loss, label=f'Class {i}')

    plt.xlabel('Predicted Probability')
    plt.ylabel('Loss')
    plt.title('Cross-Entropy Loss Landscape')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 5)
    plt.show()
```
Common Pitfalls
- Numerical Instability: Always use combined softmax + CE implementations
- Wrong Reduction: Ensure proper averaging over batch and classes
- Label Format: Match loss function to label format (one-hot vs sparse)
- Class Imbalance: Consider weighted CE or focal loss
- Probability Clipping: Prevent log(0) errors
Related Concepts
Understanding cross-entropy loss connects to:
- Softmax & Attention: Softmax normalization
- Gradient Flow: Backpropagation through CE
- Information Theory: Entropy, mutual information
- Optimization: Loss landscape navigation
- Calibration: Confidence vs accuracy alignment
Conclusion
Cross-entropy loss elegantly bridges information theory and machine learning, providing a principled way to train classification models. Its smooth gradients, probabilistic interpretation, and numerical stability make it the foundation of modern deep learning. Understanding its mathematical properties and implementation details is crucial for building and debugging classification systems.