Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap

Implementing Naive Softmax

The Softmax activation function transforms raw neural network scores (logits) into a probability distribution, which makes it central to multi-class classification. However, a naive implementation of Softmax can be numerically unstable and cause training to fail.

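The naive approach, which exponentiates each logit and divides by the sum of the exponentials, is mathematically correct but prone to overflow and underflow when logits are extreme. The resulting NaN values propagate into the loss and the gradients and halt training.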

Why This Matters

The mathematics assumes infinite precision, but real-world computers use floating-point numbers with finite range. Exponentiating a large positive logit can exceed the maximum representable number and overflow to infinity, while exponentiating a large negative logit can underflow to zero. Dividing infinity by infinity, or zero by zero, then produces NaN. If unaddressed, this instability can cause a training run to fail completely, wasting compute resources and development time.
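The finite limits mentioned above can be observed directly. The snippet below is a minimal sketch using only the Python standard library, so it works with float64 values; the specific inputs 710 and -746 are chosen for illustration and do not come from the article. Note that the standard library raises an exception on overflow instead of returning infinity, and that float32, the default tensor type in PyTorch, has a narrower range and overflows once the input to exp exceeds roughly 88.7.

```python
import math
import sys

# Python floats are IEEE 754 double precision (float64).
print(sys.float_info.max)  # largest finite float64, about 1.8e308

# Overflow: exp(710) exceeds the largest float64, so math.exp raises
# OverflowError instead of returning a finite number.
try:
    math.exp(710.0)
    overflowed = False
except OverflowError:
    overflowed = True

# Underflow: exp(-746) is smaller than the smallest positive float64,
# so the result is rounded to exactly 0.0.
underflowed = math.exp(-746.0) == 0.0

print(overflowed, underflowed)
```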

Key Insights

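Numerical Instability: Large positive logits overflow during exponentiation and large negative logits underflow to zero, and either can turn the Softmax output into NaN.
LogSumExp Trick: Subtracting the maximum logit before exponentiating keeps every exponent at or below zero, and computing the loss with LogSumExp keeps the calculation in the log domain, which avoids both overflow and the logarithm of zero.
Stable Cross-Entropy: Fused cross-entropy loss implementations (like those in PyTorch and TensorFlow) operate on raw logits and handle numerical stability internally.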
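The max-shift idea behind the LogSumExp trick also applies to Softmax itself. The following is a minimal sketch, assuming 2D logits of shape (batch, classes); the name softmax_stable is hypothetical and is not taken from the article.

```python
import torch

def softmax_stable(logits):
    # Subtract the per-row maximum so the largest exponent is exp(0) = 1.
    # Softmax is unchanged by this shift because the factor exp(-max)
    # cancels between the numerator and the denominator.
    shifted = logits - logits.max(dim=1, keepdim=True).values
    exp_shifted = torch.exp(shifted)
    return exp_shifted / exp_shifted.sum(dim=1, keepdim=True)

logits = torch.tensor([[1000.0, 1.0, -1000.0],
                       [2.0, 1.0, 0.1]])
probs = softmax_stable(logits)
print(probs)
```

Because every shifted logit is at most zero, no exponential can overflow, and each row's denominator is at least 1, so the division never becomes 0/0.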

Working Example

import torch

def softmax_naive(logits):
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

def stable_cross_entropy(logits, targets):
    # Find max logit per sample
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)
    # Shift logits for numerical stability
    shifted_logits = logits - max_logits
    # Compute LogSumExp
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)
    # Loss per sample = LogSumExp - target logit (uses the ORIGINAL, unshifted logits)
    loss = log_sum_exp - logits[torch.arange(len(targets)), targets]
    return loss.mean()

# Sample Logits and Target Labels
logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)
targets = torch.tensor([0, 2, 1])

# Forward pass with naive Softmax (row 2 overflows: exp(1000) = inf, and inf / inf = nan)
probs = softmax_naive(logits)
print("Softmax probabilities (naive):")
print(probs)

# Compute stable loss
loss = stable_cross_entropy(logits, targets)
print("\nStable loss:")
print(loss)

loss.backward()
print("\nGradients:")
print(logits.grad)
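One way to sanity-check a hand-written stable loss is to compare it with PyTorch's built-in fused loss on the same inputs. The sketch below reuses the example logits and targets and recomputes the manual loss with torch.logsumexp, so it is a compact stand-in for the function above and not a copy of it; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0],
])
targets = torch.tensor([0, 2, 1])

# Manual loss: LogSumExp of each row minus the logit of the target class.
# torch.logsumexp is numerically stable for large inputs.
target_logits = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
manual_loss = (torch.logsumexp(logits, dim=1) - target_logits).mean()

# Built-in fused loss, which takes raw logits (not probabilities).
builtin_loss = F.cross_entropy(logits, targets)

print(manual_loss, builtin_loss)
```

Both values should be finite and agree to within floating-point tolerance, even though the second row would break a naive Softmax.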

Practical Applications

Image Classification: Stable Softmax and cross-entropy are essential for training large image classification models like ResNet or EfficientNet.
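Pitfall: Computing Softmax first and then taking the logarithm in a separate cross-entropy step can introduce numerical instability. A probability that underflows to zero turns into log(0), which is negative infinity, and the resulting infinite loss or NaN gradients can cause training runs to fail.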
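The pitfall can be reproduced on a single extreme row. This is a minimal sketch, with an illustrative logit row and target; it contrasts a two-step log-of-Softmax with the fused log_softmax and nll_loss functions from torch.nn.functional.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1.0, -1000.0]])
targets = torch.tensor([2])

# Two-step version: even a stable Softmax rounds the probability of
# class 2 to exactly 0, so the logarithm that follows returns -inf.
probs = torch.softmax(logits, dim=1)
two_step_loss = -torch.log(probs[0, 2])

# Fused version: log_softmax stays in the log domain throughout,
# so the tiny probability is never materialized.
fused_loss = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(two_step_loss)  # infinite
print(fused_loss)     # finite
```

The two-step loss is infinite even though the Softmax call itself is stable, which is why the fused form is usually preferred.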

References:

https://www.marktechpost.com/2026/01/06/implementing-softmax-from-scratch-avoiding-the-numerical-stability-trap/
