Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap

These articles are AI-generated summaries. Please check the original sources for full details.

Implementing Naive Softmax

The Softmax activation function transforms raw neural network scores into a probability distribution, crucial for multi-class classification tasks. However, a naive implementation of Softmax can lead to numerical instability, causing training failures.

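The naive implementation exponentiates each logit and divides by the sum of the exponentials. It is mathematically correct but prone to overflow and underflow with extreme logit values: the exponentials become inf or 0, the division yields NaN, and the resulting NaN gradients halt training.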
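For reference, the standard Softmax definition is written out below. The symbols z (the logit vector) and K (the number of classes) are notation chosen here for illustration, not taken from the original article.

```latex
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```

The numerator and every term of the denominator are exponentials of raw logits, which is where overflow and underflow enter.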

Why This Matters

Ideal models assume infinite precision, but real computers work with finite-precision floating point. Large positive logits can exceed the maximum representable number during exponentiation, causing overflow, while large negative logits can underflow to zero. Left unaddressed, this instability can cause complete training failure, costing significant compute resources and development time.
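The limits described above can be checked with a minimal sketch in plain Python, which uses 64-bit floats. The helper name try_exp is hypothetical and exists only for this illustration. Note that Python's math.exp raises OverflowError, whereas torch.exp returns inf; 32-bit floats, the usual default for tensors, overflow much earlier (around exp(88.7)).

```python
import math
import sys

# Largest finite 64-bit float; exp(x) exceeds it once x is above roughly 709.78.
print(sys.float_info.max)

def try_exp(x):
    """Return exp(x), or the string 'overflow' if the result is not representable."""
    try:
        return math.exp(x)
    except OverflowError:
        return "overflow"

large = try_exp(1000.0)   # too large to represent
tiny = try_exp(-1000.0)   # too small to represent, silently becomes 0.0
print(large, tiny)
```

Underflow is the quieter failure: no error is raised, and the zero only causes trouble later, when it is used as a divisor or passed to a logarithm.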

Key Insights

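  • Numerical Instability: Logits of large magnitude can cause overflow (large positive values) or underflow (large negative values) in Softmax calculations.
  • LogSumExp Trick: Subtracting the maximum logit before exponentiating and computing the loss in the log domain avoids overflow and keeps at least one term of the sum equal to 1.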
  • Stable Cross-Entropy: Fused cross-entropy loss implementations (like those in PyTorch and TensorFlow) directly address numerical stability.
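The shifting idea can be sketched in plain Python. Subtracting the maximum logit m from every logit leaves Softmax unchanged, because the common factor exp(-m) cancels between numerator and denominator. The function names logsumexp and softmax_stable are chosen for this illustration and are not from the original article.

```python
import math

def logsumexp(logits):
    """log(sum(exp(z))) computed without exponentiating any value above 0."""
    m = max(logits)
    # After the shift the largest term is exp(0) = 1, so the sum is at least 1.
    return m + math.log(sum(math.exp(z - m) for z in logits))

def softmax_stable(logits):
    """Softmax computed on max-shifted logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)  # at least 1.0, so the division is always safe
    return [e / total for e in exps]

extreme = [1000.0, 1.0, -1000.0]
probs = softmax_stable(extreme)
lse = logsumexp(extreme)
print(probs, lse)
```

The smaller terms may still underflow to zero after the shift, but that is harmless here: the denominator can never be zero, and the logarithm never receives zero.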

Working Example

import torch

def softmax_naive(logits):
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

def stable_cross_entropy(logits, targets):
    # Find max logit per sample
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)
    # Shift logits for numerical stability
    shifted_logits = logits - max_logits
    # Compute LogSumExp
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)
    # Compute loss using ORIGINAL logits
    loss = log_sum_exp - logits[torch.arange(len(targets)), targets]
    return loss.mean()

# Sample Logits and Target Labels
logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)
targets = torch.tensor([0, 2, 1])

# Forward pass with naive Softmax
probs = softmax_naive(logits)
print("Softmax probabilities (naive):")
print(probs)

# Compute stable loss
loss = stable_cross_entropy(logits, targets)
print("\nStable loss:")
print(loss)

loss.backward()
print("\nGradients:")
print(logits.grad)

Practical Applications

  • Image Classification: Stable Softmax and cross-entropy are essential for training large image classification models like ResNet or EfficientNet.
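  • Pitfall: Computing Softmax and cross-entropy as separate operations can introduce numerical instability. Even a stable Softmax can return a probability of exactly 0 for the target class, and taking its log then gives an infinite loss and a failed training run.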
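The pitfall can be reproduced with a small sketch, assuming PyTorch is installed. It reuses the extreme row from the working example and compares a separate softmax-then-log computation against torch.nn.functional.cross_entropy, which takes raw logits. The variable names are chosen for this illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1.0, -1000.0]])
targets = torch.tensor([2])

# Separate operations: softmax first, then log of the target probability.
# torch.softmax is itself stable, but the target probability underflows to 0.
probs = torch.softmax(logits, dim=1)
separate_loss = -torch.log(probs[0, targets[0]])

# Fused operation: cross_entropy works on raw logits in the log domain.
fused_loss = F.cross_entropy(logits, targets)

print("separate:", separate_loss.item())
print("fused:   ", fused_loss.item())
```

The separate path loses the information once the probability rounds to zero, while the fused path never leaves the log domain and returns a finite loss.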
