Sigmoid vs ReLU: Why Geometric Context Preservation is Critical for Neural Network Inference
These articles are AI-generated summaries. Please check the original sources for full details.
Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context
Neural networks function as geometric systems in which each layer reshapes the input space to form decision boundaries. Sigmoid disrupts this by squashing every input into the range (0, 1), while ReLU passes positive values through unchanged, so distance information continues to flow into deeper layers.
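As a rough illustration of that compression, the sketch below applies both activations to three hypothetical pre-activation values, standing in for points at increasing distance from a decision boundary. The numbers are chosen for illustration and are not taken from the article's experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical pre-activations: three points at increasing distance from a boundary.
z = np.array([2.0, 6.0, 12.0])

sig_out = sigmoid(z)
relu_out = relu(z)

# Gap between the nearest and the farthest point after each activation.
sig_gap = sig_out[-1] - sig_out[0]
relu_gap = relu_out[-1] - relu_out[0]

print("sigmoid outputs:", np.round(sig_out, 4), "gap:", round(float(sig_gap), 4))
print("relu outputs:   ", relu_out, "gap:", float(relu_gap))
```

The points start 10 units apart. After Sigmoid they are only about 0.12 apart, while ReLU leaves the 10-unit gap intact for the next layer to use.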
Why This Matters
In deep architectures, losing geometric context weakens the learned representations and stalls learning. In the article’s experiment, Sigmoid’s compression caused training loss to plateau at 0.28, whereas ReLU’s preservation of signal magnitude allowed loss to drop to 0.03. Activation choice therefore directly affects how well a model can turn added depth into expressive power.
Key Insights
- Sigmoid-induced signal compression produces nearly linear decision boundaries: the Sigmoid network reaches only 79% accuracy on the two-moons dataset, compared with 96% for ReLU.
- Representation collapse in the Sigmoid network shows up as the standard deviation of the hidden representation dropping from 0.26 to 0.19 across layers, indicating a loss of expressivity.
- ReLU maintains signal magnitude for positive inputs, enabling deeper layers to receive pre-activation values between 9.0 and 20.0, far exceeding Sigmoid’s 0.6 cap.
- The experiment pairs Sigmoid with Xavier initialization to stabilize initial variance, while ReLU uses He initialization to account for the zero-rectified half-space.
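The initialization point can be checked with a small experiment. The sketch below pushes random inputs through a stack of ReLU layers and compares the He scale, sqrt(2 / fan_in), with the sqrt(1 / fan_in) scale that the article's code uses for its Sigmoid branch. The width, depth, sample count, and the helper name `rms_ratio` are assumptions made for illustration; this is not the article's experiment.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def rms_ratio(scale_fn, width=512, depth=5, n_samples=1000, seed=0):
    """Ratio of pre-activation RMS at the last layer to that at the first layer."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, width))  # hypothetical unit-variance inputs
    rms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        z = a @ W
        rms.append(np.sqrt(np.mean(z ** 2)))
        a = relu(z)  # ReLU zeroes roughly half of the units
    return rms[-1] / rms[0]

he_ratio = rms_ratio(lambda fan_in: np.sqrt(2 / fan_in))
small_ratio = rms_ratio(lambda fan_in: np.sqrt(1 / fan_in))

print(f"He scale, ReLU:       last/first RMS = {he_ratio:.2f}")
print(f"1/fan_in scale, ReLU: last/first RMS = {small_ratio:.2f}")
```

Because ReLU discards the negative half of each pre-activation, the sqrt(1 / fan_in) scale loses about half of the signal's second moment at every layer, so the ratio shrinks with depth. The extra factor of 2 in the He scale compensates for this, and the ratio stays close to 1.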
Working Examples
Implementation of a feedforward network with two hidden layers (three weight layers in total) to compare Sigmoid and ReLU signal propagation.
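    import numpy as np

    # Activation helpers (assumed definitions; the excerpt references them without showing them).
    # Each derivative takes the pre-activation z.
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    def sigmoid_d(z):
        s = sigmoid(z)
        return s * (1 - s)
    def relu(z):
        return np.maximum(0, z)
    def relu_d(z):
        return (z > 0).astype(float)

    class TwoLayerNet:
        def __init__(self, activation="relu", seed=0):
            np.random.seed(seed)
            self.act_name = activation
            self.act = relu if activation == "relu" else sigmoid
            self.dact = relu_d if activation == "relu" else sigmoid_d
            # He scale for ReLU, 1/fan_in (Xavier-style) scale for Sigmoid
            scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
            self.W1 = np.random.randn(2, 8) * scale(2)
            self.b1 = np.zeros((1, 8))
            self.W2 = np.random.randn(8, 8) * scale(8)
            self.b2 = np.zeros((1, 8))
            self.W3 = np.random.randn(8, 1) * scale(8)
            self.b3 = np.zeros((1, 1))
            self.loss_history = []

        def forward(self, X, store=False):
            z1 = X @ self.W1 + self.b1
            a1 = self.act(z1)
            z2 = a1 @ self.W2 + self.b2
            a2 = self.act(z2)
            z3 = a2 @ self.W3 + self.b3
            out = sigmoid(z3)  # output layer is always Sigmoid (binary classification)
            if store:
                self._cache = (X, z1, a1, z2, a2, z3, out)  # kept for a later backward pass
            return out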
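The class above stores `dact`, `loss_history`, and a forward-pass cache, which suggests a training loop that the excerpt does not show. The following is a minimal standalone sketch of what such a loop could look like for the ReLU variant. The binary cross-entropy loss, learning rate, step count, and the `make_moons` helper (a hand-rolled stand-in for a two-moons generator) are all assumptions, not the article's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def relu_d(z):
    return (z > 0).astype(float)

def make_moons(n, rng, noise=0.1):
    # Hypothetical two-moons stand-in: two interleaved, noisy half circles.
    t = rng.uniform(0.0, np.pi, size=n)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower]) + rng.normal(0.0, noise, size=(2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)]).reshape(-1, 1)
    return X, y

def bce(out, y, eps=1e-9):
    # Binary cross-entropy (assumed loss; the excerpt does not name one).
    return float(-np.mean(y * np.log(out + eps) + (1 - y) * np.log(1 - out + eps)))

rng = np.random.default_rng(0)
X, y = make_moons(100, rng)
n = X.shape[0]

# He-scaled weights for a 2 -> 8 -> 8 -> 1 network, mirroring the shapes above.
sizes = [2, 8, 8, 1]
W = [rng.standard_normal((i, o)) * np.sqrt(2 / i) for i, o in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((1, o)) for o in sizes[1:]]

lr, losses = 0.1, []
for step in range(300):
    # Forward pass
    z1 = X @ W[0] + b[0]
    a1 = relu(z1)
    z2 = a1 @ W[1] + b[1]
    a2 = relu(z2)
    out = sigmoid(a2 @ W[2] + b[2])
    losses.append(bce(out, y))

    # Backward pass: with a Sigmoid output and BCE, dL/dz3 = (out - y) / n
    dz3 = (out - y) / n
    dz2 = (dz3 @ W[2].T) * relu_d(z2)
    dz1 = (dz2 @ W[1].T) * relu_d(z1)
    grads_W = [X.T @ dz1, a1.T @ dz2, a2.T @ dz3]
    grads_b = [d.sum(axis=0, keepdims=True) for d in (dz1, dz2, dz3)]

    # Plain full-batch gradient descent
    for k in range(3):
        W[k] -= lr * grads_W[k]
        b[k] -= lr * grads_b[k]

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Swapping `relu` and `relu_d` for Sigmoid equivalents, and the He scale for sqrt(1 / fan_in), would give the Sigmoid counterpart for a side-by-side comparison of the loss curves.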
Practical Applications
- Use Case: Deep MLP architectures utilizing ReLU to ensure distance information compounds through depth. Pitfall: Hidden-layer Sigmoid use causing representation entanglement where classes become indistinguishable.
- Use Case: Complexity-sensitive inference where ReLU’s sparse activation enables efficient decision surface refinement. Pitfall: Replacing ReLU with Sigmoid in deep stacks resulting in ‘wasted capacity’ where added layers fail to improve accuracy.
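The sparse-activation point in the second use case can be made concrete with a quick count. The sketch below draws hypothetical zero-mean pre-activations and measures how many outputs each activation sets exactly to zero. The data are synthetic and only illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, 8))  # hypothetical zero-mean pre-activations

relu_out = np.maximum(0.0, z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# Fraction of activations that are exactly zero
relu_zero_frac = float(np.mean(relu_out == 0.0))
sigmoid_zero_frac = float(np.mean(sigmoid_out == 0.0))

print(f"ReLU zeros:    {relu_zero_frac:.1%}")
print(f"Sigmoid zeros: {sigmoid_zero_frac:.1%}")
```

For zero-mean inputs, ReLU silences roughly half of the units, while Sigmoid outputs are never exactly zero here. Whether that sparsity lowers inference cost depends on the runtime actually skipping zeroed units, which this sketch does not measure.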