Sigmoid vs ReLU: Why Geometric Context Preservation is Critical for Neural Network Inference

Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context

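Neural networks function as geometric systems in which each layer reshapes the input space to form decision boundaries. Sigmoid disrupts this by squashing every input into the range (0, 1), so points at very different distances from a boundary end up with nearly identical activations. ReLU passes positive values through unchanged, which lets magnitude, and with it distance information, flow into deeper layers.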
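
To make the compression concrete, here is a minimal sketch that applies both activations to the same pre-activation values. The input values are chosen purely for illustration and are not taken from the article's experiment.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: passes positive inputs through unchanged."""
    return np.maximum(0.0, z)

# Illustrative pre-activations: points at increasing distance from a boundary at 0.
z = np.array([2.0, 5.0, 10.0, 20.0])

sig_out = sigmoid(z)
relu_out = relu(z)

# Gap between the nearest and farthest point after each activation.
sig_spread = float(sig_out.max() - sig_out.min())
relu_spread = float(relu_out.max() - relu_out.min())

print("sigmoid:", np.round(sig_out, 4), "spread:", round(sig_spread, 4))
print("relu:   ", relu_out, "spread:", relu_spread)
```

After Sigmoid, the point at distance 20 is barely distinguishable from the point at distance 2, whereas ReLU keeps the full gap of 18 available to the next layer.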

Why This Matters

In deep architectures, the loss of geometric context leads to weaker representations and stalls learning. In the source article's experiment, Sigmoid’s compression caused training loss to plateau at 0.28, whereas ReLU’s preservation of signal magnitude allowed loss to drop to 0.03. The gap suggests that activation choice directly affects how well a model can use depth for expressive power.

Key Insights

Sigmoid-induced signal compression leads to nearly linear decision boundaries, achieving only 79% accuracy on the two-moons dataset compared to ReLU’s 96%.
Representation collapse in Sigmoid networks is evidenced by hidden space standard deviation dropping from 0.26 to 0.19 across layers, indicating a loss of expressivity.
ReLU maintains signal magnitude for positive inputs, so deeper layers receive pre-activation values between 9.0 and 20.0, far exceeding the roughly 0.6 ceiling reported for the Sigmoid network.
Sigmoid is paired with Xavier-style initialization to stabilize initial variance, while ReLU uses He initialization, which doubles the weight variance to compensate for the half of the input space that ReLU sets to zero.
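
The two initialization rules can be compared with a short sketch. It assumes the same scaling the article's snippet uses, sqrt(2 / fan_in) for ReLU and sqrt(1 / fan_in) for Sigmoid; the latter is a fan-in-only variant of Xavier scaling, whereas the classic Glorot formula uses 2 / (fan_in + fan_out). The layer width, batch size, and the `init_weights` helper are hypothetical and chosen only to make the statistics stable.

```python
import numpy as np

def init_weights(fan_in, fan_out, activation, rng):
    """Draw a (fan_in, fan_out) weight matrix scaled for the given activation."""
    if activation == "relu":
        std = np.sqrt(2.0 / fan_in)   # He-style scaling
    else:
        std = np.sqrt(1.0 / fan_in)   # fan-in variant of Xavier scaling
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512                   # illustrative layer width
x = rng.standard_normal((2000, fan_in))      # unit-variance inputs

# ReLU + He scaling: pre-activation variance is about 2, and because ReLU
# zeroes roughly half of the values, the mean squared activation is about 1.
z_relu = x @ init_weights(fan_in, fan_out, "relu", rng)
a_relu = np.maximum(0.0, z_relu)
relu_second_moment = float(np.mean(a_relu ** 2))

# Sigmoid-style scaling: pre-activation variance stays near 1.
z_sig = x @ init_weights(fan_in, fan_out, "sigmoid", rng)
sig_preact_var = float(np.var(z_sig))

print("ReLU mean squared activation:", round(relu_second_moment, 3))
print("Sigmoid pre-activation variance:", round(sig_preact_var, 3))
```

The factor of 2 in the ReLU branch is what offsets the zeroed half-space; without it the mean squared activation would roughly halve at every layer.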

Working Examples

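Implementation of a feedforward network with three weight layers (two 8-unit hidden layers and a sigmoid output) to compare Sigmoid and ReLU signal propagation. The activation helpers are not part of the original snippet and are assumed here so that the class runs on its own.

import numpy as np

# Activation helpers (assumed; each derivative takes the pre-activation z)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_d(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_d(z):
    return (z > 0).astype(float)

class TwoLayerNet:
    """Two hidden layers of 8 units followed by a single sigmoid output unit."""

    def __init__(self, activation="relu", seed=0):
        np.random.seed(seed)
        self.act_name = activation
        self.act = relu if activation == "relu" else sigmoid
        self.dact = relu_d if activation == "relu" else sigmoid_d
        # He scaling for ReLU, fan-in (Xavier-style) scaling for Sigmoid
        scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
        self.W1 = np.random.randn(2, 8) * scale(2)
        self.b1 = np.zeros((1, 8))
        self.W2 = np.random.randn(8, 8) * scale(8)
        self.b2 = np.zeros((1, 8))
        self.W3 = np.random.randn(8, 1) * scale(8)
        self.b3 = np.zeros((1, 1))
        self.loss_history = []

    def forward(self, X, store=False):
        z1 = X @ self.W1 + self.b1
        a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2
        a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3
        out = sigmoid(z3)  # the output is always a sigmoid probability
        if store:
            # Keep intermediates for a later backward pass
            self._cache = (X, z1, a1, z2, a2, z3, out)
        return out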
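
The per-layer statistics cited under Key Insights (pre-activation magnitude and hidden-space standard deviation) can be collected with a sketch like the one below. It mirrors the 2-8-8 hidden architecture above at initialization only, so it will not reproduce the article's trained-network numbers. The `layer_stats` helper and the random input points, which stand in for the two-moons data, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def layer_stats(X, activation, seed=0):
    """Forward X through two 8-unit hidden layers and record per-layer statistics."""
    rng = np.random.default_rng(seed)
    act = relu if activation == "relu" else sigmoid
    gain = 2.0 if activation == "relu" else 1.0   # He vs fan-in scaling
    stats = []
    a = X
    for fan_in, fan_out in [(2, 8), (8, 8)]:
        W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(gain / fan_in)
        z = a @ W              # biases start at zero, so they are omitted here
        a = act(z)
        stats.append({
            "max_abs_preact": float(np.abs(z).max()),
            "act_std": float(a.std()),
            "act_max": float(a.max()),
        })
    return stats

# Hypothetical input: 200 random 2-D points standing in for the two-moons data.
X = np.random.default_rng(1).standard_normal((200, 2)) * 3.0

sigmoid_stats = layer_stats(X, "sigmoid")
relu_stats = layer_stats(X, "relu")

for name, stats in (("sigmoid", sigmoid_stats), ("relu", relu_stats)):
    for depth, s in enumerate(stats, start=1):
        print(name, "hidden layer", depth, s)
```

Because Sigmoid activations live in (0, 1), their standard deviation can never exceed 0.5, and the second layer only ever sees inputs from that bounded range. ReLU activations have no such ceiling.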

Practical Applications

Use Case: Deep MLP architectures utilizing ReLU to ensure distance information compounds through depth. Pitfall: Hidden-layer Sigmoid use causing representation entanglement where classes become indistinguishable.
Use Case: Complexity-sensitive inference where ReLU’s sparse activation enables efficient decision surface refinement. Pitfall: Replacing ReLU with Sigmoid in deep stacks resulting in ‘wasted capacity’ where added layers fail to improve accuracy.
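
The sparse activation mentioned above can be illustrated with a minimal sketch that counts exact zeros in a hidden layer. The pre-activations are hypothetical zero-mean random values, not outputs from the article's network.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 8))    # hypothetical zero-mean pre-activations

relu_out = np.maximum(0.0, z)
sig_out = 1.0 / (1.0 + np.exp(-z))

# Fraction of hidden units that are exactly zero for each activation.
relu_sparsity = float(np.mean(relu_out == 0.0))
sig_sparsity = float(np.mean(sig_out == 0.0))

print("ReLU zero fraction:   ", relu_sparsity)
print("Sigmoid zero fraction:", sig_sparsity)
```

With zero-mean pre-activations, ReLU silences roughly half of the units for any given input, while Sigmoid never outputs an exact zero, so every unit contributes to every prediction.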

References:

https://www.marktechpost.com/2026/04/09/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context/
