Sigmoid vs ReLU: Why Geometric Context Preservation is Critical for Neural Network Inference


These articles are AI-generated summaries. Please check the original sources for full details.

Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context

Neural networks function as geometric systems in which each layer reshapes the input space to form decision boundaries. Sigmoid disrupts this by squashing every pre-activation into the interval (0, 1), so large differences in magnitude become nearly indistinguishable. ReLU passes positive values through unchanged, which lets distance information flow into deeper layers.
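The compression described above can be seen on a handful of numbers. This is a minimal sketch, not code from the article: the input values in `z` are arbitrary illustrative pre-activations, and `sig_spread` and `relu_spread` are hypothetical names for the gap between the largest and smallest outputs.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive inputs through unchanged, zeroes the rest.
    return np.maximum(0.0, z)

# Illustrative positive pre-activations that differ by up to 18 units.
z = np.array([2.0, 5.0, 10.0, 20.0])

sig_out = sigmoid(z)
relu_out = relu(z)

# Spread of the outputs: how much of the original distance survives.
sig_spread = float(sig_out.max() - sig_out.min())
relu_spread = float(relu_out.max() - relu_out.min())

print("sigmoid outputs:", sig_out, "spread:", sig_spread)
print("relu outputs:   ", relu_out, "spread:", relu_spread)
```

With Sigmoid, all four outputs sit close to 1, so a later layer can hardly tell 2 from 20. With ReLU, the 18-unit gap is passed on intact.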

Why This Matters

In deep architectures, losing geometric context weakens representations and stalls learning. In the provided experiment, training loss plateaued at 0.28 with Sigmoid but dropped to 0.03 with ReLU. This suggests that activation choice directly affects a model’s ability to use depth for expressive power.

Key Insights

  • Sigmoid-induced signal compression leads to nearly linear decision boundaries, achieving only 79% accuracy on the two-moons dataset compared to ReLU’s 96%.
  • Representation collapse in Sigmoid networks is evidenced by hidden space standard deviation dropping from 0.26 to 0.19 across layers, indicating a loss of expressivity.
  • ReLU maintains signal magnitude for positive inputs, enabling deeper layers to receive pre-activation values between 9.0 and 20.0, far exceeding Sigmoid’s 0.6 cap.
  • Xavier-style initialization (scale sqrt(1 / fan_in) in the example code) is used with Sigmoid to stabilize initial variance, while ReLU uses He initialization (scale sqrt(2 / fan_in)) to compensate for the half of the input space that it zeroes out.
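One way to inspect the representation spread mentioned in the list above is to push random inputs through an untrained stack and record the standard deviation of each hidden layer. This is a sketch under stated assumptions: `hidden_stds`, its depth and width defaults, and the standard-normal inputs are hypothetical choices for illustration, and the numbers it prints are not the article's reported 0.26 and 0.19, which came from its own trained networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def hidden_stds(activation, depth=4, width=8, n_samples=256, seed=0):
    """Return the std of the activations at each hidden layer of a random stack."""
    rng = np.random.default_rng(seed)
    act = relu if activation == "relu" else sigmoid
    # He scaling (2 / fan_in) for ReLU, 1 / fan_in scaling for Sigmoid.
    gain = 2.0 if activation == "relu" else 1.0
    a = rng.standard_normal((n_samples, 2))  # assumed 2-D inputs
    stds = []
    for _ in range(depth):
        fan_in = a.shape[1]
        W = rng.standard_normal((fan_in, width)) * np.sqrt(gain / fan_in)
        a = act(a @ W)
        stds.append(float(a.std()))
    return stds

relu_stds = hidden_stds("relu")
sigmoid_stds = hidden_stds("sigmoid")

for layer, (r, s) in enumerate(zip(relu_stds, sigmoid_stds), start=1):
    print(f"layer {layer}: relu std={r:.3f}  sigmoid std={s:.3f}")
```

Because Sigmoid outputs are confined to (0, 1), their standard deviation can never exceed 0.5, whereas ReLU activations have no such ceiling.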

Working Examples

Implementation of a feedforward network with three weight layers (two hidden layers of 8 units and a sigmoid output) to compare Sigmoid and ReLU signal propagation.

import numpy as np

# The excerpt does not show these helpers; standard definitions are assumed.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def sigmoid_d(z):
    s = sigmoid(z)
    return s * (1.0 - s)
def relu(z):
    return np.maximum(0.0, z)
def relu_d(z):
    return (z > 0).astype(float)

class TwoLayerNet:
    def __init__(self, activation="relu", seed=0):
        np.random.seed(seed)
        self.act_name = activation
        self.act = relu if activation == "relu" else sigmoid
        self.dact = relu_d if activation == "relu" else sigmoid_d
        # He scaling for ReLU, sqrt(1 / fan_in) scaling for Sigmoid.
        scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
        self.W1 = np.random.randn(2, 8) * scale(2)
        self.b1 = np.zeros((1, 8))
        self.W2 = np.random.randn(8, 8) * scale(8)
        self.b2 = np.zeros((1, 8))
        self.W3 = np.random.randn(8, 1) * scale(8)
        self.b3 = np.zeros((1, 1))
        self.loss_history = []

    def forward(self, X, store=False):
        z1 = X @ self.W1 + self.b1
        a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2
        a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3
        out = sigmoid(z3)  # output layer is always sigmoid
        if store:
            self._cache = (X, z1, a1, z2, a2, z3, out)
        return out

Practical Applications

  • Use Case: Deep MLP architectures utilizing ReLU to ensure distance information compounds through depth. Pitfall: Hidden-layer Sigmoid use causing representation entanglement where classes become indistinguishable.
  • Use Case: Complexity-sensitive inference where ReLU’s sparse activation enables efficient decision surface refinement. Pitfall: Replacing ReLU with Sigmoid in deep stacks resulting in ‘wasted capacity’ where added layers fail to improve accuracy.
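The sparse activation mentioned in the list above can be checked directly by counting exact zeros in a layer's output. This is a minimal sketch that assumes standard-normal pre-activations for a hypothetical 8-unit layer; real trained networks will show different fractions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed pre-activations: 10,000 samples for a hypothetical 8-unit layer.
z = rng.standard_normal((10_000, 8))

relu_out = np.maximum(0.0, z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# Fraction of units that are exactly zero, i.e. inactive for a given input.
relu_zero_frac = float((relu_out == 0.0).mean())
sigmoid_zero_frac = float((sigmoid_out == 0.0).mean())

print(f"ReLU zero fraction:    {relu_zero_frac:.3f}")
print(f"Sigmoid zero fraction: {sigmoid_zero_frac:.3f}")
```

Under this assumption roughly half of the ReLU units are inactive for any input, while every Sigmoid unit emits a nonzero value.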
