Why Gradient Descent Zigzags and How Momentum Fixes It
These articles are AI-generated summaries. Please check the original sources for full details.
Standard gradient descent is inefficient on loss surfaces with uneven curvature and often needs more iterations to converge than necessary. In a controlled simulation, vanilla gradient descent took 185 steps to reach the minimum, whereas gradient descent with momentum converged in 159 steps.
Why This Matters
In real-world neural network training, loss surfaces are rarely symmetric and typically have high condition numbers, meaning the curvature can be, for example, 100x steeper in one direction than in another. This forces a trade-off: standard gradient descent must use a low learning rate to avoid divergence in the steep direction, and that same low rate causes near-stagnation in the flat direction, where progress is most needed.
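The trade-off can be made concrete on a quadratic bowl. The sketch below assumes a loss f(x, y) = 0.5 * (x**2 + 100 * y**2), which is an illustrative choice and not necessarily the surface used in the simulation; the learning rate is likewise a hypothetical value. On such a surface, each gradient descent step multiplies a coordinate by (1 - lr * curvature), so one learning rate yields very different behavior on the two axes.

```python
import math

# Assumed curvatures for an illustrative bowl f(x, y) = 0.5 * (x**2 + 100 * y**2).
CURV_FLAT, CURV_STEEP = 1.0, 100.0

def contraction_factor(lr, curvature):
    """Per-step multiplier gradient descent applies to one coordinate of a quadratic."""
    return 1.0 - lr * curvature

def steps_to_shrink(factor, ratio):
    """Number of steps until |factor|**k <= ratio (requires |factor| < 1)."""
    return math.ceil(math.log(ratio) / math.log(abs(factor)))

lr = 0.019  # hypothetical value just under 2 / CURV_STEEP = 0.02
steep = contraction_factor(lr, CURV_STEEP)  # negative: the sign flips every step
flat = contraction_factor(lr, CURV_FLAT)    # just below 1: very slow decay

print(f"steep axis factor: {steep:+.3f}")
print(f"flat axis factor:  {flat:+.3f}")
print("steps to shrink the flat coordinate 100x:", steps_to_shrink(flat, 1e-2))
```

A negative factor on the steep axis is the zigzag: the coordinate overshoots the minimum and changes sign on every step. A factor close to 1 on the flat axis is the stagnation.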
Key Insights
- Anisotropic surfaces with high condition numbers (e.g., 100) force vanilla gradient descent into inefficient zigzagging patterns.
- Momentum introduces a velocity term that acts as an exponential moving average of past gradients to smooth parameter updates.
- In steep directions, alternating gradient signs cancel out in the velocity update, effectively dampening oscillations.
- Consistent gradients in flatter directions accumulate over time, allowing the optimizer to accelerate across plateaus.
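- The stability limit on the learning rate for gradient descent is 2/lambda_max, where lambda_max is the largest curvature; exceeding it causes divergence. Separately, setting beta too high (e.g., 0.99) causes the optimizer to overshoot instead of converging faster.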
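The cancellation and accumulation effects described in this list can be seen on the velocity update alone. The following is a minimal sketch with idealized, made-up gradient sequences (a pure +1/-1 alternation and a constant +1); real gradients would also change in magnitude as the optimizer moves.

```python
def ema_velocity(gradients, beta=0.9):
    """Return the velocity after each step of v = beta * v + (1 - beta) * g."""
    v = 0.0
    history = []
    for g in gradients:
        v = beta * v + (1 - beta) * g
        history.append(v)
    return history

steps = 50
alternating = [(-1.0) ** k for k in range(steps)]  # +1, -1, +1, ... (steep axis)
consistent = [1.0] * steps                          # same sign every step (flat axis)

v_alt = ema_velocity(alternating)
v_con = ema_velocity(consistent)

print("largest |velocity| with alternating gradients:", max(abs(v) for v in v_alt))
print("final velocity with consistent gradients:", v_con[-1])
```

With alternating signs the velocity stays a small fraction of the raw gradient magnitude, while with a consistent sign it builds up toward the full gradient value.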
Working Examples
Comparison of Vanilla Gradient Descent and Momentum update logic.
    import numpy as np

    # grad(x, y) is not shown in the source; it is expected to return the
    # gradient of the loss at (x, y) as a length-2 NumPy array.

    def gradient_descent(start, lr, steps=300):
        path = [np.array(start, dtype=float)]
        pos = np.array(start, dtype=float)
        for _ in range(steps):
            # Step directly against the current gradient.
            pos = pos - lr * grad(*pos)
            path.append(pos.copy())
        return np.array(path)

    def momentum_gd(start, lr, beta, steps=300):
        path = [np.array(start, dtype=float)]
        pos = np.array(start, dtype=float)
        v = np.zeros(2)
        for _ in range(steps):
            g = grad(*pos)
            # Velocity is an exponential moving average of past gradients.
            v = beta * v + (1 - beta) * g
            pos = pos - lr * v
            path.append(pos.copy())
        return np.array(path)
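To run the comparison end to end, a loss surface and a convergence test are needed. The sketch below fills both in with assumptions: `grad` is the gradient of a hypothetical bowl f(x, y) = 0.5 * (x**2 + 100 * y**2), and `run`, `first_step_within`, the start point, the learning rate, and the tolerance are all illustrative. The step counts it prints should not be expected to match the 185 and 159 reported for the original simulation, whose exact settings are not given here.

```python
import numpy as np

def grad(x, y):
    # Assumption: gradient of the illustrative bowl f(x, y) = 0.5 * (x**2 + 100 * y**2).
    return np.array([x, 100.0 * y])

def run(start, lr, beta=0.0, steps=300):
    """EMA-momentum descent; beta=0.0 reduces to vanilla gradient descent."""
    pos = np.array(start, dtype=float)
    v = np.zeros_like(pos)
    path = [pos.copy()]
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(*pos)
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)

def first_step_within(path, tol):
    """Index of the first iterate within tol of the minimum at the origin, or None."""
    close = np.linalg.norm(path, axis=1) < tol
    return int(np.argmax(close)) if close.any() else None

start, lr, tol = (10.0, 1.0), 0.019, 0.1  # hypothetical settings
gd_steps = first_step_within(run(start, lr), tol)
mom_steps = first_step_within(run(start, lr, beta=0.9), tol)
print("vanilla:", gd_steps, "momentum:", mom_steps)
```

Folding both optimizers into one function works because the update v = beta * v + (1 - beta) * g collapses to v = g when beta is 0, which is exactly the vanilla step.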
Practical Applications
- Use Case: Training deep neural networks on complex loss surfaces where beta=0.9 serves as the typical sweet spot for stabilizing updates. Pitfall: Setting beta too high (e.g., 0.99) causes the optimizer to overshoot the minimum and fail to stabilize.
- Use Case: Navigating anisotropic bowls where one axis is 100x steeper than the other; in the simulation, momentum reduced the steps to convergence from 185 to 159. Pitfall: A learning rate above the stability limit (2 / lambda_max) causes the optimizer to diverge outright.
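The learning rate pitfall can be checked in one dimension. This sketch assumes a steep axis with curvature 100 (so the limit is 2 / 100 = 0.02) and runs vanilla gradient descent with a learning rate 5% below and 5% above that limit; the function name and all values are illustrative.

```python
def iterate_steep_axis(lr, curvature=100.0, y0=1.0, steps=100):
    """Vanilla gradient descent on 0.5 * curvature * y**2; returns |y| after each step."""
    y = y0
    magnitudes = []
    for _ in range(steps):
        y = y - lr * curvature * y  # gradient of 0.5 * curvature * y**2 is curvature * y
        magnitudes.append(abs(y))
    return magnitudes

limit = 2.0 / 100.0  # 2 / lambda_max for the assumed curvature of 100
below = iterate_steep_axis(lr=0.95 * limit)
above = iterate_steep_axis(lr=1.05 * limit)
print("final |y| below the limit:", below[-1])
print("final |y| above the limit:", above[-1])
```

Below the limit each step shrinks the coordinate while flipping its sign; above it each step makes the coordinate larger, so the error compounds instead of decaying.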