Skip to main content

On This Page

Optimizing LLM Training with AdamW and Cosine Decay

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

How to Speed-Up Training of Language Models

Language model training is slow, even for modest-sized models. A 2025 study found that AdamW with cosine decay reduces convergence time by 30% compared to vanilla Adam.

Why This Matters

Training large language models requires balancing computational cost with convergence stability. Ideal models would train rapidly without overfitting, but in practice, unstable gradients and memory constraints often force engineers to use suboptimal hyperparameters. For example, improper learning rate scheduling can increase training time by 50% for models with over 1B parameters.

Key Insights

  • “AdamW with decoupled weight decay improves stability over Adam, 2017”
  • “Cosine decay outperforms linear decay for learning rate scheduling in LLMs”
  • “PyTorch’s CosineAnnealingLR used by Meta and Google in LLaMA training pipelines”

Working Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Example setup
model = torch.nn.Linear(10, 1)
X, y = torch.randn(5, 10), torch.randn(5)
loss_fn = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.1)

# Define learning rate schedulers
warmup_steps = 10
total_steps = 100
warmup_lr = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine_lr = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-4)
combined_lr = SequentialLR(optimizer, schedulers=[warmup_lr, cosine_lr], milestones=[warmup_steps])

# Training loop
for step in range(total_steps):
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    combined_lr.step()

Practical Applications

  • Use Case: Training LLaMA-3 with AdamW and cosine decay for 100k steps
  • Pitfall: Skipping warm-up phase causes gradient instability in first 5% of training steps

References:


Continue reading

Next article

Las Vegas' $2bn World Cup Stadium Fails to Address Critical Infrastructure Gaps

Related Content