Optimizing LLM Training with AdamW and Cosine Decay

How to Speed Up Training of Language Models

Training a language model is slow, even for modestly sized models. A 2025 study found that AdamW with cosine learning rate decay reduces convergence time by 30% compared to plain Adam.

Why This Matters

Training large language models means balancing computational cost against convergence stability. Ideally, a model would train quickly without overfitting, but in practice unstable gradients and memory constraints often force engineers to settle for suboptimal hyperparameters. For example, improper learning rate scheduling can increase training time by 50% for models with over 1B parameters.

Key Insights

“AdamW (Loshchilov and Hutter, 2017) decouples weight decay from the gradient update, which improves stability over Adam with L2 regularization”
“Cosine decay outperforms linear decay for learning rate scheduling in LLMs”
“Meta trained its LLaMA models with AdamW and a cosine learning rate schedule; PyTorch’s CosineAnnealingLR implements this kind of decay”
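
The first insight is easier to see with numbers. The sketch below hand-computes a single optimizer step for one scalar weight with a zero loss gradient, once with L2 regularization folded into the gradient (Adam with weight decay) and once with decoupled decay (AdamW). The function names and the chosen values (`lr=0.1`, `wd=0.1`) are illustrative assumptions, not part of any library.

```python
import math

def adam_l2_step(p, grad, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """First Adam step with L2 regularization added to the gradient."""
    g = grad + wd * p                    # decay term enters the gradient
    m = (1 - b1) * g                     # first moment (state starts at 0)
    v = (1 - b2) * g * g                 # second moment (state starts at 0)
    m_hat = m / (1 - b1)                 # bias correction for step 1
    v_hat = v / (1 - b2)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(p, grad, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """First AdamW step: the decay is applied to the weight directly."""
    p = p * (1 - lr * wd)                # decoupled decay, not rescaled by Adam
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)
    v_hat = v / (1 - b2)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)

w_l2 = adam_l2_step(1.0, 0.0)
w_decoupled = adamw_step(1.0, 0.0)
print(w_l2, w_decoupled)
```

With a zero gradient, the coupled variant moves the weight by roughly a full `lr` (from 1.0 to about 0.9) because the adaptive normalization rescales the decay term, while the decoupled variant shrinks it only by the factor `1 - lr * wd` (to 0.99).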

Working Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

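# Example setup: a tiny linear model and random data
model = nn.Linear(10, 1)
# Targets have shape (5, 1) to match the model output and avoid silent broadcasting in MSELoss
X, y = torch.randn(5, 10), torch.randn(5, 1)
loss_fn = nn.MSELoss()
# AdamW applies decoupled weight decay
optimizer = optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.1)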

# Define learning rate schedulers
warmup_steps = 10
total_steps = 100
warmup_lr = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine_lr = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-4)
combined_lr = SequentialLR(optimizer, schedulers=[warmup_lr, cosine_lr], milestones=[warmup_steps])

# Training loop
for step in range(total_steps):
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    combined_lr.step()
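
To see which learning rates this schedule is meant to produce, the shape can be written in closed form. The following is a minimal sketch in plain Python, assuming linear warm-up from `start_factor * base_lr` to `base_lr` followed by cosine decay to `eta_min`. `lr_at` is a hypothetical helper; its defaults copy the scheduler settings above, and its values are intended to mirror that configuration rather than reproduce PyTorch's output exactly.

```python
import math

def lr_at(step, base_lr=1e-2, eta_min=1e-4, warmup_steps=10,
          total_steps=100, start_factor=0.1):
    """Learning rate at `step` for linear warm-up followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp of the multiplier from start_factor to 1.0
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine decay from base_lr down to eta_min over the remaining steps
    t = min(step, total_steps) - warmup_steps
    t_max = total_steps - warmup_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

for s in (0, 5, 10, 55, 100):
    print(s, round(lr_at(s), 6))
```

Under these assumptions the rate starts at 1e-3, peaks at 1e-2 when warm-up ends at step 10, passes 5.05e-3 halfway through the decay, and ends at 1e-4.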

Practical Applications

Use Case: Training LLaMA-3 with AdamW and cosine decay for 100k steps
Pitfall: Skipping the warm-up phase can cause gradient instability in the first 5% of training steps
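
As a rough sizing aid for the use case above, the sketch below splits a total step budget into warm-up and cosine-decay steps. The 5% warm-up fraction is an assumption borrowed from the pitfall (it is not a documented LLaMA-3 setting), and `schedule_sizes` is a hypothetical helper.

```python
def schedule_sizes(total_steps, warmup_fraction=0.05):
    """Split a step budget into warm-up steps and cosine-decay steps."""
    if not 0.0 <= warmup_fraction < 1.0:
        raise ValueError("warmup_fraction must be in [0, 1)")
    # Always keep at least one warm-up step
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    decay_steps = total_steps - warmup_steps
    return warmup_steps, decay_steps

warmup_steps, decay_steps = schedule_sizes(100_000)
print(warmup_steps, decay_steps)
```

For a 100k-step run this gives 5,000 warm-up steps and 95,000 decay steps. In the working example, these would map onto `total_iters` for `LinearLR`, `T_max` for `CosineAnnealingLR`, and the `milestones` entry for `SequentialLR`.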

References:

https://machinelearningmastery.com/how-to-speed-up-training-of-language-models/
https://arxiv.org/abs/1711.05101 (AdamW paper)
https://arxiv.org/abs/1608.03983 (SGDR paper)
https://arxiv.org/abs/2312.12813 (Benchmarking Optimizers for LLMs)
