Building a GPT-2 Level LLM for $100: Analyzing Karpathy's nanochat Pipeline

These articles are AI-generated summaries. Please check the original sources for full details.

nanochat: the full pipeline behind Karpathy's "build ChatGPT for $100"

Andrej Karpathy, former Director of AI at Tesla and a founding member of OpenAI, has released the nanochat project on GitHub. The repository has drawn more than 45,000 stars by demonstrating a full LLM pipeline that costs only about $48 to train on H100 GPUs. The project makes GPT-2 level training far more accessible, cutting training time from days to under two hours.

Why This Matters

LLM training has shifted from multi-million dollar investments to runs costing around $100, as the nanochat pipeline demonstrates. Where training setups usually require extensive manual tuning, nanochat applies Chinchilla scaling laws to derive its hyperparameters from a single '--depth' parameter. This automation targets a common failure of manual configuration: a sub-optimal ratio between model size and data volume wastes compute.

By combining bfloat16 precision with the Muon optimizer, the project shows how modern training choices can reach GPT-2 performance at a fraction of the historical cost and energy. In 2019, training GPT-2 cost approximately $43,000; today, the same capability costs roughly $48, about a 900-fold reduction. This democratization lets engineers move beyond black-box APIs to building and understanding the entire stack from the ground up.

Key Insights

  • Significant Cost Reduction: GPT-2 level training costs dropped from $43,000 in 2019 to $48 in 2026 using 8-GPU H100 nodes.
  • Compute-Optimal Scaling: The project uses Chinchilla scaling laws to automatically determine width, heads, and learning rates via the '--depth' parameter.
  • Precision Optimization: Uses bfloat16 (brain floating point 16), which keeps float32's exponent range in half the memory at the cost of mantissa precision, and can roughly double training speed on H100 hardware.
  • Advanced Optimizers: Supports both the standard AdamW and the experimental Muon optimizer for faster convergence in distributed training environments.
  • Full Pipeline Integration: Includes a GPT-4 style BPE tokenizer, pretraining, Supervised Fine-Tuning (SFT), and KV-cache based inference in one codebase.
  • Performance Benchmarking: Targets a CORE (DCLM) score of 0.2565 to compare model capability quantitatively against the original GPT-2.
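
The bfloat16 trade-off in the list above (float32 range, fewer precision bits) can be illustrated without a GPU. The following is a minimal sketch using only the Python standard library; the helper names `to_bfloat16` and `fits_float16` are made up for this illustration and are not part of nanochat. It emulates bfloat16 by truncating a float32 to its top 16 bits, which is simpler than the round-to-nearest that real hardware typically uses.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round-trip x through an emulated bfloat16 (hypothetical helper).

    bfloat16 is the top 16 bits of a float32: 1 sign bit, 8 exponent bits,
    7 mantissa bits. We truncate the low 16 bits instead of rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_float16(x: float) -> bool:
    """Return True if x can be stored as an IEEE float16 without overflow."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    big = 1e30
    print("float16 can hold 1e30:", fits_float16(big))
    print("bfloat16 round-trip of 1e30:", to_bfloat16(big))
    # Precision is the price: small increments near 1.0 are lost.
    print("bfloat16 round-trip of 1.001:", to_bfloat16(1.001))
```

The large value survives the bfloat16 round-trip with under 1% error while float16 cannot represent it at all, whereas the small increment in 1.001 is dropped entirely. That is the range-for-precision exchange the bullet describes.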

Working Examples

Automated hyperparameter selection based on model depth.

python scripts/pretrain.py --depth 12 # GPT-2 Small (124M)
python scripts/pretrain.py --depth 24 # GPT-2 Medium (350M)
python scripts/pretrain.py --depth 36 # GPT-2 Large (774M)
python scripts/pretrain.py --depth 48 # GPT-2 XL (1.5B)
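
How a single depth value could fan out into the remaining settings can be sketched as follows. This is an illustration under stated assumptions, not nanochat's actual logic: the constants `ASPECT_RATIO`, `HEAD_DIM`, and `TOKENS_PER_PARAM` and the function `derive_config` are hypothetical, the 20-tokens-per-parameter figure is the commonly quoted Chinchilla rule of thumb, and the parameter count uses the usual 12 · depth · width² estimate for transformer blocks (embeddings excluded).

```python
from dataclasses import dataclass

# Illustrative assumptions only; none of these values are read from nanochat.
ASPECT_RATIO = 64       # hypothetical: width added per layer of depth
HEAD_DIM = 128          # hypothetical: width of one attention head
TOKENS_PER_PARAM = 20   # Chinchilla-style rule of thumb


@dataclass
class DerivedConfig:
    depth: int
    width: int
    num_heads: int
    approx_block_params: int
    target_tokens: int


def derive_config(depth: int) -> DerivedConfig:
    """Derive model shape and a token budget from depth alone (a sketch)."""
    width = depth * ASPECT_RATIO
    num_heads = max(1, width // HEAD_DIM)
    # ~4*width^2 weights for attention projections plus ~8*width^2 for a
    # 4x-wide MLP, per layer; embeddings are ignored in this estimate.
    approx_block_params = 12 * depth * width * width
    target_tokens = TOKENS_PER_PARAM * approx_block_params
    return DerivedConfig(depth, width, num_heads, approx_block_params, target_tokens)


if __name__ == "__main__":
    for depth in (12, 24, 36, 48):
        print(derive_config(depth))
```

The point of the design is that model size and data budget move together: raising depth grows the parameter estimate, and the token budget scales with it, so the ratio between the two stays fixed instead of being tuned by hand.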

Conceptual core of the next-token prediction pretraining loop.

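import torch.nn.functional as F

for batch in dataloader:                      # batch: (B, T+1) token ids
    input_tokens = batch[:, :-1]              # positions 0..T-1
    target_tokens = batch[:, 1:]              # same sequence shifted left by one
    logits = model(input_tokens)              # (B, T, vocab_size)
    # cross_entropy expects (N, vocab_size) logits and (N,) targets
    loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    optimizer.zero_grad(set_to_none=True)     # clear gradients from the previous step
    loss.backward()
    optimizer.step()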
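
The two ideas inside that loop, shifting the sequence by one position and scoring the prediction with cross-entropy, can be checked in isolation. Below is a dependency-free sketch; `shift_for_next_token` and `cross_entropy` are illustrative helpers written for this example and only mirror what a framework would do on tensors.

```python
import math


def shift_for_next_token(tokens):
    """Split one token sequence into (inputs, targets) offset by one position."""
    return tokens[:-1], tokens[1:]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of each target under softmax(logits).

    logits: list of rows, one row of vocab_size scores per position.
    targets: list of target token ids, one per position.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)


if __name__ == "__main__":
    vocab_size = 8
    inputs, targets = shift_for_next_token([5, 1, 4, 2])
    print("inputs:", inputs, "targets:", targets)
    # An untrained model that scores every token equally.
    uniform_logits = [[0.0] * vocab_size for _ in inputs]
    print("loss:", cross_entropy(uniform_logits, targets))
    print("log(vocab_size):", math.log(vocab_size))
```

A model with uniform predictions scores exactly log(vocab_size), which is a handy sanity check for the first few steps of a real run: an initial loss far from that value often points to a data or shape bug.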

Practical Applications

  • Automated Model Scaling: Engineers can scale from 124M to 1.5B parameters without manually tuning learning rates or batch sizes by following compute-optimal laws.
  • Inference Latency Reduction: The inference engine implements a KV cache so that the keys and values of previous tokens are not recomputed at every decoding step.
  • Pitfall: Skipping SFT (Supervised Fine-Tuning) leaves a base model that continues or repeats input patterns rather than following conversational instructions.
  • Pitfall: Using float16 instead of bfloat16 on modern H100/A100 GPUs. float16's narrower exponent range reduces numerical stability and typically requires loss scaling to avoid overflow and underflow.
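
The KV cache item above can be made concrete with a toy single-head attention decoder. This is a sketch in plain Python, not nanochat's inference engine: `matvec`, `attend`, `decode_no_cache`, and `decode_with_cache` are hypothetical names, and the weights are tiny hand-written matrices. Both decoders return their outputs together with the number of key/value projections they performed.

```python
import math


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]


def decode_no_cache(xs, wq, wk, wv):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, kv_projections = [], 0
    for t in range(len(xs)):
        prefix = xs[: t + 1]
        keys = [matvec(wk, x) for x in prefix]
        values = [matvec(wv, x) for x in prefix]
        kv_projections += len(prefix)
        outputs.append(attend(matvec(wq, xs[t]), keys, values))
    return outputs, kv_projections


def decode_with_cache(xs, wq, wk, wv):
    """Project each token once and append the result to a growing cache."""
    outputs, kv_projections = [], 0
    key_cache, value_cache = [], []
    for x in xs:
        key_cache.append(matvec(wk, x))
        value_cache.append(matvec(wv, x))
        kv_projections += 1
        outputs.append(attend(matvec(wq, x), key_cache, value_cache))
    return outputs, kv_projections


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    wq = [[1.0, 0.0], [0.0, 1.0]]
    wk = [[0.5, 0.1], [0.2, 0.3]]
    wv = [[1.0, 2.0], [3.0, 4.0]]
    full, n_full = decode_no_cache(xs, wq, wk, wv)
    cached, n_cached = decode_with_cache(xs, wq, wk, wv)
    print("projections without cache:", n_full)
    print("projections with cache:", n_cached)
```

For a sequence of T tokens the uncached version performs T(T+1)/2 key/value projections while the cached version performs T, and both produce the same attention outputs. The cost of the cache is memory that grows linearly with the generated length.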
