Skip to main content

On This Page

Nous Research Debuts Lighthouse Attention for 1.7x Faster Long-Context Pretraining

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

Researchers at Nous Research have developed Lighthouse Attention, a training-only method that addresses the quadratic compute scaling of traditional transformer attention. The system achieves a 1.40× to 1.69× end-to-end wall-clock speedup compared to cuDNN-backed SDPA baselines while maintaining or improving final training loss.

Why This Matters

Standard Scaled Dot-Product Attention (SDPA) scales quadratically, making long-context training prohibitively expensive even with IO-aware tiling like FlashAttention. While inference-time sparse methods often degrade performance or require custom kernels, Lighthouse Attention utilizes a symmetric pyramid pooling design that allows researchers to leverage optimized dense-attention kernels during training, effectively bypassing the Θ(N²) bottleneck for sequences up to 1 million tokens.

Key Insights

  • Lighthouse Attention delivers 21× faster forward passes on NVIDIA B200 GPUs at 512K context by symmetric pooling (Nous Research, 2026).
  • Symmetric pooling of Q, K, and V transforms computational cost to O(S² d), where S is the size of the gathered dense sub-sequence.
  • The chunked-bitonic top-K kernel acts as a stratified selector to prevent selection collapse onto narrow spans during the selection stage.
  • Two-stage training recovery: Models resume dense SDPA after Lighthouse training, crossing below the dense baseline loss by step 16,000 (~50.3B tokens).
  • Context Parallelism integration: Lighthouse scales to 1M tokens across 32 Blackwell GPUs using standard ring attention because the gathered sub-sequence is dense.

Practical Applications

  • High-throughput Pretraining: Using Lighthouse for 1.69x end-to-end wall-clock speedup on long sequences; pitfall is using overly deep pyramids (L=5) which can underperform shallower L=3 configurations.
  • Large-scale Context Processing: Scaling to 1M tokens via context parallelism (CP degree 8); pitfall is ignoring the 10% throughput overhead from ring rotation during rotation-heavy phases.
  • Needle-in-a-Haystack Retrieval: Implementing larger k-values (e.g., 2048) for high-accuracy information retrieval; pitfall is using a projection-norm scorer for retrieval-heavy tasks, which can reduce mean retrieval rates.

References:

Continue reading

Next article

NVIDIA SANA-WM: 2.6B-Parameter World Model for 720p Minute-Scale Video on Single GPUs

Related Content