NVIDIA SANA-WM: 2.6B-Parameter World Model for 720p Minute-Scale Video on Single GPUs

These articles are AI-generated summaries. Please check the original sources for full details.

NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

NVIDIA has released SANA-WM, an open-source 2.6B-parameter Diffusion Transformer capable of synthesizing 60-second 720p video sequences with metric-scale 6-DoF camera control. This system achieves 36x higher throughput than multi-GPU baselines by utilizing a novel hybrid linear attention architecture and frame-wise Gated DeltaNet.

Why This Matters

World models designed for embodied AI often rely on full softmax attention, whose cost grows quadratically with sequence length, which makes minute-scale, high-resolution video generation impractical on a single GPU. Most existing open-source models either require multi-GPU clusters for inference or sacrifice visual fidelity and temporal consistency to stay within memory limits, which hinders the development of scalable robotics simulations. SANA-WM addresses these bottlenecks by replacing most of the memory-intensive softmax attention with frame-wise Gated DeltaNet (GDN) recurrence, which maintains a constant-size memory state regardless of video length. This architectural shift lets researchers generate 720p video at 22.0 videos per hour on modest hardware, making long-horizon synthetic environments cheaper to produce.
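
To make the constant-size memory claim concrete, here is a minimal sketch of the generic token-wise gated delta rule from the Gated DeltaNet line of work. It is not SANA-WM's frame-wise variant, whose details this summary does not give. The function names and the scalar gates alpha (decay) and beta (write strength) are illustrative assumptions.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a Dv x Dk state matrix S.

    S_new = alpha * S (I - beta * k k^T) + beta * v k^T
    alpha in [0, 1] decays old memory, beta in [0, 1] sets write strength.
    (Hypothetical helper for illustration, not SANA-WM code.)
    """
    # What the memory currently returns for key k: pred = S @ k
    pred = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    return [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read_state(S, q):
    """Query the memory: o = S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# The state stays Dv x Dk no matter how many tokens are processed.
state = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(10_000):
    state = gated_delta_step(state, [0.6, 0.8], [1.0, -1.0], alpha=0.9, beta=0.5)
print(len(state), len(state[0]))
```

Unlike a softmax attention cache, which grows with every generated token, the only thing carried forward here is the fixed-size matrix, so memory use does not depend on video length.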

Key Insights

  • Hybrid Recurrence (2026): SANA-WM interleaves 15 frame-wise Gated DeltaNet (GDN) blocks with 5 softmax attention blocks to maintain a constant D×D recurrent state while ensuring long-range spatial recall.
  • Algebraic Key-Scaling: Scaling keys by 1/√(D·S) eliminates NaN divergence events during training, a failure mode observed at step 16 with standard L2 normalization.
  • Dual-Branch Camera Control: NVIDIA’s approach combines latent-frame UCPE attention with raw-frame Plücker mixing to capture both global trajectory and intra-stride camera motion, achieving a CamMC of 0.2047.
  • Single-GPU Efficiency: Using NVFP4 quantization, the distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 GPU.
  • Drift Mitigation: A second-stage refiner using rank-384 LoRA adapters on a 17B LTX-2 model reduces long-horizon imaging quality degradation (ΔIQ) from 3.09 to 0.31 on hard trajectories.
  • Metric-Scale Annotation: The training pipeline utilized a modified VIPE engine with Pi3X and MoGe-2 to generate 6-DoF pose data for 212,975 clips across real and synthetic datasets.
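
One way to see why the key-scaling item above could matter is sketched below. The summary does not define D and S, so this assumes D is the per-head key dimension, S is the number of tokens in one frame, and key entries lie in [-1, 1] before scaling. It also assumes a frame-wise delta update applies a transition of the form I - beta * K^T K, where K stacks the frame's keys. All helper names are hypothetical.

```python
import math


def gram_trace(keys):
    """Trace of K^T K (the squared Frobenius norm of K).

    This upper-bounds the largest eigenvalue of K^T K.
    """
    return sum(x * x for k in keys for x in k)


def l2_normalize(keys):
    """Per-token L2 normalization: every key gets unit norm."""
    out = []
    for k in keys:
        norm = math.sqrt(sum(x * x for x in k))
        out.append([x / norm for x in k])
    return out


def scale_keys(keys):
    """Scale every key entry by 1 / sqrt(D * S)."""
    s, d = len(keys), len(keys[0])
    c = 1.0 / math.sqrt(d * s)
    return [[x * c for x in k] for k in keys]


# Worst case: all S keys in a frame point the same way, so K^T K is rank one
# and its largest eigenvalue equals its trace.
S_TOKENS, D_HEAD, BETA = 64, 8, 1.0
keys = [[1.0] * D_HEAD for _ in range(S_TOKENS)]

lam_l2 = gram_trace(l2_normalize(keys))   # approximately S_TOKENS
lam_scaled = gram_trace(scale_keys(keys))  # approximately 1

# Eigenvalue of (I - BETA * K^T K) along the shared key direction.
print("L2-normalized:", 1.0 - BETA * lam_l2)
print("1/sqrt(D*S):  ", 1.0 - BETA * lam_scaled)
```

Under these assumptions, per-token L2 normalization lets the transition eigenvalue reach a magnitude near S, which compounds across steps and overflows, while the 1/√(D·S) scaling keeps the trace of K^T K at or below 1, so the transition stays contractive for beta in [0, 1].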

Practical Applications

  • Embodied AI Simulation: Using SANA-WM to generate long-horizon environmental rollouts for robotics training; a common pitfall is relying on softmax-only models, which can run out of memory (OOM) during 60-second generation.
  • Synthetic Data Generation: Producing high-fidelity 720p training video for autonomous systems on single-GPU workstations; neglecting the fine-branch Plücker mixing can lead to loss of intra-stride motion accuracy.
  • Rapid Prototyping: Deploying the few-step distilled variant for interactive world-model synthesis; failing to use the second-stage refiner results in significant structural artifacts over minute-scale sequences.
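
For readers unfamiliar with the Plücker mixing mentioned above, the sketch below computes the standard 6-D Plücker embedding (d, o × d) of a pixel's viewing ray. It assumes a pinhole camera with intrinsics fx, fy, cx, cy and a camera-to-world pose (R, t). How SANA-WM mixes these embeddings into raw frames is not described in this summary, so the function is an illustration rather than the model's implementation.

```python
import math


def cross(a, b):
    """3-D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    R: 3x3 camera-to-world rotation as nested lists (assumed convention).
    t: camera center in world coordinates, in metric units.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate into the world frame and normalize to unit length.
    d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    # The moment o x d encodes where the ray sits in space.
    return d + cross(t, d)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = plucker_ray(320, 240, 500.0, 500.0, 320.0, 240.0, identity, [1.0, 0.0, 0.0])
print(ray)
```

The moment term scales linearly with the camera position t, so the embedding only carries real-world distances if the pose annotations are metric-scale.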
