NVIDIA SANA-WM: 2.6B-Parameter World Model for 720p Minute-Scale Video on Single GPUs

NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

NVIDIA has released SANA-WM, an open-source 2.6B-parameter Diffusion Transformer that synthesizes 60-second 720p video sequences with metric-scale 6-DoF camera control. The system achieves 36x higher throughput than multi-GPU baselines by using a hybrid linear-attention architecture built around frame-wise Gated DeltaNet (GDN).

Why This Matters

World models for embodied AI typically rely on softmax attention, whose compute cost grows quadratically with sequence length, making minute-scale, high-resolution video generation impractical on a single GPU. Most existing open-source models either require multi-GPU clusters for inference or sacrifice visual fidelity and temporal consistency to stay within memory limits, which hinders scalable robotics simulation. SANA-WM addresses this bottleneck by replacing most of the memory-intensive softmax attention with frame-wise Gated DeltaNet (GDN) recurrence, which maintains a constant-size memory state regardless of video length. This architectural shift lets researchers generate 720p video at 22.0 videos per hour on modest hardware, making long-horizon synthetic environments far cheaper to produce.
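
The constant-size state can be sketched in a few lines. The following is a minimal, token-wise version of the gated delta rule from the Gated DeltaNet literature, written in NumPy. SANA-WM's frame-wise variant and its fused kernels are not detailed in this article, so the function name, argument names, and shapes below are illustrative assumptions rather than NVIDIA's implementation.

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Token-wise gated delta rule (illustrative sketch, not SANA-WM code).

    q, k: (T, d_k) queries and keys; v: (T, d_v) values
    alpha: (T,) decay gates in [0, 1]; beta: (T,) write strengths in [0, 1]
    Returns the outputs (T, d_v) and the final state (d_v, d_k).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_v, d_k))          # fixed size, independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        k_t = k[t]
        # Decay the memory and erase what is currently stored under k_t:
        # alpha * S (I - beta k k^T) = alpha * (S - beta (S k) k^T)
        state = alpha[t] * (state - beta[t] * np.outer(state @ k_t, k_t))
        # Write the new value under the same key.
        state = state + beta[t] * np.outer(v[t], k_t)
        out[t] = state @ q[t]
    return out, state

rng = np.random.default_rng(0)
T, d = 1000, 16
k = rng.standard_normal((T, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)   # unit-norm keys for stability
q = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))
out, state = gated_delta_scan(q, k, v, np.full(T, 0.95), np.full(T, 0.5))
print(out.shape, state.shape)  # (1000, 16) (16, 16)
```

However long the sequence, `state` stays a d_v × d_k matrix, whereas a softmax attention cache grows with every token.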

Key Insights

Hybrid Recurrence (2026): SANA-WM interleaves 15 frame-wise Gated DeltaNet (GDN) blocks with 5 softmax attention blocks to maintain a constant D×D recurrent state while ensuring long-range spatial recall.
Algebraic Key-Scaling: Scaling keys by 1/√(D·S) eliminates NaN divergence events during training, a failure mode observed at step 16 with standard L2 normalization.
Dual-Branch Camera Control: NVIDIA’s approach combines latent-frame UCPE attention with raw-frame Plücker mixing to capture both global trajectory and intra-stride camera motion, achieving a CamMC of 0.2047.
Single-GPU Efficiency: Using NVFP4 quantization, the distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 GPU.
Drift Mitigation: A second-stage refiner using rank-384 LoRA adapters on a 17B LTX-2 model reduces long-horizon imaging quality degradation (ΔIQ) from 3.09 to 0.31 on hard trajectories.
Metric-Scale Annotation: The training pipeline utilized a modified VIPE engine with Pi3X and MoGe-2 to generate 6-DoF pose data for 212,975 clips across real and synthetic datasets.
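
The Algebraic Key-Scaling item can be illustrated with a small numerical check. The article does not define D and S, so this sketch assumes D is the per-head key dimension and S is the number of tokens folded into one frame-wise state update, and it further assumes key entries are bounded in (-1, 1) (here via `tanh`). Under those assumptions, scaling by 1/√(D·S) keeps the spectral norm of KᵀK at or below 1, so the erase factor I − βKᵀK cannot amplify the state for β in [0, 1]. With per-token L2 normalization, the S keys together can push that norm well above 1. This is one plausible reading of the failure mode, not NVIDIA's stated derivation.

```python
import numpy as np

def erase_factor_norm(K, beta=1.0):
    """Spectral norm of (I - beta * K^T K) for a block of keys K of shape (S, D)."""
    D = K.shape[1]
    return np.linalg.norm(np.eye(D) - beta * (K.T @ K), 2)

def l2_normalize(K):
    """Per-token L2 normalization: every key has unit length."""
    return K / np.linalg.norm(K, axis=1, keepdims=True)

def algebraic_scale(K):
    """Scale all keys by 1/sqrt(D*S); assumes entries of K lie in [-1, 1]."""
    S, D = K.shape
    return K / np.sqrt(D * S)

rng = np.random.default_rng(0)
S, D = 256, 64                                 # hypothetical sizes
K_raw = np.tanh(rng.standard_normal((S, D)))   # bounded entries (assumption)

norm_l2 = erase_factor_norm(l2_normalize(K_raw))
norm_scaled = erase_factor_norm(algebraic_scale(K_raw))
print(f"L2-normalized keys:  ||I - K^T K|| = {norm_l2:.3f}")
print(f"1/sqrt(D*S) scaling: ||I - K^T K|| = {norm_scaled:.3f}")
```

A norm above 1 means repeated frame-wise updates can grow the state without bound, which is consistent with the early NaN divergence described above. A norm at or below 1 rules that growth out.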

Practical Applications

Embodied AI Simulation: Using SANA-WM to generate long-horizon environmental rollouts for robotics training; a common pitfall is relying on softmax-only models, which can run out of memory (OOM) during 60-second generation.
Synthetic Data Generation: Producing high-fidelity 720p training video for autonomous systems on single-GPU workstations; neglecting the fine-branch Plücker mixing can lead to loss of intra-stride camera motion accuracy.
Rapid Prototyping: Deploying the few-step distilled variant for interactive world-model synthesis; failing to use the second-stage refiner results in significant structural artifacts over minute-scale sequences.
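
The Plücker mixing mentioned under Synthetic Data Generation relies on a per-pixel ray encoding. The sketch below builds standard Plücker ray coordinates (unit direction d plus moment o × d) for one camera. How SANA-WM mixes these maps with raw frames is not described in this article, so the pinhole intrinsics, the camera-to-world pose convention, and the function name are assumptions made for illustration.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates (direction, moment), shape (H, W, 6).

    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) pose in metric units.
    """
    R = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]                     # camera centre in world frame
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T                   # back-project to camera rays
    dirs = dirs @ R.T                                 # rotate into the world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(origin, dirs)                   # o x d, carries translation
    return np.concatenate([dirs, moment], axis=-1)

K = np.array([[500.0, 0.0, 640.0],
              [0.0, 500.0, 360.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.0, 1.5, 0.0]                         # camera 1.5 m along +y
rays = plucker_rays(K, pose, 720, 1280)
print(rays.shape)  # (720, 1280, 6)
```

Because the moment term scales with the camera position, the encoding changes when the translation is expressed in different units, which is why metric-scale pose annotation matters for this kind of conditioning.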

References:

https://www.marktechpost.com/2026/05/16/nvidia-introduces-sana-wm-a-2-6b-parameter-open-source-world-model-that-generates-minute-scale-720p-video-on-a-single-gpu/
