
NVIDIA-affiliated researchers publish SANA-WM world model paper and code

The paper claims a distilled version can denoise a minute of 720p video on a single RTX 5090, but model weights and documentation are not yet public.

Monday, May 18, 2026

Nine researchers from NVIDIA’s NVlabs published a paper and code repository on May 14 detailing SANA-WM, a 2.6‑billion‑parameter world model that generates 60‑second, 720p video from a single starting image and a six‑degrees‑of‑freedom (6‑DoF) camera path. The arXiv submission, timestamped 17:58 UTC, marks a research step toward efficient long‑horizon video generation but does not yet offer a ready‑to‑run product: model weights and dedicated documentation were not publicly available at the time of writing.

The work matters because minute‑scale world models have historically required industrial compute clusters. The authors report that a distilled and quantized variant can denoise a full minute of 720p footage on a single GeForce RTX 5090, pointing to more accessible inference. Training the model, however, remains a data‑center exercise: 64 H100 GPUs over roughly two weeks.

Unlike text‑to‑video systems, SANA‑WM takes an initial frame and a precise 6‑DoF camera trajectory as input, then predicts how the scene evolves as the viewpoint moves. The output is a coherent minute‑long clip at 720p resolution. Such world models are sought after for robotics simulation, autonomous vehicle training, and generating synthetic training data, where long‑range consistency reduces the need for costly real‑world capture.
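A 6‑DoF camera path can be pictured as a list of poses, one per frame, each holding three translation values and three rotation values. The sketch below is purely illustrative: `CameraPose`, `forward_dolly`, the 16 fps frame rate, and the 0.5 m/s speed are hypothetical choices made for this example, not SANA‑WM's actual input format, which the coverage here does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraPose:
    """One camera pose: 3 translation + 3 rotation values = 6 degrees of freedom."""
    x: float      # metres, right
    y: float      # metres, up
    z: float      # metres, forward
    roll: float   # radians
    pitch: float  # radians
    yaw: float    # radians


def forward_dolly(duration_s: float, fps: int, speed_m_per_s: float) -> list[CameraPose]:
    """Build a straight-ahead path with one pose per frame (hypothetical helper)."""
    n_frames = round(duration_s * fps)
    return [
        CameraPose(0.0, 0.0, speed_m_per_s * i / fps, 0.0, 0.0, 0.0)
        for i in range(n_frames)
    ]


# Assumed values: a 60 s clip at 16 fps with the camera moving forward at 0.5 m/s.
path = forward_dolly(duration_s=60.0, fps=16, speed_m_per_s=0.5)
print(len(path), "poses; final forward offset:", path[-1].z, "m")
```

A world model conditioned this way would receive the starting image plus a pose sequence like `path`, and would be expected to render what the camera sees at each pose.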

The paper sits within NVlabs’ SANA family, which has previously released efficient image and video models such as SANA‑Video and LongSANA. For SANA‑WM, the team trained the core diffusion transformer on 212,975 publicly available video clips (about 213,000) paired with metric‑scale pose supervision. That training ran for approximately 15 days on 64 H100 GPUs, with the full pipeline consuming up to 74.7 GB of memory. Some secondary accounts add a separate 3.5‑day adaptation phase for the variational autoencoder, bringing total preparation to roughly 18.5 days.
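For a sense of scale, the training figures above convert to GPU‑hours with simple arithmetic. The sketch below uses only the numbers quoted in this article; `gpu_hours` is a hypothetical helper, and the GPU count for the VAE adaptation phase is not stated, so only its duration is added.

```python
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps num_gpus busy for the given days."""
    return num_gpus * days * HOURS_PER_DAY


# Main diffusion-transformer training: 64 H100s for approximately 15 days.
main_training = gpu_hours(num_gpus=64, days=15)

# Wall-clock total if the 3.5-day VAE adaptation phase (secondary accounts) is included.
total_days = 15 + 3.5

print(f"{main_training:,.0f} H100 GPU-hours for main training")
print(f"{total_days} days of total preparation")
```

Because the 15‑day figure is itself approximate, the result is an order‑of‑magnitude estimate, not a precise accounting.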

On the inference side, the paper’s most discussed figure is the 34‑second denoising time on a single GeForce RTX 5090. The authors report that a distilled variant using NVIDIA’s NVFP4 quantization achieved that speed after applying a memory‑management technique called “sink” to avoid out‑of‑memory errors. The 34 seconds cover denoising only; end‑to‑end wall‑clock time, including frame preprocessing, decoding, and an optional refiner, has not been measured outside the paper.
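One way to read the 34‑second figure is as a real‑time factor: seconds of video produced per second of compute. The sketch below is a minimal calculation under the assumption that the 34 seconds apply to a full 60‑second clip, as the paper's claim is described here; `realtime_factor` is a hypothetical helper, and the result covers denoising only.

```python
def realtime_factor(video_seconds: float, compute_seconds: float) -> float:
    """Seconds of video generated per second of compute (above 1 is faster than real time)."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return video_seconds / compute_seconds


# 60 s of 720p video, 34 s of denoising on one RTX 5090 (figures quoted above).
factor = realtime_factor(video_seconds=60.0, compute_seconds=34.0)
print(f"Denoising runs at about {factor:.2f}x real time")
```

Any preprocessing, decoding, or refiner time would lower this factor, so it is best treated as an upper bound on end‑to‑end speed.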

In a self‑reported benchmark, SANA‑WM with its refiner processed 22.0 one‑minute videos per hour on eight H100s, a roughly 36‑fold throughput advantage over LingBot‑World’s 0.6 videos per hour and a similar lead over HY‑WorldPlay. The authors also claim comparable visual quality under VBench metrics. No independent replication of these results has appeared.
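The throughput comparison can be checked from the two rates quoted above. The sketch below assumes both systems were measured on the same eight‑H100 setup, which this article does not confirm; the variable names are illustrative.

```python
# Self-reported throughput in one-minute videos per hour (figures quoted above).
SANA_WM_VIDEOS_PER_HOUR = 22.0
LINGBOT_WORLD_VIDEOS_PER_HOUR = 0.6
NUM_GPUS = 8  # H100s used for the SANA-WM measurement

# Ratio of the two rounded rates.
speedup = SANA_WM_VIDEOS_PER_HOUR / LINGBOT_WORLD_VIDEOS_PER_HOUR

# GPU-minutes spent per one-minute video at the SANA-WM rate.
gpu_minutes_per_video = NUM_GPUS * 60 / SANA_WM_VIDEOS_PER_HOUR

print(f"Speedup from rounded rates: {speedup:.1f}x")
print(f"GPU-minutes per video: {gpu_minutes_per_video:.1f}")
```

The rounded rates give a ratio slightly above the 36‑fold figure quoted from the paper, a gap that is plausibly due to rounding of the underlying measurements.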

Despite the paper’s description of SANA‑WM as an open‑source model and the GitHub repository stating it is “released,” the dedicated documentation page returned a 404 error when checked, and no model weights were attached to the Hugging Face paper page. NVIDIA has not issued a corporate press release, keeping the announcement within research channels. The code repository carries an Apache‑2.0 license, but the license for any eventual weights remains unknown.

The SANA‑WM paper adds weight to the argument that inference for long‑horizon world models can shift toward consumer‑grade hardware, at least for aggressively quantized variants. For research teams without access to large GPU clusters, that possibility is significant, provided the weights, inference scripts, and independent benchmarks eventually arrive. For now, the work stands as a notable efficiency benchmark with an incomplete public release.
