SANA-WM (Bidirectional)
SANA-WM is an efficient open-source world model trained natively for one-minute generation. The bidirectional checkpoint released here is a 2.6B-parameter image-to-video diffusion transformer that synthesises 720p, minute-scale videos with precise 6-DoF camera control, paired with the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.
Four core designs drive the architecture:
- Hybrid Linear Attention — frame-wise Gated DeltaNet combined with softmax attention every Nth block for memory-efficient long-context modelling.
- Dual-Branch Camera Control — independent main and camera branches enable precise per-frame trajectory adherence.
- Two-Stage Generation Pipeline — a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
- Robust Annotation Pipeline — metric-scale 6-DoF camera poses extracted from public video corpora yield spatiotemporally consistent action supervision.
Paper: https://arxiv.org/abs/2605.15178
@article{zhu2026sanawm,
title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
journal = {arXiv preprint arXiv:2605.15178},
year = {2026},
}
Repository layout
| Component | Path in repo | Size |
|---|---|---|
| Sana DiT (Stage 1) | dit/sana_wm_1600m_720p.safetensors | 10 GB |
| LTX-2 VAE (diffusers) | vae/ | 2 GB |
| LTX-2 refiner (Stage 2) | refiner/refiner.safetensors | 41 GB |
| Gemma text encoder for the refiner | refiner/text_encoder/ | 46 GB |
| Inference config | config.yaml | - |
The Sana text encoder (gemma-2-2b-it) is not bundled here — it is
fetched on demand from the public Hugging Face mirror.
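Because the components differ widely in size, it can be useful to fetch only a subset. The sketch below sums the sizes listed in the table and builds glob patterns for the `allow_patterns` argument of `huggingface_hub.snapshot_download`. Which components a given run actually loads is not stated here, so the subset in the example is purely illustrative; `COMPONENTS`, `select`, and `fetch` are hypothetical helper names, and the repo id is the one this model card belongs to.

```python
# Repo paths and approximate sizes (GB), copied from the table above.
# config.yaml has no size listed, so it is counted as 0 here.
COMPONENTS = {
    "dit": ("dit/sana_wm_1600m_720p.safetensors", 10),
    "vae": ("vae/", 2),
    "refiner": ("refiner/refiner.safetensors", 41),
    "refiner_text_encoder": ("refiner/text_encoder/", 46),
    "config": ("config.yaml", 0),
}

REPO_ID = "Efficient-Large-Model/SANA-WM_bidirectional"


def select(names):
    """Return (glob patterns, approximate total size in GB) for the chosen components."""
    patterns, total_gb = [], 0
    for name in names:
        path, size_gb = COMPONENTS[name]
        # Directory entries end with "/"; match every file below them.
        patterns.append(path + "*" if path.endswith("/") else path)
        total_gb += size_gb
    return patterns, total_gb


def fetch(names, local_dir):
    """Download only the selected components (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download

    patterns, _ = select(names)
    return snapshot_download(repo_id=REPO_ID, allow_patterns=patterns, local_dir=local_dir)


_, full_gb = select(list(COMPONENTS))
subset_patterns, subset_gb = select(["dit", "vae", "config"])  # illustrative subset only
print(f"all components: ~{full_gb} GB")
print(f"subset: ~{subset_gb} GB, patterns: {subset_patterns}")
```

Calling `fetch(["dit", "vae", "config"], "./sana_wm")` would then download only the files matching those patterns.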
Usage
python inference_video_scripts/inference_sana_wm.py \
--image asset/sana_wm/demo_0.png \
--prompt asset/sana_wm/demo_0.txt \
--action "w-80,jw-40,w-40,lw-60,w-100" \
--translation_speed 0.055 \
--rotation_speed_deg 1.2 \
--num_frames 321 \
--output_dir results/demo
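In the command above, the segment lengths in the `--action` string sum to 320, one less than `--num_frames 321`, which fits the (F+1, 4, 4) trajectory described under Inputs. The sketch below shows one way such a string could be rolled out into camera-to-world poses. It is not the script's actual implementation: the key meanings (w/s/a/d translate, j/l yaw, i/k pitch), the axis convention, the per-frame interpretation of the two speed flags, and all function names are assumptions made for illustration.

```python
import numpy as np

# ASSUMED key semantics (not confirmed by this model card).
# Camera axes are assumed OpenCV-style: x right, y down, z forward.
MOVE = {"w": (0, 0, 1), "s": (0, 0, -1), "a": (-1, 0, 0), "d": (1, 0, 0)}
YAW = {"j": -1.0, "l": 1.0}    # j turns left, l turns right (assumed)
PITCH = {"i": 1.0, "k": -1.0}  # i looks up, k looks down (assumed)


def parse_action(spec):
    """Split 'w-80,jw-40' into [('w', 80), ('jw', 40)]."""
    segments = []
    for part in spec.split(","):
        keys, frames = part.strip().rsplit("-", 1)
        if not keys or any(k not in "wasdijkl" for k in keys):
            raise ValueError(f"unknown key in segment {part!r}")
        segments.append((keys, int(frames)))
    return segments


def rot_y(deg):
    """Rotation about the camera y axis; positive turns the view to the right."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def rot_x(deg):
    """Rotation about the camera x axis; positive tilts the view up (y points down)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def rollout(spec, translation_speed, rotation_speed_deg):
    """Return camera-to-world poses of shape (F+1, 4, 4), starting at identity."""
    pose = np.eye(4)
    poses = [pose.copy()]
    for keys, frames in parse_action(spec):
        step = np.eye(4)  # per-frame motion for this segment, in camera coordinates
        for k in keys:
            if k in MOVE:
                step[:3, 3] += translation_speed * np.array(MOVE[k], dtype=float)
            elif k in YAW:
                step[:3, :3] = step[:3, :3] @ rot_y(YAW[k] * rotation_speed_deg)
            else:
                step[:3, :3] = step[:3, :3] @ rot_x(PITCH[k] * rotation_speed_deg)
        for _ in range(frames):
            pose = pose @ step  # right-multiply: move in the camera's local frame
            poses.append(pose.copy())
    return np.stack(poses)


traj = rollout("w-80,jw-40,w-40,lw-60,w-100", 0.055, 1.2)
print(traj.shape)  # (321, 4, 4)
```

An array built this way could be saved with `np.save` and passed through `--camera` instead of `--action`, provided its length and conventions match what the script expects.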
Weights are fetched from this repository on first use. Pass --no_refiner
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of --config / --model_path /
--refiner_checkpoint / --refiner_gemma_root with local paths.
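As a rough illustration of the offline setup, the sketch below assembles the inference command with every weight-related flag pointing into a local copy of this repository. The pairing of flags to files is inferred from the repository layout table and is an assumption, as are the helper name `offline_command` and the example directory `/data/sana_wm`.

```python
import shlex
from pathlib import Path

SCRIPT = "inference_video_scripts/inference_sana_wm.py"


def offline_command(root, image, prompt, action, output_dir, use_refiner=True):
    """Build the inference command with weight flags pointing under `root`."""
    root = Path(root)
    # ASSUMPTION: flag-to-path pairing inferred from the repository layout table.
    overrides = {
        "--config": root / "config.yaml",
        "--model_path": root / "dit" / "sana_wm_1600m_720p.safetensors",
    }
    if use_refiner:
        overrides["--refiner_checkpoint"] = root / "refiner" / "refiner.safetensors"
        overrides["--refiner_gemma_root"] = root / "refiner" / "text_encoder"

    cmd = ["python", SCRIPT, "--image", image, "--prompt", prompt,
           "--action", action, "--output_dir", output_dir]
    for flag, path in overrides.items():
        cmd += [flag, str(path)]
    if not use_refiner:
        cmd.append("--no_refiner")  # skip the LTX-2 refiner, as described above
    return cmd


cmd = offline_command(
    "/data/sana_wm",
    image="asset/sana_wm/demo_0.png",
    prompt="asset/sana_wm/demo_0.txt",
    action="w-80,jw-40,w-40,lw-60,w-100",
    output_dir="results/demo",
)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` directly. Since the Sana text encoder is fetched on demand, a fully offline run presumably also needs it in the local Hugging Face cache.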
Inputs
| Argument | Format |
|---|---|
| --image | RGB image (any PIL-readable format), used as the first frame. |
| --prompt | UTF-8 text file containing the conditioning prompt. |
| --camera | NumPy .npy of shape (F, 4, 4): per-frame camera-to-world matrices. |
| --action | WASD/IJKL DSL, e.g. "w-80,jw-40,w-40,lw-60,w-100". We roll it out to a (F+1, 4, 4) trajectory. Mutually exclusive with --camera. |
| --intrinsics | Optional. .npy of shape (3, 3), (F, 3, 3), or (4,). If omitted, we estimate intrinsics from --image with Pi3X and abort if the resulting FOV is outside [25°, 120°]. |
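To make the FOV bound concrete, the sketch below applies the standard pinhole relation FOV = 2·atan(size / (2·focal)) and rejects values outside [25°, 120°]. Whether the script checks the horizontal or the vertical FOV is not stated, so using the image width here is an assumption, and `fov_deg` and `check_fov` are hypothetical helper names.

```python
import math

FOV_MIN_DEG, FOV_MAX_DEG = 25.0, 120.0  # bounds quoted in the table above


def fov_deg(focal_px, size_px):
    """Pinhole field of view along one image axis, in degrees."""
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))


def check_fov(fx, width):
    """Return the FOV in degrees, or raise if it falls outside the accepted range."""
    fov = fov_deg(fx, width)
    if not FOV_MIN_DEG <= fov <= FOV_MAX_DEG:
        raise ValueError(
            f"FOV {fov:.1f} deg is outside [{FOV_MIN_DEG}, {FOV_MAX_DEG}] deg"
        )
    return fov


# A focal length of half the image width corresponds to a 90 degree FOV.
print(round(check_fov(fx=640.0, width=1280), 1))  # 90.0
```

Under this reading, very long focal lengths (narrow FOV) and very short ones (wide FOV) are both rejected, so supplying `--intrinsics` explicitly may help when the estimate from the image is unreliable.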
The output frame size is fixed at 704 x 1280; input images are resized
with their aspect ratio preserved, then center-cropped to that resolution.
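The resize-then-crop step can be sketched as plain geometry: scale the image by the smallest factor that covers the target, then cut the centered window. The sketch assumes "704 x 1280" means height x width, and the Pillow resampling filter in `preprocess` is an arbitrary choice; neither is confirmed here, and both function names are hypothetical.

```python
TARGET_H, TARGET_W = 704, 1280  # ASSUMPTION: "704 x 1280" read as height x width


def resize_and_crop_box(src_w, src_h, dst_w=TARGET_W, dst_h=TARGET_H):
    """Return ((new_w, new_h), (left, top, right, bottom)) for resize + center crop."""
    scale = max(dst_w / src_w, dst_h / src_h)  # smallest scale that covers the target
    new_w = max(dst_w, round(src_w * scale))   # guard against rounding below target
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


def preprocess(path):
    """Apply the geometry to an image file with Pillow (assumed to be installed)."""
    from PIL import Image

    img = Image.open(path).convert("RGB")
    size, box = resize_and_crop_box(*img.size)
    return img.resize(size, Image.LANCZOS).crop(box)


# A 1920x1080 input is scaled to 1280x720, then 8 rows are trimmed top and bottom.
print(resize_and_crop_box(1920, 1080))  # ((1280, 720), (0, 8, 1280, 712))
```

For portrait inputs most of the height is cropped away, so framing the subject near the vertical center of the image helps keep it in view.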
License
Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE inherit the LTX-2 upstream license.