MV-S2V: Multi-View Subject-Consistent Video Generation
Ziyang Song¹, Xinyu Gong², Bangya Liu³, Zelin Zhao⁴
¹The Hong Kong Polytechnic University, ²The University of Texas at Austin, ³University of Wisconsin–Madison, ⁴Georgia Institute of Technology
SIGGRAPH 2026
Overview
MV-S2V is the first framework that synthesizes videos from multiple reference views of the same subject to enforce 3D-level subject consistency. Existing subject-to-video methods condition on a single reference image and are therefore forced to hallucinate unseen details when generating novel views. MV-S2V tackles this by:
- Multi-view conditioning: accepts an arbitrary number of reference views and produces a coherent video that respects all of them.
- Temporally-Shifted RoPE (TS-RoPE): a positional-encoding scheme that disambiguates cross-subject and cross-view references.
- Synthetic + real-world dataset: a data curation pipeline for high-quality multi-view subject-to-video training data.
The model in this repo is a 14B Subject-to-Video DiT built on top of Wan2.1.
Todo List
- Inference code
- Pretrained 14B checkpoint
- Benchmark dataset (MV-S2V-Bench)
- Evaluation suite
Quickstart
Installation
git clone https://github.com/szy-young/mv-s2v.git
cd mv-s2v
# Ensure torch >= 2.4.0
pip install -r requirements.txt
# For multi-GPU inference
pip install "xfuser>=0.4.1"
Model Download
| Model | Resolution | Download |
|---|---|---|
| MV-S2V-14B | 480p | HuggingFace |
MV-S2V re-uses the VAE and T5 text encoder from Wan2.1-T2V-14B. Download the base checkpoint first:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
Then place our DiT weights under ./checkpoints/, named either diffusion_pytorch_model.safetensors (single file) or diffusion_pytorch_model.safetensors.index.json (sharded checkpoint index).
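Before launching inference, it can help to confirm that both checkpoints are where the scripts expect them. The following is a minimal sketch, not part of this repo: `find_dit_weights` and `check_layout` are hypothetical helper names, and the only things checked are the directory and file names mentioned above.

```python
from pathlib import Path

# The two file-name endings mentioned above for the MV-S2V DiT weights.
DIT_SUFFIXES = (".safetensors", ".safetensors.index.json")


def find_dit_weights(prefix="./checkpoints/diffusion_pytorch_model"):
    """Return the first existing DiT weight file for `prefix`, or None."""
    for suffix in DIT_SUFFIXES:
        candidate = Path(str(prefix) + suffix)
        if candidate.is_file():
            return candidate
    return None


def check_layout(wan_dir="./Wan2.1-T2V-14B",
                 dit_prefix="./checkpoints/diffusion_pytorch_model"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not Path(wan_dir).is_dir():
        problems.append(f"missing Wan2.1 base directory: {wan_dir}")
    if find_dit_weights(dit_prefix) is None:
        problems.append(f"no DiT weights found for prefix: {dit_prefix}")
    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print("WARNING:", problem)
```

The check only looks at names on disk; it does not validate the contents of the weight files.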
Run Multi-View Subject-to-Video Generation
The simplest way to reproduce all examples in this repo:
bash infer.sh
infer.sh exposes two paths to edit:
WAN_CKPT="./Wan2.1-T2V-14B" # Wan2.1 base (VAE + T5)
DIT_CKPT="./checkpoints/diffusion_pytorch_model" # MV-S2V DiT weights
Single-prompt inference (multi-GPU FSDP + xDiT USP)
torchrun --nproc_per_node=2 --master-port 14435 generate.py \
--task s2v-14B \
--size 640*640 \
--frame_num 121 --sample_fps 24 \
--ckpt_dir ${WAN_CKPT} \
--phantom_ckpt ${DIT_CKPT} \
--dit_fsdp --t5_fsdp --ulysses_size 2 \
--ref_image "examples/object_images/box_multi_door_colored/object_0.png,examples/object_images/box_multi_door_colored/object_1.png,examples/object_images/box_multi_door_colored/object_2.png,examples/object_images/box_multi_door_colored/object_3.png" \
--prompt "The video starts with a small, wooden activity cube with colorful panels and metal locks, topped with a teal plastic handle, sitting on a child's playroom floor with a soft rug and scattered toys in the background. As the camera smoothly orbits around the activity cube, the background gradually reveals a low bookshelf filled with children's books and puzzles on one side, and a painted wall adorned with cheerful animal decals and a cozy reading nook with cushions on the other." \
--save_file examples/videos/box_multi_door_colored.mp4 \
--base_seed 42 \
--rpe_mode ts_rope \
--sample_guide_scale_img 2.5
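For scripted runs, the same command can be assembled in Python. The sketch below is an illustration only: `build_command` and `clip_seconds` are hypothetical helpers, not part of this repo, and every flag is copied from the command above. It keeps `--ulysses_size` tied to `--nproc_per_node` and passes arguments as a list, so a value such as `640*640` is never subject to shell glob expansion.

```python
import shlex


def clip_seconds(frame_num=121, sample_fps=24):
    """Duration of the saved clip in seconds (frames / frames-per-second)."""
    return frame_num / sample_fps


def build_command(nproc, ref_images, prompt, save_file,
                  wan_ckpt="./Wan2.1-T2V-14B",
                  dit_ckpt="./checkpoints/diffusion_pytorch_model",
                  frame_num=121, sample_fps=24, seed=42):
    """Build the torchrun argument list shown above for `nproc` GPUs."""
    return [
        "torchrun", f"--nproc_per_node={nproc}", "--master-port", "14435",
        "generate.py",
        "--task", "s2v-14B",
        "--size", "640*640",
        "--frame_num", str(frame_num), "--sample_fps", str(sample_fps),
        "--ckpt_dir", wan_ckpt,
        "--phantom_ckpt", dit_ckpt,
        # Ulysses degree is kept equal to the number of processes.
        "--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(nproc),
        "--ref_image", ",".join(ref_images),
        "--prompt", prompt,
        "--save_file", save_file,
        "--base_seed", str(seed),
        "--rpe_mode", "ts_rope",
        "--sample_guide_scale_img", "2.5",
    ]


if __name__ == "__main__":
    cmd = build_command(
        nproc=2,
        ref_images=["examples/object_images/box_multi_door_colored/object_0.png"],
        prompt="A wooden activity cube on a playroom floor.",
        save_file="examples/videos/box_multi_door_colored.mp4",
    )
    print(shlex.join(cmd))  # copy-pasteable, shell-quoted command line
    print(f"clip length: {clip_seconds():.2f} s")
```

The list returned by `build_command` can be handed to `subprocess.run(cmd, check=True)` from the repo root once the checkpoints are in place.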
Key arguments
| Flag | Description |
|---|---|
| --ref_image | Comma-separated paths to reference images. Mix object and human views freely (≤ 4 object views + 1 optional human view). |
| --rpe_mode | RoPE scheme for reference tokens: vanilla, ss_rope (spatial-shift), ts_rope (our Temporally-Shifted RoPE, recommended). |
| --sample_guide_scale_img | Image classifier-free guidance scale. We use 2.5 for object-centric and HOI scenes. |
| --view_number | Sub-sample views at inference time (-1 = use all). |
| --ulysses_size | xDiT Ulysses parallel degree (set equal to --nproc_per_node). |
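The `--ref_image` value can be built from a folder of views rather than typed by hand. This is a minimal sketch under stated assumptions: `collect_object_views` and `join_ref_images` are hypothetical helpers, the `object_*.png` pattern is taken from the example paths in this repo, and appending the human view last is an arbitrary choice made here (the table only says views can be mixed freely).

```python
from pathlib import Path

MAX_OBJECT_VIEWS = 4  # limit stated in the table above


def collect_object_views(folder):
    """Return sorted paths of files named object_*.png inside `folder`."""
    return sorted(str(p) for p in Path(folder).glob("object_*.png"))


def join_ref_images(object_views, human_view=None):
    """Validate the view count and return the comma-separated flag value."""
    if not object_views and human_view is None:
        raise ValueError("at least one reference image is required")
    if len(object_views) > MAX_OBJECT_VIEWS:
        raise ValueError(
            f"got {len(object_views)} object views, "
            f"at most {MAX_OBJECT_VIEWS} are supported"
        )
    paths = list(object_views)
    if human_view is not None:
        paths.append(human_view)  # assumption: human view goes last
    if any("," in p for p in paths):
        raise ValueError("paths must not contain commas")
    return ",".join(paths)


if __name__ == "__main__":
    views = collect_object_views("examples/object_images/box_multi_door_colored")
    if views:
        print(join_ref_images(views))
```

Rejecting paths that contain commas avoids silently splitting one file name into two entries when the flag is parsed.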
Acknowledgements
This project builds on the open-source efforts of Wan2.1 and Phantom. We thank their authors for releasing high-quality video generation backbones.
Citation
If you find MV-S2V useful for your research, please cite our paper and star this repo.
@inproceedings{song2026mvs2v,
title = {MV-S2V: Multi-View Subject-Consistent Video Generation},
author = {Song, Ziyang and Gong, Xinyu and Liu, Bangya and Zhao, Zelin},
booktitle = {ACM SIGGRAPH},
year = {2026}
}
Contact
For questions or collaboration, please open a GitHub issue or contact Ziyang Song.