MV-S2V: Multi-View Subject-Consistent Video Generation

Ziyang Song1, Xinyu Gong2, Bangya Liu3, Zelin Zhao4
1The Hong Kong Polytechnic University    2The University of Texas at Austin    3University of Wisconsin–Madison    4Georgia Institute of Technology

SIGGRAPH 2026

📖 Overview

MV-S2V is the first framework that synthesizes videos from multiple reference views of the same subject to enforce 3D-level subject consistency. Existing subject-to-video methods condition on a single reference image, so they must hallucinate unseen details when generating novel views. MV-S2V addresses this with three components:

  • 🧊 Multi-view conditioning: accepts an arbitrary number of reference views and produces a coherent video that respects all of them.
  • 🌀 Temporally-Shifted RoPE (TS-RoPE): a positional-encoding scheme that disambiguates cross-subject and cross-view references.
  • 📦 Synthetic + real-world dataset: a data curation pipeline for high-quality multi-view subject-to-video training data.

The model in this repo is a 14B Subject-to-Video DiT built on top of Wan2.1.

📑 Todo List

  • Inference code
  • Pretrained 14B checkpoint
  • Benchmark dataset (MV-S2V-Bench)
  • Evaluation suite

⚡️ Quickstart

Installation

git clone https://github.com/szy-young/mv-s2v.git
cd mv-s2v
# Ensure torch >= 2.4.0
pip install -r requirements.txt
# For multi-GPU inference
pip install "xfuser>=0.4.1"

Model Download

| Model | Resolution | Download |
| --- | --- | --- |
| MV-S2V-14B | 480p | 🤗 HuggingFace |

MV-S2V re-uses the VAE and T5 text encoder from Wan2.1-T2V-14B. Download the base checkpoint first:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B

Then place our DiT weights at ./checkpoints/diffusion_pytorch_model[.safetensors|.safetensors.index.json].
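
Before launching inference, it can help to confirm that the DiT weights sit where the path pattern above says they should. The following is a minimal sketch, not part of the repo: `find_dit_weights` is a hypothetical helper, and the only facts it relies on are the two suffixes named above (a single `.safetensors` file or a sharded `.safetensors.index.json`).

```python
from pathlib import Path
from typing import Optional

# Suffixes taken from the path pattern in this README; the helper is hypothetical.
DIT_SUFFIXES = (".safetensors", ".safetensors.index.json")


def find_dit_weights(prefix: str) -> Optional[Path]:
    """Return the first existing weight file for `prefix`, or None if absent."""
    for suffix in DIT_SUFFIXES:
        candidate = Path(prefix + suffix)
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_dit_weights("./checkpoints/diffusion_pytorch_model")
    if found is None:
        print("DiT weights not found under ./checkpoints/")
    else:
        print(f"Using DiT weights: {found}")
```

The prefix passed in matches the `DIT_CKPT` value used by infer.sh, so the same string can be checked and then handed to the launcher.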

Run Multi-View Subject-to-Video Generation

The simplest way to reproduce all examples in this repo:

bash infer.sh

infer.sh exposes two paths to edit:

WAN_CKPT="./Wan2.1-T2V-14B"               # Wan2.1 base (VAE + T5)
DIT_CKPT="./checkpoints/diffusion_pytorch_model"  # MV-S2V DiT weights

Single-prompt inference (multi-GPU FSDP + xDiT USP)

torchrun --nproc_per_node=2 --master-port 14435 generate.py \
    --task s2v-14B \
    --size 640*640 \
    --frame_num 121 --sample_fps 24 \
    --ckpt_dir ${WAN_CKPT} \
    --phantom_ckpt ${DIT_CKPT} \
    --dit_fsdp --t5_fsdp --ulysses_size 2 \
    --ref_image "examples/object_images/box_multi_door_colored/object_0.png,examples/object_images/box_multi_door_colored/object_1.png,examples/object_images/box_multi_door_colored/object_2.png,examples/object_images/box_multi_door_colored/object_3.png" \
    --prompt "The video starts with a small, wooden activity cube with colorful panels and metal locks, topped with a teal plastic handle, sitting on a child's playroom floor with a soft rug and scattered toys in the background. As the camera smoothly orbits around the activity cube, the background gradually reveals a low bookshelf filled with children's books and puzzles on one side, and a painted wall adorned with cheerful animal decals and a cozy reading nook with cushions on the other." \
    --save_file examples/videos/box_multi_door_colored.mp4 \
    --base_seed 42 \
    --rpe_mode ts_rope \
    --sample_guide_scale_img 2.5
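
The `--ref_image` value in the command above is a long comma-separated list that is easy to mistype. As a sketch, it could be assembled from a directory of views instead. `build_ref_image_arg` is a hypothetical helper (not shipped with the repo), and the `object_N.png` naming is inferred from the example paths only.

```python
import re
from pathlib import Path


def build_ref_image_arg(view_dir: str, pattern: str = "object_*.png") -> str:
    """Join all views in `view_dir` into one comma-separated string.

    Files are ordered by their trailing number so that object_10.png
    follows object_9.png rather than object_1.png.
    """
    def view_index(path: Path) -> int:
        match = re.search(r"(\d+)$", path.stem)
        return int(match.group(1)) if match else -1

    paths = sorted(Path(view_dir).glob(pattern), key=view_index)
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {view_dir}")
    return ",".join(p.as_posix() for p in paths)


if __name__ == "__main__":
    # Assumes the example assets from this repo are present locally.
    print(build_ref_image_arg("examples/object_images/box_multi_door_colored"))
```

The printed string can be pasted directly after `--ref_image` in the torchrun command.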

Key arguments

| Flag | Description |
| --- | --- |
| --ref_image | Comma-separated paths to reference images. Mix object and human views freely (≤ 4 object views + 1 optional human view). |
| --rpe_mode | RoPE scheme for reference tokens: vanilla, ss_rope (spatial-shift), or ts_rope (our Temporally-Shifted RoPE, recommended). |
| --sample_guide_scale_img | Image classifier-free guidance scale. We use 2.5 for object-centric and HOI (human-object interaction) scenes. |
| --view_number | Sub-sample views at inference time (-1 = use all). |
| --ulysses_size | xDiT Ulysses parallel degree (set equal to --nproc_per_node). |
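
The limits stated in the table can be checked before a multi-GPU job is launched. The sketch below is an illustration under assumptions: `check_settings` is a hypothetical helper, and since this README does not say how generate.py tells object views from human views, the caller passes the two lists explicitly.

```python
from typing import List, Sequence

MAX_OBJECT_VIEWS = 4  # "≤ 4 object views" from the table above
MAX_HUMAN_VIEWS = 1   # "1 optional human view" from the table above


def check_settings(
    object_views: Sequence[str],
    human_views: Sequence[str],
    nproc_per_node: int,
    ulysses_size: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not object_views and not human_views:
        problems.append("at least one reference image is required")
    if len(object_views) > MAX_OBJECT_VIEWS:
        problems.append(f"{len(object_views)} object views given, limit is {MAX_OBJECT_VIEWS}")
    if len(human_views) > MAX_HUMAN_VIEWS:
        problems.append(f"{len(human_views)} human views given, limit is {MAX_HUMAN_VIEWS}")
    if ulysses_size != nproc_per_node:
        problems.append("--ulysses_size should equal --nproc_per_node")
    return problems


if __name__ == "__main__":
    views = [f"object_{i}.png" for i in range(4)]
    for problem in check_settings(views, [], nproc_per_node=2, ulysses_size=2):
        print("warning:", problem)
```

Returning a list rather than raising on the first problem lets a wrapper script report every misconfiguration at once.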

🙏 Acknowledgements

This project builds on the open-source efforts of Wan2.1 and Phantom. We thank their authors for releasing high-quality video generation backbones.

⭐ Citation

If you find MV-S2V useful for your research, please cite our paper and ⭐ this repo.

@inproceedings{song2026mvs2v,
  title     = {MV-S2V: Multi-View Subject-Consistent Video Generation},
  author    = {Song, Ziyang and Gong, Xinyu and Liu, Bangya and Zhao, Zelin},
  booktitle = {ACM SIGGRAPH},
  year      = {2026}
}

📧 Contact

For questions or collaboration, please open a GitHub issue or contact Ziyang Song.
