MV-S2V: Multi-View Subject-Consistent Video Generation

Ziyang Song¹, Xinyu Gong², Bangya Liu³, Zelin Zhao⁴
¹The Hong Kong Polytechnic University ²The University of Texas at Austin ³University of Wisconsin–Madison ⁴Georgia Institute of Technology

SIGGRAPH 2026

📖 Overview

MV-S2V is the first framework that synthesizes videos from multiple reference views of the same subject to enforce 3D-level subject consistency. Existing subject-to-video methods condition on a single reference image, so they must hallucinate unseen details when generating novel views. MV-S2V addresses this with three components:

🧊 Multi-view conditioning — accepts an arbitrary number of reference views and produces a coherent video that respects all of them.
🌀 Temporally-Shifted RoPE (TS-RoPE) — a positional-encoding scheme that disambiguates cross-subject and cross-view references.
📦 Synthetic + real-world dataset — a data curation pipeline for high-quality multi-view subject-to-video training data.

The model in this repo is a 14B Subject-to-Video DiT built on top of Wan2.1.

📑 Todo List

Inference code
Pretrained 14B checkpoint
Benchmark dataset (MV-S2V-Bench)
Evaluation suite

⚡️ Quickstart

Installation

git clone https://github.com/szy-young/mv-s2v.git
cd mv-s2v
# Ensure torch >= 2.4.0
pip install -r requirements.txt
# For multi-GPU inference
pip install "xfuser>=0.4.1"
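The `torch >= 2.4.0` requirement above can be checked before installing the remaining dependencies. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `torch_is_new_enough` are hypothetical, and the check reads package metadata only, so it does not confirm that CUDA works.

```python
import re
from importlib import metadata

MIN_TORCH = (2, 4, 0)  # minimum version stated in the installation notes


def parse_version(text):
    """Return the leading numeric release as a 3-tuple, e.g. '2.4.0+cu121' -> (2, 4, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (as in '2.4') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def torch_is_new_enough(installed=None):
    """Compare a torch version string against MIN_TORCH.

    If `installed` is None, the version is looked up in the installed
    package metadata without importing torch itself.
    """
    if installed is None:
        try:
            installed = metadata.version("torch")
        except metadata.PackageNotFoundError:
            return False  # torch is not installed at all
    return parse_version(installed) >= MIN_TORCH


if __name__ == "__main__":
    print("torch OK" if torch_is_new_enough() else "torch missing or older than 2.4.0")
```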

Model Download

| Model | Resolution | Download |
| --- | --- | --- |
| MV-S2V-14B | 480p | 🤗 HuggingFace |

MV-S2V re-uses the VAE and T5 text encoder from Wan2.1-T2V-14B. Download the base checkpoint first:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B

Then place the MV-S2V DiT weights in ./checkpoints/ under the name diffusion_pytorch_model, that is, either diffusion_pytorch_model.safetensors or diffusion_pytorch_model.safetensors.index.json.
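Before launching inference, it can help to confirm that the weights sit where the path above expects them. This is a small sketch under assumptions: the function name `find_dit_weights` is hypothetical, and it checks only for the two file names listed above, not for the contents or for any shard files an index may refer to.

```python
from pathlib import Path


def find_dit_weights(ckpt_dir="./checkpoints", stem="diffusion_pytorch_model"):
    """Return the weight file or index file found under ckpt_dir, or None.

    Only the two names mentioned in this README are probed:
    <stem>.safetensors and <stem>.safetensors.index.json.
    """
    base = Path(ckpt_dir)
    for suffix in (".safetensors", ".safetensors.index.json"):
        candidate = base / f"{stem}{suffix}"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_dit_weights()
    print(f"found: {found}" if found else "no MV-S2V DiT weights in ./checkpoints")
```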

Run Multi-View Subject-to-Video Generation

The simplest way to reproduce all examples in this repo:

bash infer.sh

infer.sh exposes two paths to edit:

WAN_CKPT="./Wan2.1-T2V-14B"               # Wan2.1 base (VAE + T5)
DIT_CKPT="./checkpoints/diffusion_pytorch_model"  # MV-S2V DiT weights

Single-prompt inference (multi-GPU FSDP + xDiT USP)

torchrun --nproc_per_node=2 --master-port 14435 generate.py \
    --task s2v-14B \
    --size "640*640" \
    --frame_num 121 --sample_fps 24 \
    --ckpt_dir ${WAN_CKPT} \
    --phantom_ckpt ${DIT_CKPT} \
    --dit_fsdp --t5_fsdp --ulysses_size 2 \
    --ref_image "examples/object_images/box_multi_door_colored/object_0.png,examples/object_images/box_multi_door_colored/object_1.png,examples/object_images/box_multi_door_colored/object_2.png,examples/object_images/box_multi_door_colored/object_3.png" \
    --prompt "The video starts with a small, wooden activity cube with colorful panels and metal locks, topped with a teal plastic handle, sitting on a child's playroom floor with a soft rug and scattered toys in the background. As the camera smoothly orbits around the activity cube, the background gradually reveals a low bookshelf filled with children's books and puzzles on one side, and a painted wall adorned with cheerful animal decals and a cozy reading nook with cushions on the other." \
    --save_file examples/videos/box_multi_door_colored.mp4 \
    --base_seed 42 \
    --rpe_mode ts_rope \
    --sample_guide_scale_img 2.5
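When generating many videos, the command above can be assembled from Python instead of edited by hand. The sketch below reuses only the flags and values shown in the command; the function name `build_command` and its parameters are hypothetical and not part of the repository. Passing the arguments as a list avoids shell quoting problems with the prompt and with the `*` in `--size`.

```python
import shlex


def build_command(ref_images, prompt, save_file, nproc=2,
                  wan_ckpt="./Wan2.1-T2V-14B",
                  dit_ckpt="./checkpoints/diffusion_pytorch_model"):
    """Return the torchrun invocation from this README as an argument list."""
    return [
        "torchrun", f"--nproc_per_node={nproc}", "--master-port", "14435",
        "generate.py",
        "--task", "s2v-14B",
        "--size", "640*640",
        "--frame_num", "121", "--sample_fps", "24",
        "--ckpt_dir", wan_ckpt,
        "--phantom_ckpt", dit_ckpt,
        # --ulysses_size is kept equal to --nproc_per_node, as the README advises.
        "--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(nproc),
        "--ref_image", ",".join(ref_images),
        "--prompt", prompt,
        "--save_file", save_file,
        "--base_seed", "42",
        "--rpe_mode", "ts_rope",
        "--sample_guide_scale_img", "2.5",
    ]


if __name__ == "__main__":
    views = [f"examples/object_images/box_multi_door_colored/object_{i}.png"
             for i in range(4)]
    cmd = build_command(views, "The camera orbits a wooden activity cube.",
                        "examples/videos/box_multi_door_colored.mp4")
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

To actually launch the job, the list can be handed to `subprocess.run(cmd, check=True)` from the repository root.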

Key arguments

| Flag | Description |
| --- | --- |
| `--ref_image` | Comma-separated paths to reference images. Mix object and human views freely (≤ 4 object views + 1 optional human view). |
| `--rpe_mode` | RoPE scheme for reference tokens: `vanilla`, `ss_rope` (spatial-shift), or `ts_rope` (our Temporally-Shifted RoPE, recommended). |
| `--sample_guide_scale_img` | Image classifier-free guidance scale. We use 2.5 for object-centric and human-object interaction (HOI) scenes. |
| `--view_number` | Sub-sample views at inference time (`-1` = use all). |
| `--ulysses_size` | xDiT Ulysses parallel degree (set equal to `--nproc_per_node`). |
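The constraints in the table can be checked before a run. The sketch below is an illustration under assumptions rather than repository code: the helper names are hypothetical, the `object_*.png` pattern follows the example paths used earlier, and appending the human view last is an arbitrary choice, since the README does not state a required order.

```python
from pathlib import Path

MAX_OBJECT_VIEWS = 4  # limit stated in the --ref_image description


def build_ref_image_arg(object_dir, human_image=None):
    """Join sorted object views (plus an optional human view) into one --ref_image value."""
    views = sorted(str(p) for p in Path(object_dir).glob("object_*.png"))
    if not views:
        raise ValueError(f"no object_*.png files found in {object_dir}")
    if len(views) > MAX_OBJECT_VIEWS:
        raise ValueError(
            f"{len(views)} object views found, at most {MAX_OBJECT_VIEWS} are supported"
        )
    if human_image is not None:
        views.append(str(human_image))  # assumption: human view goes last
    return ",".join(views)


def check_parallel_setup(nproc_per_node, ulysses_size):
    """Raise if --ulysses_size differs from --nproc_per_node."""
    if ulysses_size != nproc_per_node:
        raise ValueError(
            f"--ulysses_size ({ulysses_size}) should equal "
            f"--nproc_per_node ({nproc_per_node})"
        )
```

For the example in this repo, `build_ref_image_arg("examples/object_images/box_multi_door_colored")` would yield the four `object_0.png` to `object_3.png` paths joined by commas, provided those files are present.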

🙏 Acknowledgements

This project builds on the open-source efforts of Wan2.1 and Phantom. We thank their authors for releasing high-quality video generation backbones.

⭐ Citation

If you find MV-S2V useful for your research, please cite our paper and ⭐ this repo.

@inproceedings{song2026mvs2v,
  title     = {MV-S2V: Multi-View Subject-Consistent Video Generation},
  author    = {Song, Ziyang and Gong, Xinyu and Liu, Bangya and Zhao, Zelin},
  booktitle = {ACM SIGGRAPH},
  year      = {2026}
}

📧 Contact

For questions or collaboration, please open a GitHub issue or contact Ziyang Song.

Paper for youngsong305/MV-S2V: 2601.17756 (published May 4)