Commit 7805d95 (parent: f5f3d40): Update readme

README.md CHANGED
max_words: /
task: image-to-video
---

# I2VGen-XL高清图像生成视频大模型

本项目**I2VGen-XL**旨在解决根据输入图像生成高清视频的任务。**I2VGen-XL**是由达摩院研发的高清视频生成基础模型,其核心部分包含两个阶段,分别解决语义一致性和清晰度的问题,参数量共计约37亿。模型经过大规模视频和图像数据混合预训练,并在少量精品数据上微调得到,该数据分布广泛、类别多样化,模型对不同的数据均有良好的泛化性。相比于现有的视频生成模型,**I2VGen-XL**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。

此外,**I2VGen-XL**的许多设计理念继承于我们已经公开的工作**VideoComposer**,您可以参考我们的[VideoComposer](https://videocomposer.github.io)和本项目的Github代码库了解详细细节。

The **I2VGen-XL** project aims to address the task of generating high-definition videos from input images. Developed by Alibaba Cloud, **I2VGen-XL** is a foundation model for high-definition video generation. Its core consists of two stages that address the issues of semantic consistency and clarity, totaling approximately 3.7 billion parameters. The model is pre-trained on a large-scale mix of video and image data and fine-tuned on a small amount of high-quality data with a wide distribution and diverse categories, so it generalizes well to different data types. Compared to existing video generation models, **I2VGen-XL** has significant advantages in clarity, texture, semantics, and temporal continuity.

Additionally, many of the design concepts for **I2VGen-XL** are inherited from our publicly available work, **VideoComposer**. For details, please refer to our [VideoComposer](https://videocomposer.github.io) page and the GitHub repository for this project.

<center>
<p align="center">
<img src="assets/image/Fig_twostage.png" style="max-width: none;"/>
<br/>
Fig.1 I2VGen-XL
<p>
</center>

## 模型介绍 (Introduction)

**I2VGen-XL**建立在Stable Diffusion之上,如图Fig.2所示,通过专门设计的时空UNet在隐空间中进行时空建模,并通过解码器重建出最终视频。为能够生成720P视频,我们将**I2VGen-XL**分为两个阶段:第一阶段保证语义一致性但分辨率较低,第二阶段通过DDIM逆运算并在新的VLDM上进行去噪,以提高视频分辨率,并同时提升时间和空间上的一致性。通过在模型、训练和数据上的联合优化,本项目主要具有以下几个特点:

- 高清&宽屏,可以直接生成720P(1280*720)分辨率的视频,且相比于现有的开源项目,不仅分辨率得到有效提高,其生成的宽屏视频可以适合更多的场景
- 无水印,模型通过我们内部大规模无水印视频/图像训练,并在高质量数据微调得到,生成的无水印视频可适用于更多视频平台,减少许多限制

以下为生成的部分案例:

**I2VGen-XL** is built on Stable Diffusion, as shown in Fig.2: it performs spatiotemporal modeling in the latent space with a specially designed spatiotemporal UNet and reconstructs the final video through the decoder. To generate 720P videos, **I2VGen-XL** is divided into two stages: the first stage guarantees semantic consistency at low resolution, while the second stage uses DDIM inversion and denoising on a new VLDM to increase the resolution and improve both temporal and spatial consistency. Through joint optimization of the model, training, and data, this project has the following characteristics:

- High-definition & widescreen: it can directly generate 720P (1280*720) videos; compared to existing open-source projects, not only is the resolution effectively improved, but the widescreen videos it produces also suit more scenarios.
- No watermark: the model is trained internally on a large-scale watermark-free video/image dataset and fine-tuned on high-quality data; the watermark-free videos it generates can be used on more video platforms with fewer restrictions.
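For intuition about the 720P widescreen target, the following is a minimal, purely illustrative sketch (the `fit_to_720p` helper is hypothetical and not the model's actual preprocessing) of how an arbitrary input image maps onto a 1280*720 canvas with its aspect ratio preserved:

```python
# Illustrative only: compute how an input image of a given size would be
# scaled to fit inside the model's 720P (1280x720) widescreen canvas,
# preserving aspect ratio and centering with padding.
def fit_to_720p(width: int, height: int, canvas_w: int = 1280, canvas_h: int = 720):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving fit."""
    scale = min(canvas_w / width, canvas_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (canvas_w - new_w) // 2, (canvas_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A square 512x512 input scales to 720x720 and is centered with side padding.
print(fit_to_720p(512, 512))  # (720, 720, 280, 0)
```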

Below are some examples generated by the model:

<center>
<p align="center">
<img src="assets/image/fig1_overview.jpg" style="max-width: none;"/>
<br/>
Fig.2 VLDM
<p>
</center>

<table><center>
<tr>
<td ><center>
<img src="assets/gif/dragon2_rank_02-00-0021-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/laoshu_rank_02-01-0810-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/ac10af0b1c524b778aff60be5b7ecc4f_2_02_00_0065_rank_02-00-1256-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/ast_rank_02-00-0773-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/e3733444344741f1970cf2e92e617182_1_02_00_0199.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/b307dad96c3d440e80514b1b3f3be5fd_1_rank_02-00-0068-000000.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/robot1_rank_02-01-0009-009999.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/d82ed4ad01034243ba88eaf9311c1edf_3_02_01_0193.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/airship_0_rank_02-00-000000_rank_02-00-0653-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/airship_1_rank_02-01-000000_rank_02-00-1428-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/0ba38f2f287f446dac8de87291073e0c_3_rank_02-01-0118-000000.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/03b401c825a2479eaf7b1b3252683a4b_3_02_00_0110_rank_02-00-1009-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/3e89356e6bd3470aaf3900b1b34c3ec2_0_rank_02-01-0126-000000.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/6fd21439fce644afa3a2e9b057956d0f_0000000_rank_02-01-0159-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/293fdf76aa404971b1fbb66baf9cbaac_1_02_00_0123_rank_02-00-0288-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/426a7bee22034a88872dc8277ddbbf06_0_02_01_0023_rank_02-01-1090-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/a15bb09862b74b3c983a54b379912f81_0_02_00_0055_rank_02-01-0443-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/7716d91802614bf9a99174c05bd08f32_3_02_01_0157_rank_02-01-1199-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/indian_rank_02-00-0800-001024.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/bike_rank_02-01-0007-001024.gif"/>
</center></td>
</tr>
<tr>
<td ><center>
<img src="assets/gif/panda_rank_02-01-0007-009999.gif"/>
</center></td>
<td ><center>
<img src="assets/gif/bf19a66dca0a47799923c47249982ffd_0000000_rank_02-01-0960-001024.gif"/>
</center></td>
</tr>
</table>
</center>

> [<font color="#dd0000">2023.08.25 更新</font>] ModelScope发布1.8.4版本,I2VGen-XL模型更新到模型参数文件 v1.1.0;
> [<font color="#dd0000">Update 2023.08.25</font>] ModelScope 1.8.4 released; the I2VGen-XL model weights were updated to v1.1.0.

### 依赖项 (Dependency)

```bash
sudo apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
```

其次,本**I2VGen-XL**项目适配ModelScope代码库,以下是本项目需要安装的部分依赖项。

The **I2VGen-XL** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.

```bash
pip install modelscope==1.8.4
pip install xformers==0.0.20
pip install torch==2.0.1
pip install open_clip_torch>=2.0.2
pip install opencv-python-headless
pip install fairscale
pip install scipy
pip install imageio
pip install pytorch-lightning
pip install torchsde
```

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

pipe = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0')

# IMG_PATH: your image path (url or local file)
output_video_path = pipe(IMG_PATH, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)
```
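As the comment above notes, `IMG_PATH` may be either a URL or a local file. The snippet below is a purely illustrative sketch (the `is_remote_image` helper is hypothetical and not part of the ModelScope API) of telling the two apart before invoking the pipeline:

```python
from urllib.parse import urlparse

# Hypothetical helper (not part of ModelScope): report whether an
# IMG_PATH-style input is a remote URL rather than a local file path.
def is_remote_image(img_path: str) -> bool:
    return urlparse(img_path).scheme in ("http", "https")

print(is_remote_image("https://example.com/test.jpg"))  # True
print(is_remote_image("./test.jpg"))                    # False
```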

如果想生成超分视频的话,示例见下:

If you want to generate a high-resolution (super-resolution) video, please use the following code:

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# If you only have one GPU, make sure its memory is larger than 50 GB;
# alternatively, use two GPUs and assign the pipelines to them via `device`.
pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0', device='cuda:0')

# image to video
output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]

# video super-resolution
p_input = {'video_path': output_video_path}
new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
```
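The `device` strings above follow PyTorch's `cuda:<index>` convention. As a small illustrative sketch (the `pick_devices` helper is hypothetical, not part of ModelScope), one way to choose devices for the two stages based on how many GPUs are available:

```python
# Hypothetical helper (not part of ModelScope): pick `device` strings for the
# two pipeline stages. With two or more GPUs, put each stage on its own card;
# with a single large-memory GPU, both stages share cuda:0.
def pick_devices(gpu_count: int) -> tuple:
    if gpu_count >= 2:
        return "cuda:0", "cuda:1"
    return "cuda:0", "cuda:0"

print(pick_devices(2))  # ('cuda:0', 'cuda:1')
print(pick_devices(1))  # ('cuda:0', 'cuda:0')
```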

更多超分细节,请访问 <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a>。我们也提供了用户接口,请移步 <a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>。

Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a> for more details on super-resolution. We also provide a user interface: <a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>.

### 模型局限 (Limitation)

本**I2VGen-XL**项目的模型在处理以下情况时会存在局限性:
- 小目标生成能力有限,在生成较小目标的时候,会存在一定的错误
- 快速运动目标生成能力有限,当生成快速运动目标时,会存在一定的假象
- 生成速度较慢,生成高清视频会明显导致生成速度减慢

此外,我们研究也发现,生成的视频空间上的质量和时序上的变化速度在一定程度上存在互斥现象,在本项目中我们选择了其折中的模型,兼顾两者的平衡。

The model of the **I2VGen-XL** project has limitations in the following scenarios:
- Limited ability to generate small objects: there may be some errors when generating smaller objects.
- Limited ability to generate fast-moving objects: there may be some artifacts when generating fast-moving objects.
- Slow generation speed: generating high-definition videos significantly slows down generation.

In addition, our research also found that the spatial quality of the generated video and the speed of its temporal variation are, to some extent, mutually exclusive; in this project we chose a compromise model that balances the two.