Text-to-Video
Diffusers
Safetensors
I2VGenXLPipeline
image-to-video
StevenZhang committed on
Commit
b788d6f
·
1 Parent(s): 1486a26

Upload 10 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ doc/introduction.pdf filter=lfs diff=lfs merge=lfs -text
37
+ source/i2vgen_fig_04.png filter=lfs diff=lfs merge=lfs -text
README.MD ADDED
@@ -0,0 +1,283 @@
1
+ # VGen
2
+
3
+
4
+ ![figure1](source/VGen.jpg "figure1")
5
+
6
+ VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. This repository includes implementations of the following methods:
7
+
8
+
9
+ - [I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://i2vgen-xl.github.io/)
10
+ - [VideoComposer: Compositional Video Synthesis with Motion Controllability](https://videocomposer.github.io/)
11
+ - [Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation](https://higen-t2v.github.io/)
12
+ - [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos]()
13
+ - [InstructVideo: Instructing Video Diffusion Models with Human Feedback]()
14
+ - [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://dreamvideo-t2v.github.io/)
15
+ - [VideoLCM: Video Latent Consistency Model](https://arxiv.org/abs/2312.09109)
16
+ - [Modelscope text-to-video technical report](https://arxiv.org/abs/2308.06571)
17
+
18
+
19
+ VGen can produce high-quality videos from input text, images, desired motion, desired subjects, and even feedback signals. It also offers a variety of commonly used video generation tools, such as visualization, sampling, training, inference, joint training on images and videos, acceleration, and more.
20
+
21
+
22
+ <a href='https://i2vgen-xl.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2311.04145'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/XUi0y7dxqEQ) <a href='https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441039979087.mp4'><img src='source/logo.png'></a>
23
+
24
+
25
+ ## 🔥News!!!
26
+ - __[2023.12]__ We release the high-efficiency video generation method [VideoLCM](https://arxiv.org/abs/2312.09109)
27
+ - __[2023.12]__ We release the code and model of I2VGen-XL and the ModelScope T2V
28
+ - __[2023.12]__ We release the T2V method [HiGen](https://higen-t2v.github.io) and customizing T2V method [DreamVideo](https://dreamvideo-t2v.github.io).
29
+ - __[2023.12]__ We wrote an [introduction document](doc/introduction.pdf) for VGen that also compares I2VGen-XL with SVD.
30
+ - __[2023.11]__ We release a high-quality I2VGen-XL model. Please refer to the [Webpage](https://i2vgen-xl.github.io).
31
+
32
+
33
+ ## TODO
34
+ - [x] Release the technical papers and webpage of [I2VGen-XL](doc/i2vgen-xl.md)
35
+ - [x] Release the code and pretrained models that can generate 1280x720 videos
36
+ - [ ] Release models optimized specifically for the human body and faces
37
+ - [ ] Release an updated version that fully preserves identity (ID) while capturing large and accurate motions
38
+ - [ ] Release other methods and the corresponding models
39
+
40
+
41
+ ## Preparation
42
+
43
+ The main features of VGen are as follows:
44
+ - Expandability, allowing for easy management of your own experiments.
45
+ - Completeness, encompassing all common components for video generation.
46
+ - Excellent performance, featuring powerful pre-trained models in multiple tasks.
47
+
48
+
49
+ ### Installation
50
+
51
+ ```
52
+ conda create -n vgen python=3.8
53
+ conda activate vgen
54
+ pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
55
+ pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
56
+ ```
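After installing, it can be useful to confirm that the pinned PyTorch packages are the ones actually present in the environment. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of VGen, and the pins are simply copied from the install command above.

```python
from importlib import metadata

# Versions copied from the pip command above (build tags such as +cu113 omitted).
PINNED = {
    "torch": "1.12.0",
    "torchvision": "0.13.0",
    "torchaudio": "0.12.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, ok)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu113" are ignored for the comparison.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A package that is not installed is reported with `None` as its installed version rather than raising, so the script can be run before or after the `pip install` step.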
57
+
58
+ ### Datasets
59
+
60
+ We have provided a **demo dataset** that includes images and videos, along with their lists in ``data``.
61
+
62
+ *Please note that the demo images used here are for testing purposes and were not included in the training.*
63
+
64
+
65
+ ### Clone the codebase
66
+
67
+ ```
68
+ git clone https://github.com/damo-vilab/i2vgen-xl.git
69
+ cd i2vgen-xl
70
+ ```
71
+
72
+
73
+ ## Getting Started with VGen
74
+
75
+ ### (1) Train your text-to-video model
76
+
77
+
78
+ Enabling distributed training is as simple as running the following command.
79
+ ```
80
+ python train_net.py --cfg configs/t2v_train.yaml
81
+ ```
82
+
83
+ In the `t2v_train.yaml` configuration file, you can specify the data, adjust the video-to-image ratio using `frame_lens`, try out different Diffusion settings, and so on.
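To make the idea of a video-to-image ratio concrete, here is a small sketch. It rests on an assumption that is not confirmed by this README: that `frame_lens` is a list of per-slot frame counts, where an entry of 1 denotes a single image and any larger entry denotes a video clip. The helper `video_image_ratio` and the example list are hypothetical; check `t2v_train.yaml` for the real semantics.

```python
from collections import Counter


def video_image_ratio(frame_lens):
    """Split a list of frame counts into image and video shares.

    Assumption for this sketch: an entry of 1 is a single image, and any
    larger entry is a video clip with that many frames.
    """
    if not frame_lens:
        raise ValueError("frame_lens must not be empty")
    counts = Counter("image" if n == 1 else "video" for n in frame_lens)
    total = len(frame_lens)
    return {kind: counts[kind] / total for kind in ("image", "video")}


# Hypothetical setting: two image slots and six video slots.
example = [1, 1, 16, 16, 16, 16, 32, 32]
print(video_image_ratio(example))
```

Under that assumption, adding or removing entries equal to 1 is what shifts the balance between image and video training.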
84
+
85
+ - Before training, you can download any of our open-source models for initialization. The codebase supports custom initialization and `grad_scale` settings, both of which are configured under the `Pretrain` item in the YAML file.
86
+ - During training, you can view the saved models and intermediate inference results in the `workspace/experiments/t2v_train` directory.
87
+
88
+ After the training is completed, you can perform inference on the model using the following command.
89
+ ```
90
+ python inference.py --cfg configs/t2v_infer.yaml
91
+ ```
92
+ Then you can find the videos you generated in the `workspace/experiments/test_img_01` directory. For specific configurations such as data, models, seed, etc., please refer to the `t2v_infer.yaml` file.
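Once inference finishes, a small helper like the one below can confirm that output actually landed in that directory. It is a sketch under assumptions: the `.mp4` and `.gif` suffixes and the helper name `list_outputs` are illustrative and not part of VGen.

```python
from pathlib import Path

# Assumed output formats; adjust to whatever the inference config writes.
VIDEO_SUFFIXES = {".mp4", ".gif"}


def list_outputs(out_dir):
    """Return generated video files under out_dir, newest first."""
    root = Path(out_dir)
    if not root.is_dir():
        return []
    files = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in VIDEO_SUFFIXES
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


# Directory name taken from the text above.
for path in list_outputs("workspace/experiments/test_img_01"):
    print(path)
```

If the directory does not exist yet, the helper returns an empty list instead of raising, so it is safe to run before the first inference.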
93
+
94
+ <!-- <table>
95
+ <center>
96
+ <tr>
97
+ <td ><center>
98
+ <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4"></video>
99
+ </center></td>
100
+ <td ><center>
101
+ <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4"></video>
102
+ </center></td>
103
+ </tr>
104
+ </center>
105
+ </table>
106
+ </center> -->
107
+
108
+ <table>
109
+ <center>
110
+ <tr>
111
+ <td ><center>
112
+ <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01Ya2I5I25utrJwJ9Jf_!!6000000007587-2-tps-1280-720.png"></image>
113
+ </center></td>
114
+ <td ><center>
115
+ <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01CrmYaz1zXBetmg3dd_!!6000000006723-2-tps-1280-720.png"></image>
116
+ </center></td>
117
+ </tr>
118
+ <tr>
119
+ <td ><center>
120
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4">HERE</a> to view the generated video.</p>
121
+ </center></td>
122
+ <td ><center>
123
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4">HERE</a> to view the generated video.</p>
124
+ </center></td>
125
+ </tr>
126
+ </center>
127
+ </table>
128
+ </center>
129
+
130
+
131
+ ### (2) Run the I2VGen-XL model
132
+
133
+ (i) Download model and test data:
134
+ ```
135
+ # Install the dependency first: `pip install modelscope` in a shell, or `!pip install modelscope` in a notebook
136
+ from modelscope.hub.snapshot_download import snapshot_download
137
+ model_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')
138
+ ```
139
+
140
+ (ii) Run the following command:
141
+ ```
142
+ python inference.py --cfg configs/i2vgen_xl_infer.yaml
143
+ ```
144
+ After a few minutes, the generated high-definition video will be available in the `workspace/experiments/test_img_01` directory. The current model performs poorly on **anime images** and **images with a black background** due to a lack of relevant training data. We are continuously working to improve it.
145
+
146
+
147
+ <span style="color:red">Because the GIF previews are compressed, please click 'HERE' below each one to view the original video.</span>
148
+
149
+ <center>
150
+ <table>
151
+ <center>
152
+ <tr>
153
+ <td ><center>
154
+ <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01CCEq7K1ZeLpNQqrWu_!!6000000003219-0-tps-1280-720.jpg"></image>
155
+ </center></td>
156
+ <td ><center>
157
+ <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4"></video> -->
158
+ <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01hIQcvG1spmQMLqBo0_!!6000000005816-1-tps-1280-704.gif"></image>
159
+ </center></td>
160
+ </tr>
161
+ <tr>
162
+ <td ><center>
163
+ <p>Input Image</p>
164
+ </center></td>
165
+ <td ><center>
166
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4">HERE</a> to view the generated video.</p>
167
+ </center></td>
168
+ </tr>
169
+ <tr>
170
+ <td ><center>
171
+ <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01ZXY7UN23K8q4oQ3uG_!!6000000007236-2-tps-1280-720.png"></image>
172
+ </center></td>
173
+ <td ><center>
174
+ <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4"></video> -->
175
+ <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01iaSiiv1aJZURUEY53_!!6000000003309-1-tps-1280-704.gif"></image>
176
+ </center></td>
177
+ </tr>
178
+ <tr>
179
+ <td ><center>
180
+ <p>Input Image</p>
181
+ </center></td>
182
+ <td ><center>
183
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4">HERE</a> to view the generated video.</p>
184
+ </center></td>
185
+ </tr>
186
+ <tr>
187
+ <td ><center>
188
+ <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01NHpVGl1oat4H54Hjf_!!6000000005242-2-tps-1280-720.png"></image>
189
+ </center></td>
190
+ <td ><center>
191
+ <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4"></video> -->
192
+ <!-- <image muted="true" height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
193
+ -->
194
+ <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
195
+ </center></td>
196
+ </tr>
197
+ <tr>
198
+ <td ><center>
199
+ <p>Input Image</p>
200
+ </center></td>
201
+ <td ><center>
202
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4">HERE</a> to view the generated video.</p>
203
+ </center></td>
204
+ </tr>
205
+ <tr>
206
+ <td ><center>
207
+ <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01odS61s1WW9tXen21S_!!6000000002795-0-tps-1280-720.jpg"></image>
208
+ </center></td>
209
+ <td ><center>
210
+ <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4"></video> -->
211
+ <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01Jyk1HT28JkZtpAtY6_!!6000000007912-1-tps-1280-704.gif"></image>
212
+ </center></td>
213
+ </tr>
214
+ <tr>
215
+ <td ><center>
216
+ <p>Input Image</p>
217
+ </center></td>
218
+ <td ><center>
219
+ <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4">HERE</a> to view the generated video.</p>
220
+ </center></td>
221
+ </tr>
222
+ </center>
223
+ </table>
224
+ </center>
225
+
226
+ ### (3) Other methods
227
+
228
+ In preparation.
229
+
230
+
231
+ ## Customize your own approach
232
+
233
+ Our codebase supports essentially all of the commonly used components in video generation. You can manage your experiments flexibly by adding the corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, DISTRIBUTION, VISUAL, DIFFUSION, PRETRAIN`, and combine them with any of our open-source algorithms as your needs require. If you have any questions, feel free to send us your feedback at any time.
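The registration idea described above can be illustrated with a generic registry pattern. This is a minimal sketch, not VGen's actual implementation: the `Registry` class, its `register_class` and `build` methods, the `type` key in the config, and the `TinyUNet` component are all hypothetical names chosen for illustration.

```python
class Registry:
    """Minimal name -> class registry (illustrative, not VGen's actual code)."""

    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register_class(self, cls):
        """Use as a decorator to register `cls` under its class name."""
        if cls.__name__ in self._classes:
            raise KeyError(f"{cls.__name__} is already registered in {self.name}")
        self._classes[cls.__name__] = cls
        return cls

    def build(self, cfg):
        """Instantiate the class named by cfg['type'] with the remaining keys."""
        kwargs = dict(cfg)
        cls_name = kwargs.pop("type")
        if cls_name not in self._classes:
            raise KeyError(f"{cls_name} is not registered in {self.name}")
        return self._classes[cls_name](**kwargs)


MODEL = Registry("MODEL")  # one registry per component kind


@MODEL.register_class
class TinyUNet:  # hypothetical component
    def __init__(self, in_dim=4, out_dim=4):
        self.in_dim = in_dim
        self.out_dim = out_dim


# A config entry selects the component by name and passes its arguments.
model = MODEL.build({"type": "TinyUNet", "in_dim": 8})
print(type(model).__name__, model.in_dim, model.out_dim)
```

The appeal of this design is that a new component only needs a decorator and a config entry; the training or inference entry point does not have to import it by name.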
234
+
235
+
236
+
237
+ ## BibTeX
238
+
239
+ If this repo is useful to you, please cite our corresponding technical paper.
240
+
241
+
242
+ ```bibtex
243
+ @article{2023i2vgenxl,
244
+ title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
245
+ author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
246
+ journal={arXiv preprint arXiv:2311.04145},
247
+ year={2023}
248
+ }
249
+ @article{2023videocomposer,
250
+ title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
251
+ author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
252
+ journal={arXiv preprint arXiv:2306.02018},
253
+ year={2023}
254
+ }
255
+ @article{wang2023modelscope,
256
+ title={Modelscope text-to-video technical report},
257
+ author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
258
+ journal={arXiv preprint arXiv:2308.06571},
259
+ year={2023}
260
+ }
261
+ @article{dreamvideo,
262
+ title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},
263
+ author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},
264
+ journal={arXiv preprint arXiv:2312.04433},
265
+ year={2023}
266
+ }
267
+ @article{qing2023higen,
268
+ title={Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation},
269
+ author={Qing, Zhiwu and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Wei, Yujie and Zhang, Yingya and Gao, Changxin and Sang, Nong},
270
+ journal={arXiv preprint arXiv:2312.04483},
271
+ year={2023}
272
+ }
273
+ @article{wang2023videolcm,
274
+ title={VideoLCM: Video Latent Consistency Model},
275
+ author={Wang, Xiang and Zhang, Shiwei and Zhang, Han and Liu, Yu and Zhang, Yingya and Gao, Changxin and Sang, Nong},
276
+ journal={arXiv preprint arXiv:2312.09109},
277
+ year={2023}
278
+ }
279
+ ```
280
+
281
+ ## Disclaimer
282
+
283
+ This open-source model was trained using the [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
doc/.DS_Store ADDED
Binary file (6.15 kB).
 
doc/i2vgen-xl.md ADDED
@@ -0,0 +1,19 @@
1
+ # I2VGen-XL
2
+
3
+ Official repo for [I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://arxiv.org/abs/2311.04145).
4
+
5
+ Please see [Project Page](https://i2vgen-xl.github.io) for more examples.
6
+
7
+
8
+ ![method](../source/i2vgen_fig_02.jpg "method")
9
+
10
+
11
+ I2VGen-XL can generate high-quality, realistically animated, and temporally coherent high-definition videos from a single static input image, guided by user input.
12
+
13
+
14
+ *Our initial version has already been open-sourced on [Modelscope](https://modelscope.cn/models/damo/Image-to-Video/summary). This project focuses on improving on that version, especially in terms of motion and semantics.*
15
+
16
+ ## Examples
17
+
18
+ ![figure2](../source/i2vgen_fig_04.png "figure2")
19
+
doc/introduction.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f4d416283eb95212e1fd45c2d02045a836160929fd15e7120dd77998380c7656
3
+ size 4857845
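The three lines above are a Git LFS pointer: the repository stores this small text stub while the real PDF lives in LFS storage. As a rough illustration, the sketch below parses such a pointer into its fields; the function name `parse_lfs_pointer` and the returned dictionary keys are hypothetical, and the pointer text is copied from the diff above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f4d416283eb95212e1fd45c2d02045a836160929fd15e7120dd77998380c7656
size 4857845
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

The `size` field is the byte count of the real file, so it can be compared against a downloaded copy as a quick integrity check before hashing.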
source/VGen.jpg ADDED
source/fig_vs_vgen.jpg ADDED
source/i2vgen_fig_01.jpg ADDED
source/i2vgen_fig_02.jpg ADDED
source/i2vgen_fig_04.png ADDED

Git LFS Details

  • SHA256: 988fdb6e39703c08718892022656f4875806473aca90e4cec1bbf5cf59e75165
  • Pointer size: 132 Bytes
  • Size of remote file: 4.13 MB
source/logo.png ADDED