camenduru committed on
Commit
f41002b
1 Parent(s): 9d5d436

thanks to damo-vilab ❤

README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ license: cc-by-nc-4.0
+ tags:
+ - text-to-video
+ duplicated_from: diffusers/text-to-video-ms-1.7b
+ ---
+
+ # Text-to-video-synthesis Model in Open Domain
+
+ This model is based on a multi-stage text-to-video generation diffusion model, which takes a text description as input and returns a video that matches it. Only English input is supported.
+
+ **We Are Hiring!** (Based in Beijing / Hangzhou, China.)
+
+ If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.
+
+ EMAIL: yingya.zyy@alibaba-inc.com
+
+ ## Model description
+
+ The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text-feature-to-video-latent-space diffusion model, and a video-latent-space-to-video-visual-space model. The model has about 1.7 billion parameters in total. Currently, it only supports English input. The diffusion model adopts a UNet3D structure and generates video by iteratively denoising a video of pure Gaussian noise.
+
+ This model is meant for research purposes. Please look at the [model limitations and biases](#model-limitations-and-biases) and [misuse, malicious use and excessive use](#misuse-malicious-use-and-excessive-use) sections.
+
+ ## Model Details
+
+ - **Developed by:** [ModelScope](https://modelscope.cn/)
+ - **Model type:** Diffusion-based text-to-video generation model
+ - **Language(s):** English
+ - **License:** [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
+ - **Resources for more information:** [ModelScope GitHub Repository](https://github.com/modelscope/modelscope), [Summary](https://modelscope.cn/models/damo/text-to-video-synthesis/summary).
+ - **Cite as:**
+
+ ## Use cases
+
+ This model has a wide range of applications and can reason and generate videos based on arbitrary English text descriptions.
+
+ ## Usage
+
+ Let's first install the libraries required:
+
+ ```bash
+ $ pip install diffusers transformers accelerate
+ ```
+
+ Now, generate a video:
+
+ ```python
+ import torch
+ from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ # load the half-precision (fp16) weights
+ pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+ # swap the default scheduler for DPM-Solver multistep, reusing its config
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ # keep idle sub-models on the CPU to reduce GPU memory usage
+ pipe.enable_model_cpu_offload()
+
+ prompt = "Spiderman is surfing"
+ video_frames = pipe(prompt, num_inference_steps=25).frames
+ # write the frames to a video file and keep its path
+ video_path = export_to_video(video_frames)
+ ```
+
+ Here are some results:
+
+ <table>
+ <tr>
+ <td><center>
+ An astronaut riding a horse.
+ <br>
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astr.gif"
+ alt="An astronaut riding a horse."
+ style="width: 300px;" />
+ </center></td>
+ <td ><center>
+ Darth vader surfing in waves.
+ <br>
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vader.gif"
+ alt="Darth vader surfing in waves."
+ style="width: 300px;" />
+ </center></td>
+ </tr>
+ </table>
+
+ ## Long Video Generation
+
+ You can optimize for memory usage by enabling attention and VAE slicing and using Torch 2.0.
+ This should allow you to generate videos up to 25 seconds on less than 16GB of GPU VRAM.
+
+ ```bash
+ $ pip install git+https://github.com/huggingface/diffusers transformers accelerate
+ ```
+
+ ```py
+ import torch
+ from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ # load pipeline
+ pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+
+ # optimize for GPU memory
+ pipe.enable_model_cpu_offload()
+ pipe.enable_vae_slicing()
+
+ # generate
+ prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
+ video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
+
+ # convert to video
+ video_path = export_to_video(video_frames)
+ ```
+
+
+ ## View results
+
+ The code above stores the save path of the output video in `video_path`. The output mp4 file can be played with [VLC media player](https://www.videolan.org/vlc/); some other media players may not play it correctly.
+
+ ## Model limitations and biases
+
+ * The model is trained on public datasets such as Webvid, so the generated results may reflect biases in the distribution of the training data.
+ * This model cannot generate video of perfect film and television quality.
+ * The model cannot generate clear text.
+ * The model is mainly trained on an English corpus and does not support other languages at the moment.
+ * The performance of this model needs to be improved on complex compositional generation tasks.
+
+ ## Misuse, Malicious Use and Excessive Use
+
+ * The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
+ * It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
+ * It is prohibited to generate pornographic, violent, or gory content.
+ * It is prohibited to generate erroneous or false information.
+
+ ## Training data
+
+ The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [Webvid](https://m-bain.github.io/webvid-dataset/) and other public datasets. After pre-training, images and videos are filtered by criteria such as aesthetic score, watermark score, and deduplication.
+
+ _(Part of this model card has been taken from [here](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis))_
+
+ ## Citation
+
+ ```bibtex
+ @InProceedings{VideoFusion,
+     author = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
+     title = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
+     booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+     month = {June},
+     year = {2023}
+ }
+ ```
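The long-video example in the README pairs `num_frames=200` with a figure of about 25 seconds. A minimal sketch of that arithmetic follows, assuming playback at 8 frames per second. The frame rate is an assumption (the README does not state it), and `clip_seconds` and `frames_for` are hypothetical helpers, not part of diffusers.

```python
def clip_seconds(num_frames: int, fps: int = 8) -> float:
    """Return the clip duration in seconds for a given frame count.

    fps=8 is an assumed playback rate, not a value stated in the README.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def frames_for(seconds: float, fps: int = 8) -> int:
    """Return the number of frames needed to cover `seconds` at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(seconds * fps)


# Duration of the 200-frame example at the assumed frame rate.
print(clip_seconds(200))
# Frame count to request for a 25-second clip at the assumed frame rate.
print(frames_for(25))
```

If the exported video plays at a different frame rate, the same helpers apply with a different `fps` argument.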
model_index.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_class_name": "TextToVideoSDPipeline",
+   "_diffusers_version": "0.15.0.dev0",
+   "scheduler": [
+     "diffusers",
+     "DDIMScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "CLIPTextModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "CLIPTokenizer"
+   ],
+   "unet": [
+     "diffusers",
+     "UNet3DConditionModel"
+   ],
+   "vae": [
+     "diffusers",
+     "AutoencoderKL"
+   ]
+ }
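`model_index.json` maps each pipeline component to the library and class that loads it. The sketch below reads that mapping with the standard library only; it illustrates the file layout and is not how diffusers itself resolves components. `list_components` is a hypothetical helper, and the JSON is copied from the file above.

```python
import json

# Contents of model_index.json as added in this commit.
MODEL_INDEX = """
{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": ["diffusers", "DDIMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "tokenizer": ["transformers", "CLIPTokenizer"],
  "unet": ["diffusers", "UNet3DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
"""


def list_components(index_text: str) -> dict:
    """Map component name -> (library, class name).

    Keys starting with an underscore are treated as metadata and skipped.
    """
    index = json.loads(index_text)
    return {
        name: tuple(value)
        for name, value in index.items()
        if not name.startswith("_")
    }


components = list_components(MODEL_INDEX)
for name, (library, class_name) in sorted(components.items()):
    print(f"{name}: {library}.{class_name}")
```

Each component name also matches a subfolder added in this commit (`scheduler/`, `text_encoder/`, `tokenizer/`, `unet/`, `vae/`).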
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "_class_name": "DDIMScheduler",
+   "_diffusers_version": "0.15.0.dev0",
+   "beta_end": 0.012,
+   "beta_schedule": "scaled_linear",
+   "beta_start": 0.00085,
+   "clip_sample": false,
+   "clip_sample_range": 1.0,
+   "dynamic_thresholding_ratio": 0.995,
+   "num_train_timesteps": 1000,
+   "prediction_type": "epsilon",
+   "sample_max_value": 1.0,
+   "set_alpha_to_one": false,
+   "skip_prk_steps": true,
+   "steps_offset": 1,
+   "thresholding": false,
+   "trained_betas": null
+ }
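The scheduler config sets `beta_schedule` to `scaled_linear` with `beta_start`, `beta_end`, and `num_train_timesteps`. A minimal pure-Python sketch of such a schedule follows, assuming `scaled_linear` means interpolating linearly between the square roots of the two endpoints and squaring the result. `scaled_linear_betas` is a hypothetical helper written for illustration, not the library's implementation.

```python
def scaled_linear_betas(beta_start: float, beta_end: float, steps: int) -> list:
    """Return `steps` betas that are linear in sqrt(beta), then squared.

    Assumption: this matches what the config calls "scaled_linear".
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    return [(lo + (hi - lo) * i / (steps - 1)) ** 2 for i in range(steps)]


# Values taken from scheduler/scheduler_config.json.
betas = scaled_linear_betas(0.00085, 0.012, 1000)

# Cumulative product of (1 - beta): the fraction of signal kept at each step.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

print(f"first beta: {betas[0]:.6f}, last beta: {betas[-1]:.6f}")
print(f"signal kept after the final step: {alphas_cumprod[-1]:.6f}")
```

The betas grow monotonically, so the cumulative product shrinks toward zero over the 1000 training timesteps.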
text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "/home/patrick_huggingface_co/ms-text-to-video-sd/text_encoder",
+   "architectures": [
+     "CLIPTextModel"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "dropout": 0.0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_size": 1024,
+   "initializer_factor": 1.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 77,
+   "model_type": "clip_text_model",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 23,
+   "pad_token_id": 1,
+   "projection_dim": 512,
+   "torch_dtype": "float16",
+   "transformers_version": "4.27.0.dev0",
+   "vocab_size": 49408
+ }
text_encoder/model.fp16.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6ab158327d06ce861b5c78843672c78433ea82fcf03f142097ba204b81251cd2
+ size 680821102
text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1e4aa519f64dc6386f88221a66c106a09fa027b47a20cc0e126687695f2a6669
+ size 1361597016
text_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2188379b05015f531d61503e714234d00a64939792f3098b324e516547f0194f
+ size 1361674657
text_encoder/pytorch_model.fp16.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7bb11b1da63986aaaaefb5ef2100d34109c024ac640cacd9ed697150c1c57f01
+ size 680900852
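The weight files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. The sketch below parses two of the text encoder pointers shown above and compares their sizes. `parse_lfs_pointer` is a hypothetical helper that handles only the three fields seen here, not the full pointer specification.

```python
# Pointer contents copied from text_encoder/model.fp16.safetensors
# and text_encoder/model.safetensors in this commit.
POINTER_FP16 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6ab158327d06ce861b5c78843672c78433ea82fcf03f142097ba204b81251cd2
size 680821102
"""

POINTER_FP32 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1e4aa519f64dc6386f88221a66c106a09fa027b47a20cc0e126687695f2a6669
size 1361597016
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The size field is a byte count, so convert it to an integer.
    fields["size"] = int(fields["size"])
    return fields


fp16 = parse_lfs_pointer(POINTER_FP16)
fp32 = parse_lfs_pointer(POINTER_FP32)
ratio = fp16["size"] / fp32["size"]
print(f"fp16 / full-precision size ratio: {ratio:.3f}")
```

The ratio comes out close to one half, which is consistent with the `fp16` variant storing two bytes per weight instead of four.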
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "!",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "do_lower_case": true,
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "model_max_length": 77,
+   "pad_token": "<|endoftext|>",
+   "special_tokens_map_file": "./special_tokens_map.json",
+   "tokenizer_class": "CLIPTokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
unet/config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_class_name": "UNet3DConditionModel",
+   "_diffusers_version": "0.15.0.dev0",
+   "_name_or_path": "/home/patrick_huggingface_co/.cache/huggingface/hub/models--damo-vilab--text-to-video-ms-1.7b/snapshots/32aa057809b033d6f3ca31da4b0ade1fa7904654/unet",
+   "act_fn": "silu",
+   "attention_head_dim": 64,
+   "block_out_channels": [
+     320,
+     640,
+     1280,
+     1280
+   ],
+   "cross_attention_dim": 1024,
+   "down_block_types": [
+     "CrossAttnDownBlock3D",
+     "CrossAttnDownBlock3D",
+     "CrossAttnDownBlock3D",
+     "DownBlock3D"
+   ],
+   "in_channels": 4,
+   "layers_per_block": 2,
+   "norm_eps": 1e-05,
+   "norm_num_groups": 32,
+   "out_channels": 4,
+   "sample_size": 32,
+   "up_block_types": [
+     "UpBlock3D",
+     "CrossAttnUpBlock3D",
+     "CrossAttnUpBlock3D",
+     "CrossAttnUpBlock3D"
+   ]
+ }
unet/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a6f519b8cca193d6b15e9e33f5011dcb670de1ee8a9a12e8498ca586a7307651
+ size 5645561389
unet/diffusion_pytorch_model.fp16.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f094ebff52035921dacc07b5c4d495b744f4e1336c6cac96dd8f38a2a28a123c
+ size 2823097059
unet/diffusion_pytorch_model.fp16.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f972e78d39735baa09aab1cf60dd19f213e32f1c5a89382385e1a7f647b2e537
+ size 2822650226
unet/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1296dbed3928b7ecf7da12dda03c4ff10f1cdf7303ff86993529fe8388a3e333
+ size 5645118394
vae/config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "_class_name": "AutoencoderKL",
+   "_diffusers_version": "0.15.0.dev0",
+   "_name_or_path": "/home/patrick_huggingface_co/ms-text-to-video-sd/vae",
+   "act_fn": "silu",
+   "block_out_channels": [
+     128,
+     256,
+     512,
+     512
+   ],
+   "down_block_types": [
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D"
+   ],
+   "in_channels": 3,
+   "latent_channels": 4,
+   "layers_per_block": 2,
+   "norm_num_groups": 32,
+   "out_channels": 3,
+   "sample_size": 512,
+   "scaling_factor": 0.18215,
+   "up_block_types": [
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D"
+   ]
+ }
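The VAE and UNet configs together determine how latent resolution relates to pixel resolution. The sketch below works through that arithmetic, assuming every VAE encoder stage after the first halves the spatial resolution, which is the convention diffusers pipelines use to derive their scale factor. It also checks that the text encoder width matches the UNet cross-attention width. `vae_scale_factor` is a hypothetical helper; the constants are copied from the configs above.

```python
# Values copied from vae/config.json, unet/config.json and text_encoder/config.json.
VAE_BLOCK_OUT_CHANNELS = [128, 256, 512, 512]
UNET_SAMPLE_SIZE = 32
UNET_CROSS_ATTENTION_DIM = 1024
TEXT_ENCODER_HIDDEN_SIZE = 1024


def vae_scale_factor(block_out_channels: list) -> int:
    """Spatial downsampling factor between pixels and latents.

    Assumption: each block after the first halves height and width.
    """
    return 2 ** (len(block_out_channels) - 1)


factor = vae_scale_factor(VAE_BLOCK_OUT_CHANNELS)
pixels = UNET_SAMPLE_SIZE * factor

# The UNet cross-attends to text features, so the two widths must agree.
widths_match = UNET_CROSS_ATTENTION_DIM == TEXT_ENCODER_HIDDEN_SIZE

print(f"downsampling factor: {factor}")
print(f"frame size implied by the UNet sample size: {pixels} x {pixels}")
print(f"text encoder width matches cross-attention width: {widths_match}")
```

Under that assumption, the UNet's 32 x 32 latent sample size corresponds to 256 x 256 pixel frames.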
vae/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:36bb8e1b54aba3a0914eb35fba13dcb107e9f18d379d1df2158732cd4bf56a94
+ size 334711857
vae/diffusion_pytorch_model.fp16.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c13979094c6566d9aa5936879055457022f34a747eeba12542504853385077c8
+ size 167405395
vae/diffusion_pytorch_model.fp16.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c8b011e5a18c53888d51a81aa28223ddec87b450c14dc9650d9c3ebbcd17624e
+ size 167335350
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1d993488569e928462932c8c38a0760b874d166399b14414135bd9c42df5815
+ size 334643276