kylielee505
committed on
Commit • b9de419
Parent(s): 1e01ce9
Upload folder using huggingface_hub
Browse files
- .gitattributes +0 -1
- ast_indexer +0 -0
- hub/damo/text-to-video-synthesis/.mdl +0 -0
- hub/damo/text-to-video-synthesis/.msc +0 -0
- hub/damo/text-to-video-synthesis/README.md +105 -0
- hub/damo/text-to-video-synthesis/VQGAN_autoencoder.pth +3 -0
- hub/damo/text-to-video-synthesis/configuration.json +34 -0
- hub/damo/text-to-video-synthesis/open_clip_pytorch_model.bin +3 -0
- hub/damo/text-to-video-synthesis/text2video_pytorch_model.pth +3 -0
.gitattributes
CHANGED
```diff
@@ -25,7 +25,6 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
```
ast_indexer
ADDED
The diff for this file is too large to render.
See raw diff
hub/damo/text-to-video-synthesis/.mdl
ADDED
Binary file (51 Bytes). View file
hub/damo/text-to-video-synthesis/.msc
ADDED
Binary file (403 Bytes). View file
hub/damo/text-to-video-synthesis/README.md
ADDED
@@ -0,0 +1,105 @@
---
tasks:
- text-to-video-synthesis
widgets:
- task: text-to-video-synthesis
  inputs:
  - type: text
    name: text
    title: Enter an English prompt
    validator:
      max_words: 75
  examples:
  - name: 1
    title: Example 1
    inputs:
    - name: text
      data: A panda eating bamboo on a rock.
  inferencespec:
    cpu: 4
    memory: 16000
    gpu: 1
    gpu_memory: 32000
domain:
- multi-modal
frameworks:
- pytorch
backbone:
- diffusion
metrics:
- realism
- text-video similarity
license: Apache License 2.0
tags:
- text2video generation
- diffusion model
- text-to-video
- text-generated video
- text-to-video generation
- generation
---

# Text-to-Video Synthesis Large Model - English - General Domain

This model is a multi-stage text-to-video generation diffusion model: given an input text description, it returns a video that matches the description. Only English input is supported.

## Model description

The text-to-video generation diffusion model consists of three sub-networks: text feature extraction, a diffusion model mapping text features to the video latent space, and a decoder mapping the video latent space to the video visual space; the overall model has about 1.7 billion parameters. English input is supported. The diffusion model adopts a UNet3D structure and generates video by iteratively denoising a pure Gaussian-noise video.
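The iterative denoising described above can be sketched in a few lines. This is a minimal illustration of the principle only, not the ModelScope implementation: the toy `toy_unet` stands in for the real UNet3D noise predictor, and the simplified update omits the noise-schedule rescaling of a real sampler.

```python
# Minimal sketch of diffusion-style iterative denoising for video generation.
# NOT the actual ModelScope pipeline; toy_unet stands in for the real UNet3D.
import torch

num_timesteps = 50                       # the real model uses 1000
frames, channels, h, w = 16, 4, 32, 32   # latent-space video shape (assumed)

def toy_unet(x, t):
    # Stand-in for the UNet3D noise predictor: returns predicted noise ("eps").
    return 0.1 * x

# Start from a pure Gaussian-noise video in latent space.
x = torch.randn(1, channels, frames, h, w)

# Iteratively remove the predicted noise, stepping t from T-1 down to 0.
for t in reversed(range(num_timesteps)):
    eps = toy_unet(x, t)
    x = x - eps  # simplified update; a real step rescales by the noise
                 # schedule and re-injects fresh noise while t > 0

latent_video = x  # would then be decoded by the VQGAN autoencoder
```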

### Intended use and scope

The model has broad applicability: it can run inference on an arbitrary English text description and generate a corresponding video.

### How to use

Under the ModelScope framework, the model can be used by calling a simple pipeline. The input must be a dictionary whose only valid key is 'text', mapped to a short piece of text. The model currently supports inference on GPU only. A concrete code example follows.

#### Install the additional dependency

```shell
pip install open_clip_torch
```

#### Code example

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

p = pipeline('text-to-video-synthesis', 'damo/text-to-video-synthesis')
test_text = {
    'text': 'A panda eating bamboo on a rock.',
}
output_video_path = p(test_text)[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)
```

### Model limitations and possible biases

* The model was trained on public datasets such as WebVid, so generated results may reflect biases in the training-data distribution.
* The model cannot produce flawless, film-quality video.
* The model cannot render clear, legible text.
* The model was trained mainly on English corpora and does not currently support other languages.
* Performance on complex compositional generation tasks still needs improvement.

### Misuse, malicious use, and out-of-scope use

* The model was not trained to realistically represent people or events, so generating such content is beyond its capabilities.
* It must not be used to generate content that demeans or harms people or their environment, culture, religion, etc.
* It must not be used to generate pornographic, violent, or gory content.
* It must not be used to generate false or misleading information.

## Training data

The training data includes public datasets such as LAION-5B, ImageNet, and WebVid. Images and videos were filtered before pre-training using aesthetic scores, watermark scores, and deduplication.

## Related papers and citation

```BibTeX
@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models},
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
hub/damo/text-to-video-synthesis/VQGAN_autoencoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
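A Git LFS pointer file like this one records only the SHA-256 hash and byte size of the real file. A downloaded checkpoint can be checked against its pointer with a short script; this is a hedged sketch in which the inlined pointer text mirrors the file above, and the verification paths are assumptions.

```python
# Verify a downloaded file against its Git LFS pointer (illustrative sketch).
import hashlib

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
"""

def parse_pointer(text):
    # Pointer files are "key value" lines; pull out the oid hash and size.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, expected_hash = fields["oid"].split(":")
    return algo, expected_hash, int(fields["size"])

def verify(path, pointer):
    # Stream the file so multi-gigabyte checkpoints don't need to fit in RAM.
    algo, expected_hash, expected_size = parse_pointer(pointer)
    h = hashlib.new(algo)
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
            size += len(chunk)
    return size == expected_size and h.hexdigest() == expected_hash
```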
hub/damo/text-to-video-synthesis/configuration.json
ADDED
@@ -0,0 +1,34 @@
```json
{
  "framework": "pytorch",
  "task": "text-to-video-synthesis",
  "model": {
    "type": "latent-text-to-video-synthesis",
    "model_args": {
      "ckpt_clip": "open_clip_pytorch_model.bin",
      "ckpt_unet": "text2video_pytorch_model.pth",
      "ckpt_autoencoder": "VQGAN_autoencoder.pth",
      "max_frames": 16,
      "tiny_gpu": 1
    },
    "model_cfg": {
      "unet_in_dim": 4,
      "unet_dim": 320,
      "unet_y_dim": 768,
      "unet_context_dim": 1024,
      "unet_out_dim": 4,
      "unet_dim_mult": [1, 2, 4, 4],
      "unet_num_heads": 8,
      "unet_head_dim": 64,
      "unet_res_blocks": 2,
      "unet_attn_scales": [1, 0.5, 0.25],
      "unet_dropout": 0.1,
      "temporal_attention": "True",
      "num_timesteps": 1000,
      "mean_type": "eps",
      "var_type": "fixed_small",
      "loss_type": "mse"
    }
  },
  "pipeline": {
    "type": "latent-text-to-video-synthesis"
  }
}
```
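The configuration can be read with standard `json` tooling; for instance, `unet_dim_mult` scales the base `unet_dim` into the channel width at each UNet resolution level. A minimal sketch (the JSON is inlined with only a subset of fields, since reading a local `configuration.json` path is an assumption):

```python
# Sketch: reading the model configuration the way a pipeline might.
# A subset of configuration.json is inlined here for illustration.
import json

config_text = """{
  "framework": "pytorch",
  "task": "text-to-video-synthesis",
  "model": {
    "type": "latent-text-to-video-synthesis",
    "model_args": {"max_frames": 16, "tiny_gpu": 1},
    "model_cfg": {"unet_dim": 320, "unet_dim_mult": [1, 2, 4, 4],
                  "num_timesteps": 1000, "mean_type": "eps"}
  }
}"""

cfg = json.loads(config_text)
model_cfg = cfg["model"]["model_cfg"]

# Channel width at each UNet resolution level: base dim times each multiplier.
channels = [model_cfg["unet_dim"] * m for m in model_cfg["unet_dim_mult"]]
print(channels)                        # [320, 640, 1280, 1280]
print(model_cfg["num_timesteps"])      # 1000
```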
hub/damo/text-to-video-synthesis/open_clip_pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4
size 3944692325
hub/damo/text-to-video-synthesis/text2video_pytorch_model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d9609d02717b799137a97244844ab6df0d1a071568a1d24dcb62d9050f3a24a3
size 5645549049