---
license: apache-2.0
---
# Kandinsky 4.0 T2V Flash: Text-to-Video diffusion model

<br><br><br><br>

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/Mi3ugli7f1MNNVWC5gzMS.png" />
</div>

<div align="center">
  <a href="https://habr.com/ru/companies/sberbank/articles/866156/">Kandinsky 4.0 Post</a> | <a href=https://ai-forever.github.io/Kandinsky-4/K40/>Project Page</a> | <a href=https://huggingface.co/spaces/ai-forever/kandinsky-4-t2v-flash>Generate</a> | <a>Technical Report</a> | <a href=https://github.com/ai-forever/Kandinsky-4>GitHub</a> | <a href=https://huggingface.co/ai-forever/kandinsky-4-t2v-flash> Kandinsky 4.0 T2V Flash HuggingFace</a> | <a href=https://huggingface.co/ai-forever/kandinsky-4-v2a> Kandinsky 4.0 V2A HuggingFace</a>
</div>




<br><br><br><br>

<table border="0" style="width: 200px; text-align: left; margin-top: 20px;">
  <tr>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/iDiG7YNd8c0HiMqfYxqZb.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/vU8N8Xo6D32VJBUXxtdjg.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/L3qsU4Ug0fuv2doHUXbUD.mp4" width=200 controls autoplay loop></video>
      </td>
  </tr>

  <tr>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/GSeOun-dqJZkVn-rOWT7Q.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/9qWVI6J_DgaMH0_D6-rAF.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/jTjMlVzbDSK5mmRLfv3DZ.mp4" width=200 controls autoplay loop></video>
      </td>
  </tr>


  <tr>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/c3QrKiwTkIRGpj4Dr3czP.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/4N6QhLRhQLHU_jMLUbkDa.mp4" width=200 controls autoplay loop></video>
      </td>
      <td>
          <video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/IHHkpSWZ86PCBhJBhctEJ.mp4" width=200 controls autoplay loop></video>
      </td>
  </tr>
</table>



## Description:

Kandinsky 4.0 T2V Flash is a latent-diffusion text-to-video model that generates **12-second videos** at 480p resolution in **11 seconds** on a single NVIDIA H100 GPU. The pipeline consists of the 3D causal [CogVideoX](https://arxiv.org/pdf/2408.06072) VAE, the [T5-V1.1-XXL](https://huggingface.co/google/t5-v1_1-xxl) text embedder, and our trained MMDiT-like transformer model.


<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/W09Zl2q-TbLYV4xUE3Bte.png" height=500>
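
To make the data flow concrete, here is a minimal, purely illustrative sketch of the three-stage flow. All names, signatures, and latent shapes below are hypothetical and do not match the released API (see the actual inference code further down):

```python
import torch

def generate_video_sketch(prompt, text_embedder, dit, vae, scheduler,
                          num_frames=48, height=384, width=672):
    """Illustrative three-stage T2V flow; names and shapes are hypothetical."""
    # 1. Encode the prompt into conditioning embeddings (T5-V1.1-XXL).
    text_emb = text_embedder(prompt)

    # 2. Iteratively denoise a random spatio-temporal latent with the MMDiT
    #    transformer; the distilled Flash model needs only a few steps.
    latents = torch.randn(1, 16, num_frames, height // 8, width // 8)
    for t in scheduler.timesteps:
        noise_pred = dit(latents, text_emb, t)
        latents = scheduler.step(noise_pred, t, latents)

    # 3. Decode the denoised latents to pixel space (3D causal CogVideoX VAE).
    return vae.decode(latents)
```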

Generation speed is a serious problem for all diffusion models, and especially for video generation models. To address it, we used Latent Adversarial Diffusion Distillation (LADD), an approach proposed for distilling image generation models, first described in an [article](https://arxiv.org/pdf/2403.12015) from Stability AI, and previously tested by us when training the [Kandinsky 3.1](https://github.com/ai-forever/Kandinsky-3) image generation model. The distillation pipeline fine-tunes the diffusion model in a GAN setup, i.e. the diffusion generator is trained jointly with a discriminator.

<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/A9l81FaWiBJl5Vx06XbIc.png">
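
As a rough illustration of this joint training, here is a heavily simplified, hypothetical sketch of one LADD-style optimization step. All names are ours and `scheduler.add_noise` is an assumed helper; this is not the actual training code:

```python
import torch.nn.functional as F

def ladd_step_sketch(student, discriminator, scheduler, opt_g, opt_d,
                     real_latents, text_emb, t):
    """One simplified adversarial distillation step (illustrative only)."""
    # The student denoises a noised latent in a single few-step pass.
    noisy = scheduler.add_noise(real_latents, t)  # hypothetical helper
    fake_latents = student(noisy, text_emb, t)

    # Discriminator update: separate real latents from student outputs
    # (non-saturating logistic loss).
    d_loss = (F.softplus(-discriminator(real_latents, text_emb, t)).mean()
              + F.softplus(discriminator(fake_latents.detach(), text_emb, t)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: the student is trained to fool the discriminator.
    g_loss = F.softplus(-discriminator(fake_latents, text_emb, t)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```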

## Architecture

To train Kandinsky 4.0 T2V Flash we used the following diffusion transformer architecture, based on the MMDiT proposed in [Stable Diffusion 3](https://arxiv.org/pdf/2403.03206).



<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/UjY8BqRUJ_H0lkgb_PKNY.png"> <img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/fiMHO1CoR8JQjRXqXNE8k.png">
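
The defining feature of an MMDiT-style block is joint attention over text and video tokens with modality-specific projections. The self-contained sketch below shows that mechanism in simplified form; it is an illustration of the general technique, not the released architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionSketch(nn.Module):
    """MMDiT-style joint attention: each modality has its own QKV projection,
    but attention runs over the concatenated token sequence, so text and video
    tokens exchange information in every block. (Simplified illustration.)"""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv_txt = nn.Linear(dim, 3 * dim)  # text-stream projection
        self.qkv_vid = nn.Linear(dim, 3 * dim)  # video-stream projection
        self.out_txt = nn.Linear(dim, dim)
        self.out_vid = nn.Linear(dim, dim)

    def forward(self, txt: torch.Tensor, vid: torch.Tensor):
        B, T, D = txt.shape
        V = vid.shape[1]
        H, d = self.num_heads, D // self.num_heads

        def split(qkv, n):
            q, k, v = qkv.chunk(3, dim=-1)
            return [x.view(B, n, H, d).transpose(1, 2) for x in (q, k, v)]

        qt, kt, vt = split(self.qkv_txt(txt), T)
        qv, kv, vv = split(self.qkv_vid(vid), V)

        # Joint attention over the concatenated text+video sequence.
        q = torch.cat([qt, qv], dim=2)
        k = torch.cat([kt, kv], dim=2)
        v = torch.cat([vt, vv], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, T + V, D)

        # Split back into modality streams with separate output projections.
        return self.out_txt(out[:, :T]), self.out_vid(out[:, T:])
```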

To train the Flash version we used the following discriminator architecture. Each discriminator head is structured like half of an MMDiT block.

<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/3H0JG0GvAvraesgBrgKTK.png"> <img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/GR6losxCXvlmgg4zSNTuV.png">
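
A minimal sketch of what "half of an MMDiT block" followed by a classification output could look like; this is purely hypothetical, for intuition only:

```python
import torch
import torch.nn as nn

class DiscriminatorHeadSketch(nn.Module):
    """Hypothetical head: a single-stream norm + self-attention half-block
    over video tokens, followed by a real/fake logit. Illustrative only."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_logit = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.norm(feats)
        x, _ = self.attn(x, x, x)
        return self.to_logit(x.mean(dim=1))  # one logit per sample
```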

## Installation

```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
pip install -r kandinsky4_video2audio/requirements.txt
```

## Inference:
```python
import torch
from IPython.display import Video
from kandinsky import get_T2V_pipeline

# Place all pipeline components on a single GPU
device_map = {
    "dit": torch.device('cuda:0'),
    "vae": torch.device('cuda:0'),
    "text_embedder": torch.device('cuda:0')
}

pipe = get_T2V_pipeline(device_map)

# Generate a 12-second 480p video and save it to ./test.mp4
images = pipe(
    seed=42,
    time_length=12,
    width=672,
    height=384,
    save_path="./test.mp4",
    text="Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance",
)

Video("./test.mp4")
```
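
Since `get_T2V_pipeline` takes a device map, it should also be possible to spread the components across several GPUs when a single card does not have enough memory. The placement below is our untested assumption, not a documented configuration:

```python
# Hypothetical multi-GPU placement (untested assumption): keep the large DiT
# on one GPU and move the VAE and text embedder to another.
device_map = {
    "dit": torch.device('cuda:0'),
    "vae": torch.device('cuda:1'),
    "text_embedder": torch.device('cuda:1')
}
pipe = get_T2V_pipeline(device_map)
```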

Usage examples and a more detailed description of the parameters can be found in the [examples.ipynb](https://github.com/ai-forever/Kandinsky-4/examples.ipynb) notebook.

Make sure that you have a `weights` folder containing the weights of all the models.

We also provide distributed inference: [run_inference_distil.py](https://github.com/ai-forever/Kandinsky-4/run_inference_distil.py)

To run this example:
```bash
python -m torch.distributed.launch --nnodes n --nproc-per-node m run_inference_distil.py
```
where `n` is the number of nodes you have and `m` is the number of GPUs on each node. The code was tested with `n=1` and `m=8`, so these are the preferred parameters.
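
For the tested single-node, 8-GPU configuration, the command becomes:

```bash
# Tested configuration: one node with 8 GPUs
python -m torch.distributed.launch --nnodes 1 --nproc-per-node 8 run_inference_distil.py
```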

In the distributed setting, the DiT model is parallelized across all GPUs using tensor parallelism, which enables a significant speedup.

To run the example from the terminal without tensor parallelism:
```bash
python run_inference_distil.py
```


## Authors

### Project Leader 

Denis Dimitrov

### Scientific Consultants

Andrey Kuznetsov, Sergey Markov

### Training Pipeline & Model Pretrain & Model Distillation

Vladimir Arkhipkin, Novitskiy Lev, Maria Kovaleva

### Model Architecture

Vladimir Arkhipkin, Maria Kovaleva, Zein Shaheen, Arsen Kuzhamuratov, Nikolay Gerasimenko, Mikhail Zhirnov, Alexandr Gambashidze, Konstantin Sobolev

### Data Pipeline

Ivan Kirillov, Andrei Shutkin, Kirill Chernishev, Julia Agafonova, Denis Parkhomenko

### Video-to-audio model

Zein Shaheen, Arseniy Shakhmatov, Denis Parkhomenko

### Quality Assessment

Nikolay Gerasimenko, Anna Averchenkova, Victor Panshin, Vladislav Veselov, Pavel Perminov, Vladislav Rodionov, Sergey Skachkov, Stepan Ponomarev

### Other Contributors

Viacheslav Vasilev, Andrei Filatov, Gregory Leleytner