ai-forever committed dc306f2 (parent: 947b728): Create README.md

---
license: apache-2.0
---
# Kandinsky-4 Flash: Text-to-Video diffusion model

![]()

[Kandinsky 4.0 Post]() | [Project Page]() | [Generate]() | [Telegram-bot]() | [Technical Report]() | [GitHub](https://github.com/ai-forever/Kandinsky-4) | [HuggingFace](https://huggingface.co/ai-forever/kandinsky4) |

## Description:

Kandinsky 4.0 is a text-to-video generation model based on latent diffusion for 480p and HD resolutions. Here we present the distilled version of this model, Kandinsky 4 Flash, which can generate a 12-second video at 480p resolution in 11 seconds on a single GPU. The pipeline consists of the 3D causal [CogVideoX](https://arxiv.org/pdf/2408.06072) VAE, the [T5-V1.1-XXL](https://huggingface.co/google/t5-v1_1-xxl) text embedder, and our trained MMDiT-like transformer model.

<img src="https://github.com/ai-forever/Kandinsky-4/assets/pipeline.png">

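At a high level these three components interact as in a standard latent diffusion loop. The sketch below is only a conceptual illustration with hypothetical function and argument names, not the actual Kandinsky 4 API (the real entry point, `get_T2V_pipeline`, is shown in the "How to use" section below):

```python
import torch

def text_to_video_sketch(prompt, text_encoder, transformer, vae, scheduler,
                         latent_shape, num_steps):
    """Conceptual latent-diffusion flow (hypothetical names, illustrative only)."""
    text_emb = text_encoder(prompt)            # T5-V1.1-XXL prompt embedding
    latents = torch.randn(latent_shape)        # start from Gaussian noise in latent space
    for t in scheduler.timesteps(num_steps):   # iterative denoising by the MMDiT-like transformer
        pred = transformer(latents, t, text_emb)
        latents = scheduler.step(pred, t, latents)
    return vae.decode(latents)                 # 3D causal VAE decodes latents into video frames
```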

A serious problem for all diffusion models, and especially for video generation models, is generation speed. To address this, we used the Latent Adversarial Diffusion Distillation (LADD) approach, originally proposed for distilling image generation models in the [article](https://arxiv.org/pdf/2403.12015) from Stability AI and previously tested by us when training the [Kandinsky 3.1](https://github.com/ai-forever/Kandinsky-3) image generation model. The distillation pipeline involves additional training of the diffusion model in a GAN setup, i.e. joint training of the diffusion generator with a discriminator.

<img src="https://github.com/ai-forever/Kandinsky-4/assets/LADD.png">

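A minimal sketch of what such a joint generator/discriminator update can look like is given below. It assumes hypothetical `student`, `discriminator` and optimizer objects and a simple hinge loss, and is only meant to illustrate the adversarial distillation idea, not to reproduce the actual Kandinsky 4 Flash training code:

```python
import torch

def ladd_step_sketch(student, discriminator, real_latents, text_emb, opt_g, opt_d):
    """One illustrative LADD-style step: a few-step student generator and a
    latent-space discriminator are trained jointly, as in a GAN (hypothetical API)."""
    # the student produces video latents in a small number of denoising steps
    noise = torch.randn_like(real_latents)
    fake_latents = student(noise, text_emb, num_steps=4)

    # discriminator update: separate real latents from student outputs (hinge loss)
    d_loss = (torch.relu(1.0 - discriminator(real_latents, text_emb)).mean()
              + torch.relu(1.0 + discriminator(fake_latents.detach(), text_emb)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator update: push the discriminator towards classifying student outputs as real
    g_loss = -discriminator(fake_latents, text_emb).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```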

## Architecture

For training Kandinsky 4 Flash we used the following diffusion transformer architecture, based on the MMDiT proposed in [Stable Diffusion 3](https://arxiv.org/pdf/2403.03206).

<img src="https://github.com/ai-forever/Kandinsky-4/assets/MMDiT1.png"> <img src="https://github.com/ai-forever/Kandinsky-4/assets/MMDiT_block1.png">

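The key property of an MMDiT-style block is that text tokens and video-latent tokens keep separate projection and MLP weights while exchanging information through one joint attention operation. The simplified sketch below illustrates that structure only; it is an assumption-level illustration and omits the modulation, normalization and dimension details of the actual Kandinsky 4 transformer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlockSketch(nn.Module):
    """Simplified sketch of an MMDiT-style block (illustrative, not the released code)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)   # per-modality QKV projections
        self.qkv_video = nn.Linear(dim, 3 * dim)
        self.out_text = nn.Linear(dim, dim)
        self.out_video = nn.Linear(dim, dim)
        self.mlp_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_video = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _heads(self, x):
        b, n, d = x.shape
        return x.view(b, n, self.n_heads, d // self.n_heads).transpose(1, 2)

    def forward(self, text, video):
        b, n_text, dim = text.shape
        # separate projections, one joint attention over the concatenated sequence
        q_t, k_t, v_t = self.qkv_text(text).chunk(3, dim=-1)
        q_v, k_v, v_v = self.qkv_video(video).chunk(3, dim=-1)
        q = self._heads(torch.cat([q_t, q_v], dim=1))
        k = self._heads(torch.cat([k_t, k_v], dim=1))
        v = self._heads(torch.cat([v_t, v_v], dim=1))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, -1, dim)
        text_out, video_out = out[:, :n_text], out[:, n_text:]
        # per-modality output projection, residual and MLP
        text = text + self.out_text(text_out)
        video = video + self.out_video(video_out)
        return text + self.mlp_text(text), video + self.mlp_video(video)
```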

For training the Flash version we used the following discriminator architecture; the discriminator head structure resembles half of an MMDiT block.

<img src="https://github.com/ai-forever/Kandinsky-4/assets/discriminator.png"> <img src="https://github.com/ai-forever/Kandinsky-4/assets/discriminator_head.png">

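For illustration only, a head "shaped like half of an MMDiT block" can be read as a single attention-plus-MLP pass over the intermediate features followed by a real/fake logit. The sketch below is our reading of that description, not the actual implementation:

```python
import torch.nn as nn

class DiscriminatorHeadSketch(nn.Module):
    """Illustrative reading of the description above (not the released code)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.to_logit = nn.Linear(dim, 1)

    def forward(self, features):                       # features: (batch, tokens, dim)
        x = self.norm(features)
        x = features + self.attn(x, x, x, need_weights=False)[0]
        x = x + self.mlp(x)
        return self.to_logit(x.mean(dim=1))            # one realness score per sample
```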

## How to use:
```python
import torch
from IPython.display import Video
from kandinsky import get_T2V_pipeline

# Place all pipeline components on a single GPU
device_map = {
    "dit": torch.device('cuda:0'),
    "vae": torch.device('cuda:0'),
    "text_embedder": torch.device('cuda:0')
}

pipe = get_T2V_pipeline(device_map)

# Generate a 12-second 672x384 video and save it to ./test.mp4
images = pipe(
    seed=42,
    time_length=12,
    width=672,
    height=384,
    save_path="./test.mp4",
    text="Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance",
)

Video("./test.mp4")
```

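The `device_map` above puts every component on `cuda:0`. If you have several GPUs, it should be possible to spread the components across devices by pointing the entries at different GPUs; this is an assumption based on the shape of the dictionary above, not a documented feature, so check the repository if it does not work for your setup:

```python
# Hypothetical multi-GPU placement (assumption, not verified against the actual API)
device_map = {
    "dit": torch.device('cuda:0'),            # the transformer is the largest component
    "vae": torch.device('cuda:1'),
    "text_embedder": torch.device('cuda:1'),
}
pipe = get_T2V_pipeline(device_map)
```
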
Usage examples and a more detailed description of the parameters can be found in the [examples.ipynb](https://github.com/ai-forever/Kandinsky-4/examples.ipynb) notebook.

Make sure that you have a weights folder containing the weights of all the models.

We also provide the option of distributed inference: [run_inference_distil.py](https://github.com/ai-forever/Kandinsky-4/run_inference_distil.py)

To run this example:
```
python -m torch.distributed.launch --nnodes n --nproc-per-node m run_inference_distil.py
```
where n is the number of nodes you have and m is the number of GPUs per node. The code was tested with n=1 and m=8, so these are the preferred parameters.

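For example, on a single node with 8 GPUs (the configuration the code was tested with), the command becomes:

```
python -m torch.distributed.launch --nnodes 1 --nproc-per-node 8 run_inference_distil.py
```
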
In the distributed setting the DiT model is parallelized using tensor parallelism across all GPUs, which enables a significant speedup.

To run this example from the terminal without tensor parallelism:
```
python run_inference_distil.py
```

## Authors
+ Lev Novitkiy: [GitHub](https://github.com/leffff), [Blog](https://t.me/mlball_days)
+ Maria Kovaleva: [GitHub](https://github.com/MarKovka20)
+ Vladimir Arkhipkin: [GitHub](https://github.com/oriBetelgeuse)
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88)
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp)
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman)
+ Zein Shaheen: [GitHub](https://github.com/zeinsh)
+ Viacheslav Vasilev: [GitHub](https://github.com/vivasilev)
+ Andrei Filatov: [GitHub](https://github.com/anvilarth)
+ Julia Agafonova
+ Nikolay Gerasimenko: [GitHub](https://github.com/Nikolay-Gerasimenko)
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)