File size: 6,375 Bytes
4ad85e9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
---
license: apache-2.0
---
# Kandinsky Video — a new text-to-video generation model 
## SoTA quality among open-source solutions

This repository is the official implementation of Kandinsky Video model


Paper | [Project](https://ai-forever.github.io/kandinsky-video/) | ![Hugging Face Spaces](https://img.shields.io/badge/🤗-Huggingface-yello.svg) | [Telegram-bot](https://t.me/video_kandinsky_bot) | Habr post


<p align="center">
<img src="__assets__/title.JPG" width="800px"/>
<br>
<em>Kandinsky Video is a text-to-video generation model, which is based on the FusionFrames architecture, consisting of two main stages: keyframe generation and interpolation. Our approach for temporal conditioning allows us to generate videos with high-quality appearance, smoothness and dynamics.</em>
</p>



## Pipeline

<p align="center">
<img src="__assets__/pipeline.jpg" width="800px"/>
<br>
<em>The encoded text prompt enters the U-Net keyframe generation model with temporal layers or blocks, and then the sampled latent keyframes are  sent to the latent interpolation model in such a way as to predict three interpolation frames between two keyframes. A temporal MoVQ-GAN decoder is used to get the final video result.</em>
</p>


**Architecture details**

+ Text encoder (Flan-UL2) - 8.6B
+ Latent Diffusion U-Net3D - 4.0B
+ MoVQ encoder/decoder - 256M


## How to use

Check our jupyter notebooks with examples in `./examples` folder
### 1. text2video

```python
from video_kandinsky3 import get_T2V_pipeline

t2v_pipe = get_T2V_pipeline('cuda', fp16=True)

pfps = 'medium' # ['low', 'medium', 'high']
video = t2v_pipe(
    'a red car is drifting on the mountain road, close view, fast movement',
    width=640, height=384, fps=fps
)
```


## Results


<table class="center">
<tr>
  <td><img src="__assets__/results/A car moving on the road from the sea to the mountains.gif" raw=true></td>
  <td><img src="__assets__/results/A red car drifting, 4k video.gif"></td>
  <td><img src="__assets__/results/chemistry laboratory, chemical explosion, 4k.gif"></td>
  <td><img src="__assets__/results/Erupting volcano_ raw power, molten lava, and the forces of the Earth.gif"></td>
</tr>
<tr>
  <td width=25% align="center">"A car moving on the road from the sea to the mountains"</td>
  <td width=25% align="center">"A red car drifting, 4k video"</td>
  <td width=25% align="center">"Chemistry laboratory, chemical explosion, 4k"</td>
  <td width=25% align="center">"Erupting volcano raw power, molten lava, and the forces of the Earth"</td>
</tr>

<tr>
  <td><img src="__assets__/results/luminescent jellyfish swims underwater, neon, 4k.gif" raw=true></td>
  <td><img src="__assets__/results/Majestic waterfalls in a lush rainforest_ power, mist, and biodiversity.gif"></td>
  <td><img src="__assets__/results/white ghost flies through a night clearing, 4k.gif"></td>
  <td><img src="__assets__/results/Wildlife migration_ herds on the move, crossing landscapes in harmony.gif"></td>
</tr>
<tr>
  <td width=25% align="center">"Luminescent jellyfish swims underwater, neon, 4k"</td>
  <td width=25% align="center">"Majestic waterfalls in a lush rainforest power, mist, and biodiversity"</td>
  <td width=25% align="center">"White ghost flies through a night clearing, 4k"</td>
  <td width=25% align="center">"Wildlife migration herds on the move, crossing landscapes in harmony"</td>
</tr>

<tr>
  <td><img src="__assets__/results/Majestic humpback whale breaching_ power, grace, and ocean spectacle.gif" raw=true></td>
  <td><img src="__assets__/results/Evoke the sense of wonder in a time-lapse journey through changing seasons..gif"></td>
  <td><img src="__assets__/results/Explore the fascinating world of underwater creatures in a visually stunning sequence.gif"></td>
  <td><img src="__assets__/results/Polar ice caps_ the pristine wilderness of the Arctic and Antarctic.gif"></td>
</tr>
<tr>
  <td width=25% align="center">"Majestic humpback whale breaching power, grace, and ocean spectacle"</td>
  <td width=25% align="center">"Evoke the sense of wonder in a time-lapse journey through changing seasons"</td>
  <td width=25% align="center">"Explore the fascinating world of underwater creatures in a visually stunning sequence"</td>
  <td width=25% align="center">"Polar ice caps the pristine wilderness of the Arctic and Antarctic"</td>
</tr>


<tr>
  <td><img src="__assets__/results/Rolling waves on a sandy beach_ relaxation, rhythm, and coastal beauty.gif" raw=true></td>
  <td><img src="__assets__/results/Sloth in slow motion_ deliberate movements, relaxation, and arboreal life.gif"></td>
  <td><img src="__assets__/results/Time-lapse of a flower blooming_ growth, beauty, and the passage of time..gif"></td>
  <td><img src="__assets__/results/Craft a heartwarming narrative showcasing the bond between a human and their loyal pet companion..gif"></td>
</tr>
<tr>
  <td width=25% align="center">"Rolling waves on a sandy beach relaxation, rhythm, and coastal beauty"</td>
  <td width=25% align="center">"Sloth in slow motion deliberate movements, relaxation, and arboreal life"</td>
  <td width=25% align="center">"Time-lapse of a flower blooming growth, beauty, and the passage of time"</td>
  <td width=25% align="center">"Craft a heartwarming narrative showcasing the bond between a human and their loyal pet companion"</td>
</tr>


</table>


# Authors

+ Vladimir Arkhipkin: [Github](https://github.com/oriBetelgeuse), [Google Scholar](https://scholar.google.com/citations?user=D-Ko0oAAAAAJ&hl=ru)
+ Zein Shaheen: [Github](https://github.com/zeinsh), [Google Scholar](https://scholar.google.ru/citations?user=bxlgMxMAAAAJ&hl=en)
+ Viacheslav Vasilev: [Github](https://github.com/vivasilev), [Google Scholar](https://scholar.google.com/citations?user=redAz-kAAAAJ&hl=ru&oi=sra)
+ Igor Pavlov: [Github](https://github.com/boomb0om)
+ Elizaveta Dakhova: [Github](https://github.com/LizaDakhova)
+ Anastasia Lysenko: [Github](https://github.com/LysenkoAnastasia)
+ Sergey Markov
+ Denis Dimitrov: [Github](https://github.com/denndimitrov), [Google Scholar](https://scholar.google.com/citations?user=3JSIJpYAAAAJ&hl=ru&oi=ao)
+ Andrey Kuznetsov: [Github](https://github.com/kuznetsoffandrey), [Google Scholar](https://scholar.google.com/citations?user=q0lIfCEAAAAJ&hl=ru)


## BibTeX
If you use our work in your research, please cite our publication:
```
TBD
```