Kandinsky 4.0 T2V Flash: Text-to-Video diffusion model

Description:

Kandinsky 4.0 T2V Flash is a text-to-video generation model based on latent diffusion. It generates 12-second videos at 480p resolution in 11 seconds on a single NVIDIA H100 GPU. The pipeline consists of the 3D causal CogVideoX VAE, the T5-V1.1-XXL text embedder, and our trained MMDiT-like transformer model.

Generation speed is a serious problem for all diffusion models, and especially for video generation models. To address it, we used the Latent Adversarial Diffusion Distillation (LADD) approach, which was proposed for distilling image generation models, first described in an article from Stability AI, and tested by us when training the Kandinsky 3.1 image generation model. The distillation pipeline additionally trains the diffusion model in a GAN setup, i.e. the diffusion generator is trained jointly with a discriminator.
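To make the idea of joint generator and discriminator training more concrete, here is a minimal sketch of a hinge-style adversarial objective. The hinge form is an assumption chosen for illustration; this document does not state which adversarial loss was used, and the function names are hypothetical. The inputs are plain lists of discriminator scores (logits) for real and generated samples.

```python
def discriminator_hinge_loss(real_logits, fake_logits):
    """Hinge loss for the discriminator (assumed form, for illustration only).

    Pushes scores of real samples above +1 and scores of generated
    samples below -1.
    """
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_hinge_loss(fake_logits):
    """Generator loss: raise the discriminator's score on generated samples."""
    return -sum(fake_logits) / len(fake_logits)


# Toy scores: two real samples and two generated samples.
d_loss = discriminator_hinge_loss([2.0, 0.5], [-2.0, 0.0])
g_loss = generator_hinge_loss([-2.0, 0.0])
```

In an actual training loop the two losses would be minimized in alternation, one optimizer step for the discriminator and one for the distilled generator.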

Architecture

To train Kandinsky 4.0 T2V Flash, we used a diffusion transformer architecture based on the MMDiT proposed in Stable Diffusion 3.

To train the Flash version, we used a discriminator whose head structure resembles half of an MMDiT block.

Installation

git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
pip install -r kandinsky4_video2audio/requirements.txt

Inference:

import torch
from IPython.display import Video
from kandinsky import get_T2V_pipeline

device_map = {
    "dit": torch.device('cuda:0'), 
    "vae": torch.device('cuda:0'), 
    "text_embedder": torch.device('cuda:0')
}

pipe = get_T2V_pipeline(device_map)

images = pipe(
    seed=42,
    time_length=12,
    width=672,
    height=384,
    save_path="./test.mp4",
    text="Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance",
)

Video("./test.mp4")
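The device_map in the snippet above puts all three components on cuda:0. As a rough sketch, the helper below spreads the same three keys over several GPUs in round-robin order. The helper name build_device_map and the round-robin policy are assumptions for illustration, and it is also an assumption that the pipeline works with components placed on different devices. The helper returns plain strings so it can be inspected without a GPU.

```python
def build_device_map(num_gpus):
    """Assign pipeline components to CUDA device strings (hypothetical helper).

    The keys match the device_map shown in this README; the round-robin
    placement is an arbitrary illustrative choice.
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    components = ["dit", "vae", "text_embedder"]
    return {name: f"cuda:{i % num_gpus}" for i, name in enumerate(components)}


placement = build_device_map(2)
# To pass this to the pipeline, wrap each string in torch.device, e.g.:
# device_map = {name: torch.device(dev) for name, dev in placement.items()}
```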

Usage examples and a more detailed description of the parameters are in the examples.ipynb notebook.

Make sure that you have a weights folder containing the weights of all models.
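A quick check before loading the pipeline can catch a missing or empty weights folder early. The sketch below is only an illustration: the function name missing_weights is hypothetical, and the expected entries are passed in by the caller because this README does not list the actual file or folder names.

```python
from pathlib import Path


def missing_weights(weights_root, expected):
    """Return the entries of `expected` that are absent or empty under weights_root.

    `expected` is a list of relative paths supplied by the caller; the real
    layout of the weights folder is not specified here.
    """
    root = Path(weights_root)
    missing = []
    for rel in expected:
        path = root / rel
        if path.is_file():
            continue  # a weight file that exists
        if path.is_dir() and any(path.iterdir()):
            continue  # a non-empty sub-folder
        missing.append(rel)
    return missing


# Placeholder names only; replace them with the entries your checkout expects.
# print(missing_weights("./weights", ["<dit weights>", "<vae weights>"]))
```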

We also provide a script for distributed inference: run_inference_distil.py

To run this example:

python -m torch.distributed.launch --nnodes n --nproc-per-node m run_inference_distil.py

where n is the number of nodes and m is the number of GPUs per node. The code was tested with n=1 and m=8, so these are the preferred parameters.
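For scripting, the launch command can be assembled from n and m. The helper below is a small sketch with a hypothetical name (build_launch_command); it only builds the argument list shown above and reports the total number of processes, n × m. It does not start anything itself.

```python
def build_launch_command(nnodes, nproc_per_node, script="run_inference_distil.py"):
    """Build the argument list for the distributed launch command in this README."""
    if nnodes < 1 or nproc_per_node < 1:
        raise ValueError("nnodes and nproc_per_node must be positive")
    return [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(nnodes),
        "--nproc-per-node", str(nproc_per_node),
        script,
    ]


def world_size(nnodes, nproc_per_node):
    """Total number of processes: one per GPU on every node."""
    return nnodes * nproc_per_node


cmd = build_launch_command(1, 8)  # the tested configuration, n=1 and m=8
# The list can be passed to subprocess.run(cmd, check=True) to start the job.
```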

In the distributed setting, the DiT models are parallelized across all GPUs using tensor parallelism, which enables a significant speedup.

To run this example from the terminal without tensor parallelism:
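As a rough illustration of the idea behind tensor parallelism, the sketch below splits the output features of a single linear layer across several shards, computes each shard's part independently, and concatenates the results. It is a toy in pure Python with hypothetical function names, not the implementation used in this repository; in a real setup each shard would live on its own GPU and the concatenation would be a collective all-gather.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_output_features(weight, num_shards):
    """Split the rows (output features) of a weight matrix into equal shards."""
    if len(weight) % num_shards != 0:
        raise ValueError("output features must divide evenly across shards")
    size = len(weight) // num_shards
    return [weight[i * size:(i + 1) * size] for i in range(num_shards)]


def tensor_parallel_matvec(weight, x, num_shards):
    """Compute weight @ x shard by shard, then gather the partial outputs."""
    shards = split_output_features(weight, num_shards)
    partial_outputs = [matvec(shard, x) for shard in shards]  # one per device
    return [y for part in partial_outputs for y in part]      # "all-gather"


weight = [[1, 0], [0, 1], [1, 1], [2, -1]]
x = [3, 4]
full = matvec(weight, x)
sharded = tensor_parallel_matvec(weight, x, num_shards=2)
```

Each shard holds only a fraction of the weights and does a fraction of the multiplications, which is where the speedup and the memory savings come from.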

python run_inference_distil.py

Authors

Project Leader

Denis Dimitrov

Scientific Consultants

Andrey Kuznetsov, Sergey Markov

Training Pipeline & Model Pretrain & Model Distillation

Vladimir Arkhipkin, Novitskiy Lev, Maria Kovaleva

Model Architecture

Vladimir Arkhipkin, Maria Kovaleva, Zein Shaheen, Arsen Kuzhamuratov, Nikolay Gerasimenko, Mikhail Zhirnov, Alexandr Gambashidze, Konstantin Sobolev

Data Pipeline

Ivan Kirillov, Andrei Shutkin, Kirill Chernishev, Julia Agafonova, Denis Parkhomenko

Video-to-audio model

Zein Shaheen, Arseniy Shakhmatov, Denis Parkhomenko

Quality Assessment

Nikolay Gerasimenko, Anna Averchenkova, Victor Panshin, Vladislav Veselov, Pavel Perminov, Vladislav Rodionov, Sergey Skachkov, Stepan Ponomarev

Other Contributors

Viacheslav Vasilev, Andrei Filatov, Gregory Leleytner
