LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
Here is an ML-generated summary
The paper proposes LaVie, a cascaded latent diffusion framework for high-quality text-to-video generation that uses a pre-trained text-to-image model as initialization.
The key contributions are: 1) An efficient temporal module design using temporal self-attention and rotary positional encoding. 2) A joint image-video fine-tuning strategy to mitigate catastrophic forgetting. 3) A new text-video dataset Vimeo25M with 25M high-resolution videos.
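As a rough illustration of the first contribution, the sketch below applies rotary positional encoding to per-frame feature vectors and then runs single-head self-attention along the frame axis for one spatial location. It is a minimal sketch under assumptions, not the paper's implementation: the function names (`rope`, `temporal_self_attention`), the identity query/key/value projections, and the frequency base of 10000 are all illustrative choices that do not appear on this page.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`.

    `base` is an assumed frequency base (a common default, not from the page).
    """
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_self_attention(frames):
    """Single-head attention over the frame axis for one spatial location.

    `frames` is a list of T feature vectors, one per frame. For brevity the
    query/key/value projections are the identity (an assumption); a trained
    layer would use learned projections.
    """
    d = len(frames[0])
    q = [rope(f, t) for t, f in enumerate(frames)]
    k = q  # identity projections: keys equal the rotated queries
    out = []
    for qi in q:
        scores = [dot(qi, kj) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(w)
        out.append([sum(wj / z * f[c] for wj, f in zip(w, frames))
                    for c in range(d)])
    return out

# Attention scores under RoPE depend only on the relative frame offset.
q_vec = [0.3, -1.2, 0.7, 0.5]
k_vec = [1.0, 0.2, -0.4, 0.9]
score_a = dot(rope(q_vec, 2), rope(k_vec, 5))    # offset 3
score_b = dot(rope(q_vec, 9), rope(k_vec, 12))   # offset 3
```

Because the rotation makes the query-key score a function of the frame offset alone, the same layer can in principle be applied to clips at different temporal positions, which is one reason a simple attention layer of this kind can be enough to model temporal correlations.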
- Simple temporal self-attention coupled with rotary positional encoding effectively captures temporal correlations. More complex architectures provide marginal gains.
- Joint image-video fine-tuning plays a pivotal role in producing high-quality and creative results. Direct video-only fine-tuning leads to catastrophic forgetting.
- Joint fine-tuning enables large-scale knowledge transfer from images to videos, including styles, scenes, and characters.
- A high-quality dataset like Vimeo25M is critical for training high-fidelity T2V models.
- The base T2V model is initialized from pre-trained Stable Diffusion and adapted with pseudo-3D convolutions and spatio-temporal transformers.
- The temporal interpolation model is trained to increase the frame rate 4x. It takes the base video as input and outputs 61 interpolated frames.
- The video super-resolution model is fine-tuned to increase spatial resolution to 1280x2048, using an image super-resolution model as initialization.
- Joint image-video fine-tuning is used during training to enable knowledge transfer from images to videos.
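One common way to write a joint image-video fine-tuning objective of the kind listed above is as a weighted sum of two denoising losses, one over video-text pairs and one over image-text pairs. The form below is a sketch under assumptions rather than a formula quoted from this page: $z_t^{V}$ and $z_t^{I}$ denote noised video and image latents at timestep $t$, $c^{V}$ and $c^{I}$ the corresponding text conditions, $\epsilon_\theta$ the shared denoising network, and $\alpha$ a hypothetical weight balancing the two terms.

```latex
\mathcal{L}
  = \mathbb{E}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t^{V}, t, c^{V}\right) \right\rVert_2^2\right]
  + \alpha \, \mathbb{E}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t^{I}, t, c^{I}\right) \right\rVert_2^2\right]
```

Keeping the image term in the loss means the network continues to be trained on the image distribution while it learns video, which is the intuition behind using joint fine-tuning to avoid catastrophic forgetting.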
Both quantitative and qualitative evaluations demonstrate LaVie achieves state-of-the-art performance in zero-shot text-to-video generation.
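The frame counts in the summary can be checked with a little arithmetic. The sketch below assumes that 4x interpolation keeps the original frames and inserts three new frames between each adjacent pair; under that assumption, a 61-frame output would correspond to a 16-frame base clip. The function names are hypothetical and the base clip length is inferred from this assumption, not stated in the summary.

```python
def interpolated_frame_count(num_frames, factor=4):
    """Frames after inserting (factor - 1) new frames between each adjacent pair."""
    if num_frames < 1:
        raise ValueError("need at least one input frame")
    return factor * (num_frames - 1) + 1

def base_frames_for(target_frames, factor=4):
    """Invert the relation above: input frames needed to reach `target_frames`."""
    gaps, remainder = divmod(target_frames - 1, factor)
    if target_frames < 1 or remainder:
        raise ValueError("target is not reachable with this interpolation factor")
    return gaps + 1

print(interpolated_frame_count(16))  # 61
print(base_frames_for(61))           # 16
```

Note that the output is not simply four times the input length: the final frame has no successor to interpolate toward, so 16 frames give 61 rather than 64.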
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (2023)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation (2023)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation (2023)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation (2023)
- Dual-Stream Diffusion Net for Text-to-Video Generation (2023)