LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Published on Sep 26, 2023 · Featured in Daily Papers on Sep 27, 2023
Xin Ma, Bo Dai, et al.
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.


Here is an ML-generated summary

The paper proposes LaVie, a cascaded latent diffusion model framework for high-quality text-to-video generation, by leveraging a pre-trained text-to-image model as initialization.

The key contributions are: 1) An efficient temporal module design using temporal self-attention and rotary positional encoding. 2) A joint image-video fine-tuning strategy to mitigate catastrophic forgetting. 3) A new text-video dataset Vimeo25M with 25M high-resolution videos.


  • Simple temporal self-attention coupled with rotary positional encoding effectively captures temporal correlations; more complex architectures provide only marginal gains.
  • Joint image-video fine-tuning plays a pivotal role in producing high-quality and creative results, whereas direct video-only fine-tuning leads to catastrophic forgetting.
  • Joint fine-tuning enables large-scale knowledge transfer from images to videos, including styles, scenes, and characters.
  • A high-quality dataset like Vimeo25M is critical for training high-fidelity T2V models.
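To make the first insight concrete, here is a minimal sketch of temporal self-attention with rotary positional encoding applied over the frame axis. This is an illustrative reconstruction, not the paper's code: the module name, head count, and the convention of flattening spatial positions into the batch dimension are assumptions; only the overall idea (attention restricted to the temporal axis, with RoPE on queries and keys) comes from the summary above.

```python
import torch
import torch.nn as nn


def rotary_tables(seq_len, dim):
    # Standard RoPE frequency tables; dim is the (even) per-head dimension.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, dim/2)
    return angles.cos(), angles.sin()


def apply_rope(x, cos, sin):
    # Rotate each even/odd feature pair by a position-dependent angle.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class TemporalSelfAttention(nn.Module):
    """Attention over the frame axis only; each spatial position attends
    independently (spatial locations are folded into the batch dimension)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch * spatial_positions, frames, dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda h: h.view(b, t, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)          # (b, heads, t, head_dim)
        cos, sin = rotary_tables(t, self.head_dim)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                  # (b, heads, t, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))
```

Because attention runs only along the (short) frame axis, the added cost over the frozen spatial layers stays small, which matches the summary's claim that this simple design suffices.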


  • Base T2V model initialized from pre-trained Stable Diffusion and adapted via pseudo-3D convolutions and spatio-temporal transformers.
  • Temporal interpolation model trained to increase the frame rate 4x; it takes the base video as input and outputs 61 interpolated frames.
  • Video super-resolution model fine-tuned to increase spatial resolution to 1280×2048, using an image super-resolution model as initialization.
  • Joint image-video fine-tuning utilized during training to enable knowledge transfer.
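The shape bookkeeping of the three-stage cascade can be sketched as follows. The 4x interpolation factor, the 61-frame output, and the 1280×2048 super-resolution target come from the summary above; the base frame count and base resolution are assumptions chosen so the numbers are consistent (inserting frames between consecutive pairs while keeping both endpoints gives (n - 1) * factor + 1 frames, and 16 frames yields exactly 61).

```python
# Hypothetical bookkeeping for the three-stage cascade; values marked
# "assumption" are not stated in the summary above.
BASE_FRAMES = 16            # assumption: frame count emitted by the base T2V model
BASE_RES = (320, 512)       # assumption: base spatial resolution (H, W)
INTERP_FACTOR = 4           # from the summary: 4x frame-rate increase
SR_RES = (1280, 2048)       # from the summary: super-resolution target (H, W)


def interpolated_frames(n_frames: int, factor: int) -> int:
    """Frames after inserting (factor - 1) new frames between each
    consecutive pair while keeping both endpoints."""
    return (n_frames - 1) * factor + 1


def cascade_shapes():
    """Return (frames, height, width) after each stage of the cascade."""
    base = (BASE_FRAMES, *BASE_RES)
    interp = (interpolated_frames(BASE_FRAMES, INTERP_FACTOR), *BASE_RES)
    sr = (interp[0], *SR_RES)
    return base, interp, sr
```

Under these assumptions the pipeline goes (16, 320, 512) → (61, 320, 512) → (61, 1280, 2048), i.e. the interpolation stage densifies time and the final stage upsamples space only.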

Both quantitative and qualitative evaluations demonstrate LaVie achieves state-of-the-art performance in zero-shot text-to-video generation.

