arxiv:2309.15103

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Published on Sep 26, 2023

Submitted by akhaliq on Sep 27, 2023

#2 Paper of the day

Authors:

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Bo Dai, Dahua Lin, Ziwei Liu

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.

Community

osanseviero

Sep 27, 2023

Here is an ML-generated summary

Objective
The paper proposes LaVie, a framework of cascaded latent diffusion models for high-quality text-to-video generation that uses a pre-trained text-to-image model as initialization.

The key contributions are: 1) An efficient temporal module design that uses temporal self-attention with rotary positional encoding. 2) A joint image-video fine-tuning strategy that mitigates catastrophic forgetting. 3) A new dataset, Vimeo25M, with 25 million text-video pairs selected for quality, diversity, and aesthetic appeal.

Insights

Simple temporal self-attention, coupled with rotary positional encoding, captures temporal correlations effectively. More complex architectures provide only marginal gains.
Joint image-video fine-tuning plays a pivotal role in producing high-quality and creative results. Fine-tuning on video alone leads to catastrophic forgetting.
Joint fine-tuning enables large-scale knowledge transfer from images to videos, including styles, scenes, and characters.
A high-quality dataset such as Vimeo25M is critical for training high-fidelity T2V models.
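
To make the first insight concrete, here is a minimal sketch of temporal self-attention with rotary positional encoding, applied along the frame axis for a single spatial location. It is an illustration under simplifying assumptions, not the paper's implementation: the function names `rope` and `temporal_self_attention` are hypothetical, there is a single head, and the learned query/key/value projections are omitted.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`.

    `pos` is the frame index; `base` is the usual RoPE frequency base (an assumption here).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower channels rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def temporal_self_attention(frames):
    """Attend across time for ONE spatial location.

    frames: list of T feature vectors of even dimension d.
    Queries and keys are the RoPE-rotated features; values are the raw
    features (learned projections are left out to keep the sketch short).
    """
    d = len(frames[0])
    qk = [rope(f, t) for t, f in enumerate(frames)]
    out = []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in qk]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(w[t] * frames[t][j] for t in range(len(frames)))
                    for j in range(d)])
    return out
```

Because each channel pair is rotated by an angle proportional to the frame index, the query-key dot product depends only on the offset between two frames, not on their absolute positions. That relative-position property is the usual reason for pairing rotary encodings with attention.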

Implementation

The base T2V model is initialized from pre-trained Stable Diffusion and adapted to video with pseudo-3D convolutions and spatio-temporal transformers.
The temporal interpolation model is trained to increase the frame rate 4x. It takes the base video as input and outputs 61 interpolated frames.
The video super-resolution model is fine-tuned to increase the spatial resolution to 1280x2048. It uses an image super-resolution model as initialization.
Joint image-video fine-tuning is used during training to enable knowledge transfer from images to videos.
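
The frame and resolution bookkeeping of the three-stage cascade can be sketched as follows. The 61-frame output and the 1280x2048 target come from the summary above. The 16-frame base clip and the 320x512 base resolution are assumptions not stated in it: 16 is the input length consistent with 61 output frames if the interpolation model inserts three new frames between each adjacent pair. The helper names are hypothetical.

```python
def interpolated_frame_count(n_frames: int, factor: int = 4) -> int:
    """Frames after inserting (factor - 1) new frames between each adjacent pair."""
    return factor * (n_frames - 1) + 1

def cascade_shapes(base_frames=16, base_hw=(320, 512), sr_hw=(1280, 2048)):
    """Return (stage, frames, height, width) for each stage of the cascade.

    base_frames and base_hw are assumed values; sr_hw is the target
    resolution quoted in the summary.
    """
    interp_frames = interpolated_frame_count(base_frames)
    return [
        ("base T2V", base_frames, *base_hw),
        ("temporal interpolation", interp_frames, *base_hw),
        ("video super-resolution", interp_frames, *sr_hw),
    ]

for stage, t, h, w in cascade_shapes():
    print(f"{stage:24s} {t:3d} frames at {h}x{w}")
```

Under these assumptions the interpolation stage changes only the frame count and the super-resolution stage changes only the spatial size, so each model in the cascade addresses one axis of the output.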

Results
Quantitative and qualitative evaluations both indicate that LaVie achieves state-of-the-art performance in zero-shot text-to-video generation.