arxiv:2305.13077

ControlVideo: Training-free Controllable Text-to-Video Generation

Published on May 22, 2023

Featured in Daily Papers on May 23, 2023

Authors:

Yabo Zhang ,

Yuxiang Wei ,

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction in self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternating frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while maintaining holistic coherence. With these modules, ControlVideo outperforms state-of-the-art methods on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to its efficient design, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
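To make the first module more concrete, the following is a minimal sketch of cross-frame interaction in self-attention, written in plain Python. It is not the authors' implementation: the function names (`attend`, `per_frame_attention`, `full_cross_frame_attention`) are hypothetical, and the learned query, key, and value projections of a real attention layer are omitted, so each token vector serves as its own query, key, and value. The only point illustrated is that keys and values are gathered from every frame of the clip rather than from the current frame alone.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[i] for w, v in zip(weights, values))
                for i in range(len(values[0]))
            ]
        )
    return outputs


def per_frame_attention(frames):
    """Baseline: tokens of a frame attend only to tokens of the same frame."""
    return [attend(frame, frame, frame) for frame in frames]


def full_cross_frame_attention(frames):
    """Tokens of each frame attend to the tokens of every frame in the clip.

    Assumption: projections are omitted, so tokens act as Q, K, and V directly.
    """
    all_tokens = [token for frame in frames for token in frame]
    return [attend(frame, all_tokens, all_tokens) for frame in frames]
```

Calling `full_cross_frame_attention` on a list of frames, each a list of token vectors, returns one output vector per token, mixed with information from all frames. With `per_frame_attention`, frames never exchange information, which is one way appearance can drift from frame to frame.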

Community

jirohn

Jun 1, 2023

The monkey in a spacesuit leaps off a towering cliff, gracefully diving into a pristine pool of crystal-clear water. The ripples and reflections on the water's surface come alive, vividly capturing the scene in flawless 4K resolution, high definition, ultra-realistic visuals.

elaheatri

Nov 16, 2023

medium full shot, 3D fantasy animated character of a young man, teenage clothes, hand in pocket, fantasy atmosphere, high detail, cinematic lighting, 8K, blurred background, bokeh lighting
