arxiv:2311.10709

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

Published on Nov 17, 2023
· Featured in Daily Papers on Nov 20, 2023
Authors:

Abstract

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training--that enable us to directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work--81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
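
The factorized approach described above can be read as a two-stage pipeline: a text-to-image diffusion model produces a starting frame, and a second diffusion model, conditioned on both the prompt and that frame, produces the video. The sketch below is a minimal illustration of that control flow only; every class and function name in it (TextToImageModel, ImageToVideoModel, text_to_video) is a hypothetical placeholder rather than the authors' released code, and the diffusion samplers themselves are stubbed out.

```python
# Hypothetical sketch of the factorized text-to-video pipeline described in the
# abstract. All names here are illustrative placeholders, not the paper's code
# or any real library API; the diffusion steps are replaced by stubs.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Stand-in for an image tensor (e.g. H x W x 3)."""
    data: bytes = b""


class TextToImageModel:
    """Step 1: generate a single image conditioned on the text prompt."""

    def generate(self, prompt: str) -> Frame:
        # A real implementation would run a text-conditioned diffusion
        # sampler here; a dummy frame is returned so the sketch executes.
        return Frame()


class ImageToVideoModel:
    """Step 2: generate a clip conditioned on the text prompt and the
    image from step 1."""

    def generate(self, prompt: str, first_frame: Frame,
                 num_frames: int = 16) -> List[Frame]:
        # The paper attributes direct high-resolution generation to an
        # adjusted diffusion noise schedule and multi-stage training;
        # those details are omitted in this stub.
        return [first_frame] * num_frames


def text_to_video(prompt: str) -> List[Frame]:
    """Factorized generation: text -> image, then (text, image) -> video."""
    image = TextToImageModel().generate(prompt)
    return ImageToVideoModel().generate(prompt, image)


if __name__ == "__main__":
    clip = text_to_video("a corgi surfing a wave at sunset")
    print(f"Generated {len(clip)} frames")
```

Because the second stage only needs a prompt and a starting frame, the same interface also covers the image-animation use case mentioned at the end of the abstract: a user-supplied image can be passed in place of a generated one.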

Community


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2311.10709 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2311.10709 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2311.10709 in a Space README.md to link it from this page.

Collections including this paper 11