Abstract
Diffusion-based video generation has received extensive attention and achieved considerable success in both the academic and industrial communities. However, current efforts mainly concentrate on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of the two. This falls short of real-world application scenarios, where users may supply image and text conditions flexibly, either individually or in combination. To address this, we propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom and classify them into high-freedom and low-freedom categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise in place of pure random Gaussian noise, which better preserves the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.
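The abstract does not spell out how Biased Gaussian Noise is constructed, but a common way to bias a diffusion sampler's starting noise toward an input condition is to forward-diffuse the condition latent to the final timestep instead of sampling pure N(0, I). The sketch below illustrates that idea; the function name, the `strength` parameter, and the interpolation are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def biased_gaussian_noise(x_cond: torch.Tensor,
                          alpha_bar_T: float,
                          strength: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch: start sampling from noise biased toward the
    condition latent x_cond rather than from pure Gaussian noise."""
    eps = torch.randn_like(x_cond)
    # Forward-diffuse the condition latent to the final timestep T, so the
    # starting noise still encodes the input content (alpha_bar_T is the
    # cumulative noise-schedule product at T).
    biased = (alpha_bar_T ** 0.5) * x_cond + ((1.0 - alpha_bar_T) ** 0.5) * eps
    # Interpolate between pure noise (strength=0) and the fully biased
    # start (strength=1) to trade content preservation against freedom.
    return strength * biased + (1.0 - strength) * eps
```

With `strength=0` this reduces to the usual pure-Gaussian initialization, matching the high-freedom setting; larger values keep more of the input's low-frequency content, which is the behavior the abstract attributes to low-freedom generation.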
Community
car
Generate a video of a husky walking on the beach.
No matter how beautiful the scenery is, if you have passed by, you must leave.
How to evaluate this video?
Hello
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance (2023)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (2024)
- VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model (2023)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation (2023)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (2023)
Frame 7: Encounter with colleague Ricky
Description: The rescue boat arrives, and the protagonist and colleague Ricky hug and exchange greetings, feeling overwhelmed.