Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Abstract
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control requires 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This design allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using fewer than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
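To make the idea of a 3D tracking video as a control signal concrete, the sketch below renders persistent, consistently colored 3D track points into each frame under a given camera path; such a rendering could then be fed to the diffusion model as a conditioning video. This is only a minimal illustration under assumed conventions (pinhole projection, per-frame world-to-camera matrices, naive point splatting), not the paper's actual pipeline, and all function and variable names are hypothetical.

```python
import numpy as np

def render_tracking_video(points_3d, colors, extrinsics, K, hw=(480, 720)):
    """Illustrative sketch: project persistent 3D track points into each frame.

    points_3d:  (N, 3) tracked 3D points in world coordinates
    colors:     (N, 3) uint8 color per point, fixed across frames so the
                same point stays identifiable over time
    extrinsics: list of (4, 4) world-to-camera matrices, one per frame
    K:          (3, 3) pinhole intrinsics
    """
    H, W = hw
    frames = []
    for w2c in extrinsics:
        # Transform points into camera coordinates.
        pts_h = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
        cam = (w2c @ pts_h.T).T[:, :3]
        # Perspective projection to pixel coordinates.
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]
        frame = np.zeros((H, W, 3), dtype=np.uint8)
        valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        u, v = uv[valid].astype(int).T
        frame[v, u] = colors[valid]          # splat colored track points
        frames.append(frame)
    return np.stack(frames)                  # (T, H, W, 3) conditioning video
```

Under this framing, camera control amounts to re-rendering the same 3D points with a new `extrinsics` trajectory, while object manipulation or motion transfer corresponds to editing the point trajectories themselves before rendering, which matches the abstract's claim that diverse controls reduce to manipulating the 3D tracking video.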
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Trajectory Attention for Fine-grained Video Motion Control (2024)
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control (2024)
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation (2024)
- OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation (2024)
- Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training (2024)
- VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024)
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (2024)