VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while only trained with image-level annotations. Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength for layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
Generating coherent videos that span multiple scenes from a text description is still hard for current AI models. Short clips are easy to produce, but transitioning smoothly across diverse events while maintaining continuity is the difficult part.
In this paper from UNC Chapel Hill, the authors propose VideoDirectorGPT, a two-stage framework that attempts to address multi-scene video generation.
Here are my highlights from the paper:
- Two-stage approach: a language model (GPT-4) generates a detailed "video plan", then a video generation module renders the scenes based on that plan
- The video plan contains multi-scene descriptions, entities with their layouts, a background for each scene, and consistency groupings of entities and backgrounds; it guides the downstream video generation
- The video generation module, Layout2Vid, is trained only with image-level annotations and adds spatial layout control and cross-scene consistency to an existing text-to-video model
- Experiments show improved layout and movement control in single-scene videos compared to baselines
- Multi-scene videos show higher object consistency across scenes compared to baselines
- Open-domain single-scene video generation performance stays competitive
The key innovation seems to be using a large language model to write a detailed video plan that guides the overall video generation. The video generator, Layout2Vid, then uses that plan to get explicit control over spatial layouts and to keep entities and backgrounds consistent across scenes.
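To make the "video plan" idea more concrete, here is a minimal Python sketch of what such a plan could look like as a data structure. This is my own illustration and not the paper's actual format: the class names (`EntityLayout`, `Scene`, `VideoPlan`), the helper functions, the normalized `(x0, y0, x1, y1)` box convention, and the one-box-per-entity-per-scene simplification are all assumptions. The consistency grouping here simply records which scenes share an entity name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityLayout:
    """One entity and its bounding box in a scene (hypothetical format)."""
    name: str
    # Assumed convention: normalized (x0, y0, x1, y1), each in [0, 1].
    box: Tuple[float, float, float, float]


@dataclass
class Scene:
    """A single scene: text description, background, and entity layouts."""
    description: str
    background: str
    entities: List[EntityLayout] = field(default_factory=list)


@dataclass
class VideoPlan:
    """A multi-scene plan plus consistency groupings (illustrative only)."""
    scenes: List[Scene]
    # Entity name -> indices of scenes where it should look the same.
    consistency_groups: Dict[str, List[int]] = field(default_factory=dict)


def group_entities(scenes: List[Scene]) -> Dict[str, List[int]]:
    """Group scene indices by entity name, in order of appearance."""
    groups: Dict[str, List[int]] = {}
    for idx, scene in enumerate(scenes):
        for entity in scene.entities:
            groups.setdefault(entity.name, []).append(idx)
    return groups


def validate_plan(plan: VideoPlan) -> bool:
    """Check that every box is well-formed and lies inside the frame."""
    for scene in plan.scenes:
        for entity in scene.entities:
            x0, y0, x1, y1 = entity.box
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                return False
    return True


# Made-up two-scene example, purely for illustration.
scenes = [
    Scene("A dog runs across a park", "park",
          [EntityLayout("dog", (0.1, 0.5, 0.4, 0.9))]),
    Scene("The dog sleeps on a sofa", "living room",
          [EntityLayout("dog", (0.3, 0.4, 0.8, 0.8)),
           EntityLayout("sofa", (0.1, 0.3, 0.9, 0.95))]),
]
plan = VideoPlan(scenes=scenes, consistency_groups=group_entities(scenes))
```

In this sketch, the "dog" entry in `consistency_groups` lists both scenes, which is the kind of signal a downstream generator could use to keep the same dog appearance across scenes. How Layout2Vid actually consumes the plan is described in the paper, not here.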
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation (2023)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization (2023)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (2023)
- Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models (2023)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation (2023)