LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to videos only 10-20 seconds long. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity self-attention block with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency-preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (75.6% win rate) while reducing FLOPs by up to 15× and latency by up to 11.5×. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (50.5%, 52.1%, and 49.1% win rates against Gen-3, LumaLabs, and Kling, respectively). This paves the way for hour-length movie generation and real-time interactive video generation. We provide 68-second video generation results and more examples on our project website: https://lineargen.github.io/.
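To make the two-branch design concrete, here is a minimal PyTorch sketch of a MATE-style block. Everything below is an illustrative assumption rather than the authors' implementation: a GRU stands in for the bidirectional Mamba2 block (both scan the sequence in linear time), Rotary Major Scan and review tokens are omitted for brevity, and all module names, window sizes, and shapes are hypothetical.

```python
import torch
import torch.nn as nn


class BidirectionalMixer(nn.Module):
    """MA-branch stand-in. A GRU is used here as a placeholder for the
    bidirectional Mamba2 block; both process the sequence in linear time."""

    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):  # x: (B, N, C), tokens flattened in (T, H, W) order
        out_f, _ = self.fwd(x)
        out_b, _ = self.bwd(torch.flip(x, dims=[1]))
        return out_f + torch.flip(out_b, dims=[1])


class TemporalWindowAttention(nn.Module):
    """TE-branch stand-in: attention restricted to short non-overlapping
    temporal windows at each spatial location, so cost stays linear in
    video length."""

    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, T, H, W):  # x: (B, T*H*W, C)
        B, _, C = x.shape
        x = x.view(B, T, H * W, C)
        pad = (-T) % self.window  # pad time axis to a multiple of the window
        if pad:
            x = torch.cat([x, x[:, -1:].expand(B, pad, H * W, C)], dim=1)
        Tp = x.shape[1]
        # group frames into windows; attend within each window per location
        x = x.view(B, Tp // self.window, self.window, H * W, C)
        x = x.permute(0, 1, 3, 2, 4).reshape(-1, self.window, C)
        x, _ = self.attn(x, x, x)
        x = x.view(B, Tp // self.window, H * W, self.window, C)
        x = x.permute(0, 1, 3, 2, 4).reshape(B, Tp, H * W, C)[:, :T]
        return x.reshape(B, T * H * W, C)


class MATEBlock(nn.Module):
    """Drop-in replacement for quadratic self-attention: the input plus
    the outputs of both linear-complexity branches."""

    def __init__(self, dim):
        super().__init__()
        self.ma = BidirectionalMixer(dim)
        self.te = TemporalWindowAttention(dim)

    def forward(self, x, T, H, W):
        return x + self.ma(x) + self.te(x, T, H, W)


blk = MATEBlock(64)
tokens = torch.randn(2, 8 * 16 * 16, 64)  # 8 latent frames of 16x16 tokens
out = blk(tokens, T=8, H=16, W=16)        # same shape: (2, 2048, 64)
```

The property the sketch preserves is that both branches cost O(N) in the token count N: the mixer scans the sequence once in each direction, and attention is confined to fixed-size temporal windows.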
Community
🎉Excited to introduce our latest work, LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity!
✨For the first time, we demonstrate high-resolution 68-second video generation at 16 fps on a single GPU, without relying on autoregressive extensions, super-resolution, or frame interpolation.
🚀Our approach achieves linear computational complexity, offering up to a 15× speed-up over the standard DiT architecture while delivering improved video quality and better text alignment. We believe this linear complexity provides extraordinary scalability, paving the way to hour-length movie generation (a back-of-the-envelope scaling sketch follows the links below).
Paper: https://arxiv.org/pdf/2412.09856
Project website: https://lineargen.github.io
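To see why linear complexity matters for long videos, here is a back-of-the-envelope comparison. All constants and token counts below are hypothetical, not measurements from the paper: the point is only that quadrupling the token count multiplies self-attention FLOPs by 16 but a linear-complexity mixer's FLOPs by only 4.

```python
# Illustrative scaling only; numbers are hypothetical, not from the paper.
def attn_flops(n, d):
    return n * n * d      # self-attention: quadratic in token count n

def linear_flops(n, d):
    return n * d * d      # SSM-style mixer: linear in token count n

d = 2048                                  # hypothetical hidden width
n_short, n_long = 100_000, 400_000        # e.g. tokens for a 4x longer video
print(attn_flops(n_long, d) / attn_flops(n_short, d))      # 16.0x cost growth
print(linear_flops(n_long, d) / linear_flops(n_short, d))  # 4.0x cost growth
```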
Will you open-source the code and model?
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning (2024)
- Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis (2024)
- From Slow Bidirectional to Fast Causal Video Generators (2024)
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation (2024)
- HunyuanVideo: A Systematic Framework For Large Video Generative Models (2024)
- Multimodal Instruction Tuning with Hybrid State Space Models (2024)
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models (2024)