Pusa VidGen
Code Repository | Model Hub | Training Toolkit | Dataset
Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from the conventional approach of sharing a single scalar timestep across the whole clip. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining exceptional motion fidelity and prompt adherence through our refined adaptations of the base model. Pusa-V0.5 is an early preview built on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, improve the methodology, and expand its capabilities.
Key Features
Comprehensive Multi-task Support (see the illustrative sketch after this list):
- Text-to-Video generation
- Image-to-Video transformation
- Frame interpolation
- Video transitions
- Seamless looping
- Extended video generation
- And more...
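All of these tasks follow from the same mechanism: frame-level noise control decides which frames act as clean conditioning frames and which are generated. The sketch below is purely illustrative and assumes nothing about the repository's actual API; the function name, task labels, and the 0-to-1 noise-level convention are invented for exposition.

```python
# Illustrative only (not the actual Pusa API): expressing different tasks as
# per-frame noise levels, where 0.0 means "keep this frame clean / condition on it"
# and 1.0 means "generate this frame from pure noise".

def task_noise_levels(task: str, num_frames: int) -> list[float]:
    """Return a hypothetical per-frame noise-level vector for a given task."""
    levels = [1.0] * num_frames              # text-to-video: every frame is generated
    if task == "image-to-video":
        levels[0] = 0.0                      # condition on the given first frame
    elif task == "frame-interpolation":
        levels[0] = levels[-1] = 0.0         # condition on the first and last frames
    elif task == "video-extension":
        for i in range(num_frames // 2):     # condition on an existing prefix of frames
            levels[i] = 0.0
    elif task == "seamless-looping":
        levels[0] = levels[-1] = 0.0         # shared endpoints encourage a loop
    return levels

print(task_noise_levels("image-to-video", num_frames=8))
# [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```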
Unprecedented Efficiency:
- Trained with only about 100 (0.1k) H800 GPU hours
- Total training cost: roughly $100
- Hardware: 16 H800 GPUs
- Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
- Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
Complete Open-Source Release:
- Full codebase
- Detailed architecture specifications
- Comprehensive training methodology
Unique Architecture
Novel Diffusion Paradigm: Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability.
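As a concrete (and deliberately simplified) illustration of vectorized timesteps, the sketch below noises a video latent with one noise level per frame instead of a single scalar shared by the whole clip. The tensor shapes, the linear interpolation schedule, and all names are assumptions for exposition, not the actual Pusa/FVDM implementation.

```python
import torch

def add_noise(latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Noise a video latent with a per-frame noise level.

    latents: (frames, channels, height, width)
    t:       per-frame noise levels in [0, 1], shape (frames,)
    """
    noise = torch.randn_like(latents)
    t = t.view(-1, 1, 1, 1)                    # broadcast each frame's level over C, H, W
    return (1.0 - t) * latents + t * noise     # simple linear (flow-style) interpolation

latents = torch.randn(8, 4, 60, 106)           # a hypothetical 8-frame video latent

# Conventional diffusion: one scalar timestep shared by all frames.
scalar_t = torch.full((8,), 0.7)

# Frame-level noise control: frame 0 stays clean (image-to-video style conditioning),
# while the remaining frames are noised and left for the model to denoise.
vector_t = torch.tensor([0.0, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])

noisy_conventional = add_noise(latents, scalar_t)
noisy_frame_level = add_noise(latents, vector_t)
```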
Non-destructive Modification: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only light fine-tuning is needed.
Universal Applicability: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. Collaborations enthusiastically welcomed!
Installation and Usage
Download Weights
Option 1: Use the Hugging Face CLI:

```shell
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```
Option 2: Download directly from Hugging Face to your local machine.
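Alternatively, the same repository can be fetched programmatically with the `huggingface_hub` Python API; a minimal sketch follows (the local directory is just a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the Pusa-V0.5 weights; local_dir is a placeholder path.
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",
)
print(f"Weights downloaded to: {local_dir}")
```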
Limitations
Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities
Related Work
- FVDM: Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
- Mochi: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```