Pusa VidGen
Code Repository | Model Hub | Training Toolkit | Dataset
Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from the conventional approach of sharing a single scalar timestep across the whole clip. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining exceptional motion fidelity and prompt adherence through our refined adaptations of the base model. Pusa-V0.5 is an early preview built on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, improve the methodology, and expand its capabilities.
Key Features
Comprehensive Multi-task Support (see the illustrative sketch after this list):
- Text-to-Video generation
- Image-to-Video transformation
- Frame interpolation
- Video transitions
- Seamless looping
- Extended video generation
- And more...
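All of these tasks follow from the same mechanism: frame-level noise control decides which frames act as clean conditioning frames and which are generated. The sketch below is purely illustrative and assumes nothing about the repository's actual API; the function name, task labels, and the 0-to-1 noise-level convention are invented for exposition.

```python
# Illustrative only (not the actual Pusa API): expressing different tasks as
# per-frame noise levels, where 0.0 means "keep this frame clean / condition on it"
# and 1.0 means "generate this frame from pure noise".

def task_noise_levels(task: str, num_frames: int) -> list[float]:
    """Return a hypothetical per-frame noise-level vector for a given task."""
    levels = [1.0] * num_frames              # text-to-video: every frame is generated
    if task == "image-to-video":
        levels[0] = 0.0                      # condition on the given first frame
    elif task == "frame-interpolation":
        levels[0] = levels[-1] = 0.0         # condition on the first and last frames
    elif task == "video-extension":
        for i in range(num_frames // 2):     # condition on an existing prefix of frames
            levels[i] = 0.0
    elif task == "seamless-looping":
        levels[0] = levels[-1] = 0.0         # shared endpoints encourage a loop
    return levels

print(task_noise_levels("image-to-video", num_frames=8))
# [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```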
Unprecedented Efficiency:
- Trained with only about 100 (0.1k) H800 GPU hours
- Total training cost: roughly $100
- Hardware: 16 H800 GPUs
- Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
- Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
Complete Open-Source Release:
- Full codebase
- Detailed architecture specifications
- Comprehensive training methodology
Unique Architecture
Novel Diffusion Paradigm: Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability.
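As a concrete (and deliberately simplified) illustration of vectorized timesteps, the sketch below noises a video latent with one noise level per frame instead of a single scalar shared by the whole clip. The tensor shapes, the linear interpolation schedule, and all names are assumptions for exposition, not the actual Pusa/FVDM implementation.

```python
import torch

def add_noise(latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Noise a video latent with a per-frame noise level.

    latents: (frames, channels, height, width)
    t:       per-frame noise levels in [0, 1], shape (frames,)
    """
    noise = torch.randn_like(latents)
    t = t.view(-1, 1, 1, 1)                    # broadcast each frame's level over C, H, W
    return (1.0 - t) * latents + t * noise     # simple linear (flow-style) interpolation

latents = torch.randn(8, 4, 60, 106)           # a hypothetical 8-frame video latent

# Conventional diffusion: one scalar timestep shared by all frames.
scalar_t = torch.full((8,), 0.7)

# Frame-level noise control: frame 0 stays clean (image-to-video style conditioning),
# while the remaining frames are noised and left for the model to denoise.
vector_t = torch.tensor([0.0, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])

noisy_conventional = add_noise(latents, scalar_t)
noisy_frame_level = add_noise(latents, vector_t)
```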
Non-destructive Modification: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only light fine-tuning is needed.
Universal Applicability: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. Collaborations enthusiastically welcomed!
Installation and Usage
Download Weights
Option 1: Use the Hugging Face CLI:

```shell
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```
Option 2: Download directly from Hugging Face to your local machine.
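Alternatively, the same repository can be fetched programmatically with the `huggingface_hub` Python API; a minimal sketch follows (the local directory is just a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the Pusa-V0.5 weights; local_dir is a placeholder path.
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",
)
print(f"Weights downloaded to: {local_dir}")
```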
Limitations
Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities
Related Work
- FVDM: Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
- Mochi: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```