
Pusa VidGen

Code Repository | Model Hub | Training Toolkit | Dataset

Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, departing from the conventional single timestep shared by all frames. The approach was first presented in our FVDM paper. Leveraging this architecture, Pusa supports diverse video generation tasks (Text-to-Video, Image-to-Video, Video-to-Video) while maintaining strong motion fidelity and prompt adherence through our lightweight adaptations of the base model. Pusa-V0.5 is an early preview built on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, refine the methodology, and expand its capabilities.
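
To make the idea concrete, here is a minimal sketch of the difference between a conventional scalar timestep and a vectorized, per-frame one. This is not Pusa's actual API; the latent shape and the 1000-step schedule are illustrative assumptions.

import torch

# Conventional video diffusion: one scalar timestep shared by every frame.
# Frame-level noise control (FVDM/Pusa): a vector of timesteps, one per frame,
# so different frames can sit at different noise levels in the same step.

num_frames, channels, height, width = 16, 4, 60, 106  # hypothetical latent shape
latents = torch.randn(num_frames, channels, height, width)

t_scalar = torch.tensor(999)                      # conventional: one t for all frames
t_vector = torch.randint(0, 1000, (num_frames,))  # vectorized: one t per frame

# A denoiser adapted for frame-level control would consume the whole vector, e.g.:
# noise_pred = model(latents, t_vector, text_embeddings)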

✨ Key Features

  • Comprehensive Multi-task Support (see the timestep-assignment sketch after this feature list):

    • Text-to-Video generation
    • Image-to-Video transformation
    • Frame interpolation
    • Video transitions
    • Seamless looping
    • Extended video generation
    • And more...
  • Unprecedented Efficiency:

    • Trained with only about 100 (0.1k) H800 GPU hours
    • Total training cost: roughly $100
    • Hardware: 16 H800 GPUs
    • Configuration: batch size 32, 500 training iterations, learning rate 1e-5
    • Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
  • Complete Open-Source Release:

    • Full codebase
    • Detailed architecture specifications
    • Comprehensive training methodology
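
How one model covers these tasks: with per-frame timesteps, conditioning frames can simply be held at a clean (or low-noise) timestep while the remaining frames are denoised. The sketch below is our reading of that idea, not Pusa's actual interface; T = 1000 and the frame counts are illustrative assumptions.

import torch

T, num_frames = 1000, 16  # assumed schedule length and clip length

# Text-to-Video: every frame starts fully noised.
t2v = torch.full((num_frames,), T - 1)

# Image-to-Video: hold the conditioning frame clean (t = 0), denoise the rest.
i2v = t2v.clone()
i2v[0] = 0

# Frame interpolation / transitions: pin both endpoint frames.
interp = t2v.clone()
interp[0], interp[-1] = 0, 0

# Extended generation: treat the tail of a previous clip as clean context.
extend = t2v.clone()
extend[:4] = 0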

πŸ” Unique Architecture

  • Novel Diffusion Paradigm: Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability.

  • Non-destructive Modification: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after adaptation, only light fine-tuning is required.

  • Universal Applicability: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. Collaborations enthusiastically welcomed!

Installation and Usage

Download Weights

Option 1: Use the Hugging Face CLI:

# Install the Hugging Face Hub CLI, then fetch the weights
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>

Option 2: Download directly from Hugging Face to your local machine.
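
If you prefer to script the download, the huggingface_hub Python API provides snapshot_download. A minimal sketch; the local path is a placeholder:

from huggingface_hub import snapshot_download

# Fetch the full Pusa-V0.5 repository snapshot into a local directory.
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",  # placeholder; change as needed
)
print(f"Weights downloaded to {local_dir}")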

Limitations

Pusa currently has several known limitations:

  • The base Mochi model generates videos at relatively low resolution (480p)
  • We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
  • We welcome community contributions to enhance model performance and extend its capabilities

Related Work

  • FVDM: Introduces the groundbreaking frame-level noise control with vectorized timesteps that inspired Pusa.
  • Mochi: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.

Citation

If you find our work useful in your research, please consider citing:

@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Liu, Yaofang and Liu, Rui},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}