THUDM/CogVideoX-5b · CogVideoX1.5-5B-I2V Windows and Cloud Tutorial - 1-Click To Install With all Optimizations on Windows Python 3.11 VENV

Tutorial - Guide

This is not an issue thread. Thank you so much for this amazing model.

Full YouTube Tutorial: A step-by-step guide to using CogVideoX1.5-5B-I2V can be found here:

https://youtu.be/5UCkMzP2VLE

Best Open Source Image to Video Generator CogVideoX1.5-5B-I2V Step by Step Windows & Cloud Tutorial

Video Tutorial and Installation Guides:

1-Click Installers

For streamlined setup, I’ve created 1-Click installers for Windows, RunPod, and Massed Compute environments.

These are available at: https://www.patreon.com/posts/112848192

Note: These installers set up the model within a Python 3.11 virtual environment (VENV).

Model Repositories and Prompts:

Official Hugging Face Repo: The official Hugging Face repository for CogVideoX1.5-5B-I2V is located at: https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V

Official GitHub Repo: You can find the official GitHub repository here: https://github.com/THUDM/CogVideo

Prompts Used: I used the following prompts (in a text file) to generate the example videos:
https://gist.github.com/FurkanGozukara/471db7b987ab8d9877790358c126ac05

Configuration and Optimizations:

Video Settings: I generated videos from 1360x768 px input images at 16 FPS with 81 frames (about 5 seconds per video, counting the initial frame).
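The "approximately 5 seconds" figure is just the frame count divided by the frame rate. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip with num_frames frames shown at fps frames per second."""
    return num_frames / fps


# Settings from the post: 81 frames (counting the initial frame) at 16 FPS.
duration = clip_duration_seconds(81, 16)
print(f"{duration:.4f} s")  # 5.0625 s
```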

Enabled Optimizations: I utilized the following optimizations recommended on the Hugging Face page:

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
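For context, here is a rough sketch of where those three calls sit in a diffusers script. It is not the code from my installer. The resolution, frame count, and FPS come from the settings above; the function name, the bfloat16 dtype, the step count, and the guidance scale are assumptions for illustration.

```python
MODEL_ID = "THUDM/CogVideoX1.5-5B-I2V"


def generate_video(image_path: str, prompt: str, out_path: str = "output.mp4") -> str:
    """Generate one image-to-video clip with the memory optimizations enabled."""
    # Heavy imports stay inside the function so this file can be imported
    # on a machine without the GPU stack installed.
    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    # bfloat16 is an assumption here, not a setting stated in the post.
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    # The three optimizations listed above.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    frames = pipe(
        image=load_image(image_path),
        prompt=prompt,
        height=768,
        width=1360,
        num_frames=81,
        num_inference_steps=50,  # assumption: not stated in the post
        guidance_scale=6.0,      # assumption: not stated in the post
    ).frames[0]

    export_to_video(frames, out_path, fps=16)
    return out_path
```

Calling generate_video("input.jpg", "your prompt here") would write output.mp4 in the working directory; both file names are placeholders.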

Quantization: I used int8_weight_only quantization. Note that TorchAO and DeepSpeed are required. My automatic installer installs both into a Python 3.11 VENV and works perfectly on Windows.
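If you want to try the quantization by hand, something along these lines should be close. It is only a sketch, not the installer's code: which components to quantize is my assumption here (text encoder and transformer), the helper names are made up for illustration, and newer TorchAO releases expose the same option as Int8WeightOnlyConfig, so the import is tried both ways.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def load_int8_pipeline(model_id: str = "THUDM/CogVideoX1.5-5B-I2V"):
    """Load the I2V pipeline with int8 weight-only quantized components."""
    # deepspeed is checked because the post lists it as required;
    # this sketch does not call it directly.
    missing = missing_packages(["torch", "diffusers", "transformers", "torchao", "deepspeed"])
    if missing:
        raise RuntimeError(f"Missing packages: {', '.join(missing)}")

    import torch
    from diffusers import CogVideoXImageToVideoPipeline, CogVideoXTransformer3DModel
    from transformers import T5EncoderModel
    from torchao.quantization import quantize_

    try:  # newer TorchAO releases
        from torchao.quantization import Int8WeightOnlyConfig as int8_config
    except ImportError:  # older TorchAO releases
        from torchao.quantization import int8_weight_only as int8_config

    dtype = torch.bfloat16  # assumption: dtype is not stated in the post

    text_encoder = T5EncoderModel.from_pretrained(
        model_id, subfolder="text_encoder", torch_dtype=dtype
    )
    quantize_(text_encoder, int8_config())  # quantizes the weights in place

    transformer = CogVideoXTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=dtype
    )
    quantize_(transformer, int8_config())

    return CogVideoXImageToVideoPipeline.from_pretrained(
        model_id, text_encoder=text_encoder, transformer=transformer, torch_dtype=dtype
    )
```

The package check runs first so a missing dependency fails with a readable message instead of an import error halfway through loading.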

Audio Generation:

MMAudio Model: For adding audio to the generated videos, I used the MMAudio model: https://github.com/hkchengrex/MMAudio

MMAudio Installers: 1-Click installers for MMAudio (Windows, RunPod, Massed Compute) are available at: https://www.patreon.com/posts/117990364

Note: These installers use a Python 3.10 VENV.

Prompting MMAudio: I used simple prompts for audio generation. Be aware that MMAudio may struggle when the input video contains human figures. In such cases, consider using text-to-audio alternatives.

VRAM Usage Observations:

I tested CogVideoX1.5-5B-I2V at various resolutions and frame counts to measure VRAM usage. Here are some of my findings (note that GPUs with less VRAM might still work, albeit more slowly):

512x288 (41 frames): ~7700 MB
576x320 (41 frames): ~7900 MB
576x320 (81 frames): ~8850 MB
704x384 (81 frames): ~8950 MB
768x432 (81 frames): ~10600 MB
896x496 (81 frames): ~12050 MB
960x528 (81 frames): ~12850 MB
1024x576 (81 frames): ~13900 MB
1280x720 (81 frames): ~17950 MB
1360x768 (81 frames): ~19000 MB
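To pick a setting for a given GPU, the list above can be treated as a small lookup table. A quick sketch (the function name is illustrative, and the numbers are only my approximate measurements, so leave some headroom):

```python
# (width, height, frames, approx. VRAM in MB), copied from the list above.
VRAM_TABLE = [
    (512, 288, 41, 7700),
    (576, 320, 41, 7900),
    (576, 320, 81, 8850),
    (704, 384, 81, 8950),
    (768, 432, 81, 10600),
    (896, 496, 81, 12050),
    (960, 528, 81, 12850),
    (1024, 576, 81, 13900),
    (1280, 720, 81, 17950),
    (1360, 768, 81, 19000),
]


def largest_setting_within(budget_mb: int):
    """Return the most demanding measured setting that fits the budget, or None."""
    fitting = [row for row in VRAM_TABLE if row[3] <= budget_mb]
    return max(fitting, key=lambda row: row[3]) if fitting else None


print(largest_setting_within(12000))  # (768, 432, 81, 10600)
```

A budget below 7700 MB returns None, which only means I did not measure a setting that small, not that the model cannot run.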

Gradio App:

Our Gradio application is highly advanced and functions flawlessly.

Demo video made with CogVideoX1.5-5B-I2V (watch with sound on)