Diffusers

CogVideoX

CogVideoX is a large diffusion transformer model, available in 2B and 5B parameter sizes, designed to generate longer and more consistent videos from text. It uses a 3D causal variational autoencoder (VAE) to process video data more efficiently: compressing the video reduces sequence length (and the associated training compute) and helps prevent flickering in generated videos. An “expert” transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps capture motion and time accurately.
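To make the sequence-length reduction concrete, the sketch below counts the tokens the transformer would see for one clip. The compression factors (4x in time, 8x in height and width) and the 2x2 spatial patch size are assumptions based on the published CogVideoX configuration rather than anything stated on this page, and `latent_shape` and `token_count` are hypothetical helper names, not part of Diffusers.

```python
# Assumed values, not stated on this page: adjust them to the checkpoint's config.
TEMPORAL_FACTOR = 4  # assumed: first frame kept, then one latent frame per 4 frames
SPATIAL_FACTOR = 8   # assumed: height and width are each reduced 8x by the VAE
PATCH_SIZE = 2       # assumed: the transformer groups latents into 2x2 patches


def latent_shape(num_frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) after VAE compression."""
    latent_frames = (num_frames - 1) // TEMPORAL_FACTOR + 1
    return latent_frames, height // SPATIAL_FACTOR, width // SPATIAL_FACTOR


def token_count(num_frames: int, height: int, width: int) -> int:
    """Return the number of video tokens after patchifying the latents."""
    frames, latent_h, latent_w = latent_shape(num_frames, height, width)
    return frames * (latent_h // PATCH_SIZE) * (latent_w // PATCH_SIZE)
```

Under these assumptions, an 81-frame clip at 480x768 compresses to 21 latent frames of 60x96, which the transformer treats as a single sequence that 3D full attention spans.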

You can find all the original CogVideoX checkpoints under the CogVideoX collection.

Click on the CogVideoX models in the right sidebar for more examples of other video generation tasks.

The example below demonstrates how to generate a video optimized for memory or inference speed.

Notes

CogVideoX supports LoRAs with load_lora_weights().

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

# load LoRA weights and apply the adapter at 0.9 strength
pipeline.load_lora_weights("finetrainers/CogVideoX-1.5-crush-smol-v0", adapter_name="crush-lora")
pipeline.set_adapters("crush-lora", 0.9)

# model offloading: each component is moved to the GPU only while it runs,
# so the pipeline is not moved to "cuda" manually beforehand
pipeline.enable_model_cpu_offload()

prompt = """
PIKA_CRUSH A large metal cylinder is seen pressing down on a pile of Oreo cookies, flattening them as if they were under a hydraulic press.
"""
negative_prompt = "inconsistent motion, blurry motion, worse quality, degenerate outputs, deformed outputs"

video = pipeline(
    prompt=prompt, 
    negative_prompt=negative_prompt, 
    num_frames=81, 
    height=480,
    width=768,
    num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=16)

The text-to-video (T2V) checkpoints work best at a resolution of 1360x768 because that is the resolution they were pretrained on.
The image-to-video (I2V) checkpoints work with multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. Both height and width must be divisible by 16.
Both T2V and I2V checkpoints work best with 81 or 161 frames. It is recommended to export the generated video at 16 fps.
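The constraints in these notes can be checked before starting a long generation run. The following is a minimal sketch, not part of Diffusers: `check_generation_settings` and `clip_seconds` are hypothetical helper names, and the checks only encode the divisibility, frame-count, and frame-rate recommendations listed above.

```python
def check_generation_settings(height: int, width: int, num_frames: int, fps: int = 16) -> list[str]:
    """Return a list of messages for settings that fall outside the notes above."""
    problems = []
    if height % 16 != 0 or width % 16 != 0:
        problems.append("height and width must both be divisible by 16")
    if num_frames not in (81, 161):
        problems.append("81 or 161 frames are recommended")
    if fps != 16:
        problems.append("exporting at 16 fps is recommended")
    return problems


def clip_seconds(num_frames: int, fps: int = 16) -> float:
    """Return the playback length of the exported video in seconds."""
    return num_frames / fps
```

An empty list means the settings match the recommendations. At 16 fps, the two recommended frame counts correspond to clips of roughly 5 and 10 seconds.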

Refer to the table below to view memory usage when various memory-saving techniques are enabled.

method	memory usage (enabled)	memory usage (disabled)
enable_model_cpu_offload	19GB	33GB
enable_sequential_cpu_offload	<4GB	~33GB (very slow inference speed)
enable_tiling	11GB (with enable_model_cpu_offload)	---

CogVideoXPipeline

class diffusers.CogVideoXPipeline

__call__

encode_prompt

fuse_qkv_projections

unfuse_qkv_projections

CogVideoXImageToVideoPipeline

class diffusers.CogVideoXImageToVideoPipeline

__call__

encode_prompt

fuse_qkv_projections

unfuse_qkv_projections

CogVideoXVideoToVideoPipeline

class diffusers.CogVideoXVideoToVideoPipeline

__call__

encode_prompt

fuse_qkv_projections

unfuse_qkv_projections

CogVideoXFunControlPipeline

class diffusers.CogVideoXFunControlPipeline

__call__

encode_prompt

fuse_qkv_projections

unfuse_qkv_projections

CogVideoXPipelineOutput

class diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput
