Add quantization examples using torchao and quanto
Hey, I'm Aryan from the Diffusers team 👋
Congratulations on the release of CogVideoX-5B!
It would be great to showcase some examples of how quantized inference (`int8` and other datatypes) can be run to lower memory requirements using TorchAO and Quanto, especially since we mention it in the model card table. Feel free to modify the code/wording/URLs however you see fit. Could we do this for the Chinese README, CogVideoX-2B, and the CogVideo GitHub repo as well? Thanks!
README.md CHANGED
````diff
@@ -242,6 +242,61 @@ video = pipe(
 export_to_video(video, "output.mp4", fps=8)
 ```
 
+## Quantized Inference
+
+[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to lower CogVideoX's memory requirements. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with smaller VRAM! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference.
+
+```diff
+# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
+# Source and nightly installation is only required until the next release.
+
+import torch
+from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+from diffusers.utils import export_to_video
++ from transformers import T5EncoderModel
++ from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight
+
++ quantization = int8_weight_only
+
++ text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
++ quantize_(text_encoder, quantization())
+
++ transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
++ quantize_(transformer, quantization())
+
++ vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16)
++ quantize_(vae, quantization())
+
+# Create pipeline and run inference
+pipe = CogVideoXPipeline.from_pretrained(
+    "THUDM/CogVideoX-5b",
++    text_encoder=text_encoder,
++    transformer=transformer,
++    vae=vae,
+    torch_dtype=torch.bfloat16,
+)
+pipe.enable_model_cpu_offload()
+pipe.vae.enable_tiling()
+
+prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+
+video = pipe(
+    prompt=prompt,
+    num_videos_per_prompt=1,
+    num_inference_steps=50,
+    num_frames=49,
+    guidance_scale=6,
+    generator=torch.Generator(device="cuda").manual_seed(42),
+).frames[0]
+
+export_to_video(video, "output.mp4", fps=8)
+```
+
+Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:
+- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+
+
 ## Explore the Model
 
 Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
````
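The diff above only exercises the TorchAO path, even though the prose also mentions Optimum-quanto. For reference, here is a rough sketch of the equivalent module quantization with quanto; the `quantize`/`freeze`/`qint8` calls follow quanto's documented API, but this snippet is an illustration and not part of the PR.

```python
# Sketch (not part of the PR diff): int8 weight quantization of the CogVideoX
# transformer with Optimum-quanto instead of TorchAO.
import torch
from diffusers import CogVideoXTransformer3DModel
from optimum.quanto import freeze, qint8, quantize

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize(transformer, weights=qint8)  # mark the module's weights for int8 quantization
freeze(transformer)                   # replace the weights with their quantized versions
```

The same pattern applies to the text encoder and VAE before assembling the pipeline, mirroring the three `quantize_` calls in the diff.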
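The section also claims TorchAO quantization is fully compatible with `torch.compile`. A minimal sketch of combining the two is below; it assumes `pipe` was assembled from the TorchAO-quantized modules as in the diff, and that the whole pipeline fits in VRAM, since `enable_model_cpu_offload()` generally does not mix well with compilation.

```python
# Sketch: compile the TorchAO-quantized transformer for faster inference.
# Assumes `pipe` and `prompt` are defined as in the diff above and that
# everything fits on the GPU (no CPU offloading).
pipe.to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# The first call pays the one-time compilation cost; later calls run faster.
video = pipe(prompt=prompt, num_inference_steps=50, num_frames=49, guidance_scale=6).frames[0]
```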
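For the serialization point, the linked torchao gist is the authoritative reference; the sketch below only illustrates the general idea, assuming torchao's quantized tensor subclasses round-trip through `torch.save`/`torch.load` with `assign=True` on load.

```python
# Sketch: store a TorchAO-quantized transformer on disk to save space.
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# int8 weights take roughly half the disk space of the bf16 checkpoint.
torch.save(transformer.state_dict(), "cogvideox_transformer_int8.pt")

# Reload later: assign the quantized tensors directly onto the module.
state_dict = torch.load("cogvideox_transformer_int8.pt", weights_only=False)
transformer.load_state_dict(state_dict, assign=True)
```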