Load community pipelines and components

Community pipelines

A community pipeline is any DiffusionPipeline class that differs from the original paper implementation (for example, StableDiffusionControlNetPipeline corresponds to the Text-to-Image Generation with ControlNet Conditioning paper). Community pipelines provide additional functionality or extend the original implementation of a pipeline.

There are many cool community pipelines like Speech to Image or Composable Stable Diffusion, and you can find all the official community pipelines here.

To load a community pipeline from the Hub, pass its repository id to the custom_pipeline argument and specify the model repository to load the pipeline weights and components from. For example, the code below loads a dummy pipeline from hf-internal-testing/diffusers-dummy-pipeline and the pipeline weights and components from google/ddpm-cifar10-32:

🔒 By loading a community pipeline from the Hugging Face Hub, you are trusting that the code you are loading is safe. Make sure to inspect the code online before loading and running it automatically!

from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline", use_safetensors=True
)
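
The custom_pipeline argument also accepts a path to a local directory containing the custom pipeline file, which is handy while you're still developing a pipeline. A minimal sketch, assuming a hypothetical ./my_pipeline_directory/ folder that holds the pipeline implementation:

from diffusers import DiffusionPipeline

# Load the custom pipeline class from a local directory
# ("./my_pipeline_directory/" is a hypothetical path used for illustration)
pipeline = DiffusionPipeline.from_pretrained(
    "google/ddpm-cifar10-32", custom_pipeline="./my_pipeline_directory/", use_safetensors=True
)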

Loading an official community pipeline is similar, except you can mix loading weights from an official repository id with passing pipeline components directly. The example below loads the community CLIP Guided Stable Diffusion pipeline and passes the CLIP model components to it directly:

from diffusers import DiffusionPipeline
from transformers import CLIPImageProcessor, CLIPModel

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"

feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id)

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    use_safetensors=True,
)
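
The loaded pipeline is then called like any other text-to-image pipeline. The sketch below assumes the clip_guided_stable_diffusion community pipeline exposes a clip_guidance_scale argument; check the pipeline's source for its exact call signature:

pipeline = pipeline.to("cuda")

# Generate an image with CLIP guidance (clip_guidance_scale is assumed here;
# see the community pipeline's source for the exact arguments it accepts)
image = pipeline(
    "a photo of an astronaut riding a horse",
    num_inference_steps=50,
    clip_guidance_scale=100,
).images[0]
image.save("astronaut.png")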

For more information about using community pipelines, take a look at the Community pipelines guide, and if you're interested in adding one, check out the How to contribute a community pipeline guide!

Community components

If your pipeline has custom components that Diffusers doesn't already support, you need to provide the Python modules that implement them. These custom components could be a VAE, UNet, scheduler, and so on. The text encoder is handled separately because it comes from Transformers (more info here). The pipeline code itself can be customized as well.

Community components let you build pipelines with customized components that are not part of Diffusers. This section shows how to use community components to build a community pipeline.

You’ll use the showlab/show-1-base pipeline checkpoint as an example. It has a custom UNet and a customized pipeline (TextToVideoIFPipeline). For convenience, let’s call the UNet ShowOneUNet3DConditionModel.

showlab/show-1-base already provides the checkpoints in the Diffusers format, which is a great starting point. Start by loading the components that are already well supported:

  1. Text encoder
from transformers import T5Tokenizer, T5EncoderModel

pipe_id = "showlab/show-1-base"
tokenizer = T5Tokenizer.from_pretrained(pipe_id, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(pipe_id, subfolder="text_encoder")
  2. Scheduler
from diffusers import DPMSolverMultistepScheduler

scheduler = DPMSolverMultistepScheduler.from_pretrained(pipe_id, subfolder="scheduler")
  3. Image processor
from transformers import CLIPFeatureExtractor

feature_extractor = CLIPFeatureExtractor.from_pretrained(pipe_id, subfolder="feature_extractor")

Now you need to implement the custom UNet. The implementation is available here, so create a Python script called showone_unet_3d_condition.py and copy over the implementation, renaming the UNet3DConditionModel class to ShowOneUNet3DConditionModel to avoid conflicts with the UNet3DConditionModel class that Diffusers already ships. Everything needed to implement the class goes into showone_unet_3d_condition.py. You can find the entire file here.
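
As a rough sketch of that file (the real implementation in the linked file is much longer), the key point is the renamed class:

# showone_unet_3d_condition.py (abridged sketch; copy the full implementation from the linked file)
from diffusers.configuration_utils import ConfigMixin
from diffusers.models.modeling_utils import ModelMixin


class ShowOneUNet3DConditionModel(ModelMixin, ConfigMixin):
    # Renamed copy of the custom UNet3DConditionModel so it doesn't clash with
    # the UNet3DConditionModel class that already ships with Diffusers
    ...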

Once this is done, we can initialize the UNet:

from showone_unet_3d_condition import ShowOneUNet3DConditionModel

unet = ShowOneUNet3DConditionModel.from_pretrained(pipe_id, subfolder="unet")

Then implement the custom TextToVideoIFPipeline in another Python script: pipeline_t2v_base_pixel.py. This is already available here.
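
A custom pipeline subclasses DiffusionPipeline and registers its components so they can be saved and loaded together. A minimal sketch of what pipeline_t2v_base_pixel.py defines (the linked file contains the full implementation):

# pipeline_t2v_base_pixel.py (abridged sketch; see the linked file for the full implementation)
from diffusers import DiffusionPipeline


class TextToVideoIFPipeline(DiffusionPipeline):
    def __init__(self, tokenizer, text_encoder, unet, scheduler, feature_extractor):
        super().__init__()
        # register_modules records the components in the pipeline config so that
        # save_pretrained, from_pretrained, and push_to_hub handle them automatically
        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            unet=unet,
            scheduler=scheduler,
            feature_extractor=feature_extractor,
        )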

Now that you have all the components, initialize the TextToVideoIFPipeline:

from pipeline_t2v_base_pixel import TextToVideoIFPipeline
import torch

pipeline = TextToVideoIFPipeline(
    unet=unet, 
    text_encoder=text_encoder, 
    tokenizer=tokenizer, 
    scheduler=scheduler, 
    feature_extractor=feature_extractor
)
# move the pipeline to the GPU and cast the weights to float16
pipeline = pipeline.to("cuda", torch.float16)
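
If you want to keep a local copy before pushing, DiffusionPipeline.save_pretrained writes every registered component plus a model_index.json to a folder (the folder name below is just an example):

# Save the assembled pipeline locally: this writes the unet, text_encoder, tokenizer,
# scheduler, and feature_extractor subfolders along with model_index.json
pipeline.save_pretrained("show-1-base-custom")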

Push the pipeline to the Hub to share it with the community:

pipeline.push_to_hub("custom-t2v-pipeline")

After the pipeline is successfully pushed, you need to make a couple of changes:

  1. In the model_index.json file, change the _class_name attribute. It should be like so.
  2. Upload showone_unet_3d_condition.py to the unet directory (example).
  3. Upload pipeline_t2v_base_pixel.py to the pipeline base directory (example).

To run inference, just do:

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "<change-username>/<change-id>", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

prompt = "hello"

# Text embeds
prompt_embeds, negative_embeds = pipeline.encode_prompt(prompt)

# Keyframes generation (8x64x40, 2fps)
video_frames = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_frames=8,
    height=40,
    width=64,
    num_inference_steps=2,
    guidance_scale=9.0,
    output_type="pt"
).frames

Here, notice the use of the trust_remote_code argument while initializing the pipeline. It tells Diffusers to download and execute the custom pipeline and component code from the repository, so only set it to True for repositories whose code you have inspected and trust.