Diffusers

You are viewing version v0.16.0. A newer version, v0.35.1, is available.


Pipelines

Pipelines provide a simple way to run state-of-the-art diffusion models in inference. Most diffusion systems consist of multiple independently trained models and highly adaptable scheduler components, all of which are needed for a functioning end-to-end diffusion system.

As an example, Stable Diffusion has three independently trained models:

Autoencoder
Conditional UNet
CLIP text encoder

In addition, it uses a scheduler component (scheduler), a CLIPImageProcessor, and a safety checker. All of these components are necessary to run Stable Diffusion in inference, even though they were trained or created independently of each other.

To that end, we strive to offer all open-source, state-of-the-art diffusion systems under a unified API. More specifically, we strive to provide pipelines that

1. can load the officially published weights and reproduce the outputs of the original implementation one-to-one, according to the corresponding paper (e.g. LDMTextToImagePipeline uses the officially released weights of High-Resolution Image Synthesis with Latent Diffusion Models),
2. have a simple user interface to run the model in inference (see the Pipelines API section),
3. are easy to understand, with self-explanatory code that can be read alongside the official paper (see Pipelines summary),
4. can easily be contributed by the community (see the Contribution section).

Note that pipelines do not (and should not) offer any training functionality. If you are looking for official training examples, please have a look at examples.

🧨 Diffusers Summary

The following table summarizes all officially supported pipelines, their corresponding papers, and, if available, a Colab notebook to try them out directly.

Pipeline	Paper	Tasks	Colab
alt_diffusion	AltDiffusion	Image-to-Image Text-Guided Generation	-
audio_diffusion	Audio Diffusion	Unconditional Audio Generation
controlnet	ControlNet with Stable Diffusion	Image-to-Image Text-Guided Generation
cycle_diffusion	Cycle Diffusion	Image-to-Image Text-Guided Generation
dance_diffusion	Dance Diffusion	Unconditional Audio Generation
ddpm	Denoising Diffusion Probabilistic Models	Unconditional Image Generation
ddim	Denoising Diffusion Implicit Models	Unconditional Image Generation
if	IF	Image Generation
if_img2img	IF	Image-to-Image Generation
if_inpainting	IF	Image-to-Image Generation
latent_diffusion	High-Resolution Image Synthesis with Latent Diffusion Models	Text-to-Image Generation
latent_diffusion	High-Resolution Image Synthesis with Latent Diffusion Models	Super Resolution Image-to-Image
latent_diffusion_uncond	High-Resolution Image Synthesis with Latent Diffusion Models	Unconditional Image Generation
paint_by_example	Paint by Example: Exemplar-based Image Editing with Diffusion Models	Image-Guided Image Inpainting
pndm	Pseudo Numerical Methods for Diffusion Models on Manifolds	Unconditional Image Generation
score_sde_ve	Score-Based Generative Modeling through Stochastic Differential Equations	Unconditional Image Generation
score_sde_vp	Score-Based Generative Modeling through Stochastic Differential Equations	Unconditional Image Generation
semantic_stable_diffusion	SEGA: Instructing Diffusion using Semantic Dimensions	Text-to-Image Generation
stable_diffusion_text2img	Stable Diffusion	Text-to-Image Generation
stable_diffusion_img2img	Stable Diffusion	Image-to-Image Text-Guided Generation
stable_diffusion_inpaint	Stable Diffusion	Text-Guided Image Inpainting
stable_diffusion_panorama	MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation	Text-Guided Panorama View Generation
stable_diffusion_pix2pix	InstructPix2Pix: Learning to Follow Image Editing Instructions	Text-Based Image Editing
stable_diffusion_pix2pix_zero	Zero-shot Image-to-Image Translation	Text-Based Image Editing
stable_diffusion_attend_and_excite	Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models	Text-to-Image Generation
stable_diffusion_self_attention_guidance	Self-Attention Guidance	Text-to-Image Generation
stable_diffusion_image_variation	Stable Diffusion Image Variations	Image-to-Image Generation
stable_diffusion_latent_upscale	Stable Diffusion Latent Upscaler	Text-Guided Super Resolution Image-to-Image
stable_diffusion_2	Stable Diffusion 2	Text-to-Image Generation
stable_diffusion_2	Stable Diffusion 2	Text-Guided Image Inpainting
stable_diffusion_2	Stable Diffusion 2	Depth-to-Image Text-Guided Generation
stable_diffusion_2	Stable Diffusion 2	Text-Guided Super Resolution Image-to-Image
stable_diffusion_safe	Safe Stable Diffusion	Text-Guided Generation
stable_unclip	Stable unCLIP	Text-to-Image Generation
stable_unclip	Stable unCLIP	Image-to-Image Text-Guided Generation
stochastic_karras_ve	Elucidating the Design Space of Diffusion-Based Generative Models	Unconditional Image Generation
text_to_video_sd	Modelscope’s Text-to-video-synthesis Model in Open Domain	Text-to-Video Generation
unclip	Hierarchical Text-Conditional Image Generation with CLIP Latents	Text-to-Image Generation
versatile_diffusion	Versatile Diffusion: Text, Images and Variations All in One Diffusion Model	Text-to-Image Generation
versatile_diffusion	Versatile Diffusion: Text, Images and Variations All in One Diffusion Model	Image Variations Generation
versatile_diffusion	Versatile Diffusion: Text, Images and Variations All in One Diffusion Model	Dual Image and Text Guided Generation
vq_diffusion	Vector Quantized Diffusion Model for Text-to-Image Synthesis	Text-to-Image Generation
text_to_video_zero	Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators	Text-to-Video Generation

Note: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.

However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the Examples below.
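As a rough sketch of what swapping the scheduler of a loaded pipeline can look like: the helper below rebuilds a different scheduler class from the configuration of the pipeline's current scheduler via `from_config`. The names `swap_scheduler` and `example_usage` are hypothetical and not part of Diffusers. `example_usage` assumes that diffusers is installed, a CUDA device is present, and the `runwayml/stable-diffusion-v1-5` weights can be downloaded.

```python
def swap_scheduler(pipe, scheduler_cls):
    """Replace pipe.scheduler with scheduler_cls, built from the current scheduler's config."""
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    return pipe


def example_usage():
    # Imported lazily so that defining these helpers does not require diffusers.
    from diffusers import LMSDiscreteScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    # Swap the default scheduler for LMSDiscreteScheduler, then move to the GPU.
    pipe = swap_scheduler(pipe, LMSDiscreteScheduler).to("cuda")
    return pipe("a photo of an astronaut riding a horse on mars").images[0]
```

Building the new scheduler from the existing configuration keeps settings such as the noise schedule consistent with what the model was trained with, which is why the sketch does not construct the scheduler from scratch.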

Pipelines API

Diffusion systems often consist of multiple independently trained models or other previously existing components.

Each model has been trained independently on a different task, and the scheduler can easily be swapped out and replaced with a different one. During inference, however, we want to be able to load all components easily and use them together, even if one component, e.g. CLIP’s text encoder, originates from a different library, such as Transformers. To that end, all pipelines provide the following functionality:

from_pretrained method that accepts a Hugging Face Hub repository id, e.g. runwayml/stable-diffusion-v1-5, or a path to a local directory, e.g. "./stable-diffusion". To correctly determine which models and components should be loaded, the repository or directory has to provide a model_index.json file, e.g. runwayml/stable-diffusion-v1-5/model_index.json, which defines all components that should be loaded into the pipeline. More specifically, each model/component needs an entry of the format <name>: ["<library>", "<class name>"]. <name> is the attribute name given to the loaded instance of <class name>, which can be found in the library or pipeline folder called "<library>".
save_pretrained method that accepts a local path, e.g. ./stable-diffusion, under which all models/components of the pipeline will be saved. For each component/model, a folder named after the given attribute name is created inside the local path, e.g. ./stable-diffusion/unet. In addition, a model_index.json file is created at the root of the local path, e.g. ./stable-diffusion/model_index.json, so that the complete pipeline can be instantiated again from the local path.
to method that accepts a string or torch.device and moves all models of type torch.nn.Module to the passed device. The behavior is fully analogous to PyTorch’s to method.
__call__ method to use the pipeline in inference. __call__ defines the inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing to forwarding tensors to the different models and schedulers, as well as post-processing. The API of the __call__ method can vary strongly from pipeline to pipeline. For example, a text-to-image pipeline such as StableDiffusionPipeline should accept, among other things, the text prompt used to generate the image. A pure image generation pipeline such as DDPMPipeline, on the other hand, can be run without providing any inputs. To better understand which inputs can be adapted for each pipeline, one should look directly into the respective pipeline.
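To make the `<name>: ["<library>", "<class name>"]` format more concrete, here is a minimal sketch that reads a model_index.json-style document and lists its components. The JSON content is an abridged illustration written for this example, so the actual file in a given repository may differ. The helper `list_components` is hypothetical, and treating keys that start with an underscore as metadata is an assumption of this sketch.

```python
import json

# Abridged, illustrative model_index.json content (an actual file may differ).
MODEL_INDEX = """
{
  "_class_name": "StableDiffusionPipeline",
  "scheduler": ["diffusers", "PNDMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
"""


def list_components(model_index_text):
    """Return (name, library, class_name) for every component entry."""
    index = json.loads(model_index_text)
    components = []
    for name, value in index.items():
        # Assumption: keys starting with "_" hold metadata, not components.
        if name.startswith("_"):
            continue
        library, class_name = value
        components.append((name, library, class_name))
    return components


for name, library, class_name in list_components(MODEL_INDEX):
    # Each entry becomes an attribute of the pipeline, e.g. pipe.unet.
    print(f"pipe.{name} <- {library}.{class_name}")
```

Each listed name corresponds both to an attribute on the loaded pipeline and to the sub-folder that save_pretrained writes for that component.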

Note: All pipelines have PyTorch’s autograd disabled by decorating the __call__ method with a torch.no_grad decorator because pipelines should not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our community-examples.

Contribution

We are more than happy about any contribution to the officially supported pipelines 🤗. We want all of our pipelines to be self-contained, easy-to-tweak, beginner-friendly, and one-purpose-only.

Self-contained: A pipeline shall be as self-contained as possible. More specifically, this means that all functionality should be either directly defined in the pipeline file itself, should be inherited from (and only from) the DiffusionPipeline class or be directly attached to the model and scheduler components of the pipeline.
Easy-to-use: Pipelines should be extremely easy to use - one should be able to load the pipeline and use it for its designated task, e.g. text-to-image generation, in just a couple of lines of code. Most logic including pre-processing, an unrolled diffusion loop, and post-processing should all happen inside the __call__ method.
Easy-to-tweak: Certain pipelines will not be able to handle all use cases and tasks that you might like them to. If you want to use a certain pipeline for a specific use case that is not yet supported, you might have to copy the pipeline file and tweak the code to your needs. We try to make the pipeline code as readable as possible so that each part, from pre-processing to diffusing to post-processing, can easily be adapted. If you would like the community to benefit from your customized pipeline, we would love to see a contribution to our community-examples. If you feel that an important pipeline should be part of the official pipelines but isn’t, a contribution to the official pipelines would be even better.
One-purpose-only: Pipelines should be used for one task and one task only. Even if two tasks are very similar from a modeling point of view, e.g. image2image translation and in-painting, pipelines shall be used for one task only to keep them easy-to-tweak and readable.
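The principles above can be illustrated with a toy sketch in plain Python. `ToyPipeline` is a hypothetical stand-in and not a Diffusers class: it works on lists of floats instead of tensors, and its update rule is made up for illustration. The point is only the shape of the code, with pre-processing, an unrolled loop, and post-processing all visible inside `__call__` for a single task.

```python
import random


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not part of Diffusers."""

    def __init__(self, denoiser, num_steps=10):
        self.denoiser = denoiser  # callable(sample, t) -> predicted noise
        self.num_steps = num_steps

    def __call__(self, size=4, seed=0):
        # Pre-processing: draw the initial noise from a seeded generator.
        rng = random.Random(seed)
        sample = [rng.gauss(0.0, 1.0) for _ in range(size)]

        # Unrolled loop: kept inline so it can be copied and tweaked.
        for t in reversed(range(self.num_steps)):
            noise = self.denoiser(sample, t)
            sample = [x - n / self.num_steps for x, n in zip(sample, noise)]

        # Post-processing: clamp to [-1, 1], then rescale to [0, 1].
        return [(min(1.0, max(-1.0, x)) + 1.0) / 2.0 for x in sample]


# A trivial "denoiser" that treats the whole sample as noise.
pipe = ToyPipeline(denoiser=lambda sample, t: sample)
print(pipe(size=4, seed=0))
```

Because everything happens in one method of one class, adapting the toy to a different task means copying the class and editing the relevant step, which mirrors the easy-to-tweak and one-purpose-only goals.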

Examples

Text-to-Image generation with Stable Diffusion

# make sure you're logged in with `huggingface-cli login`
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")

Image-to-Image text-guided generation with Stable Diffusion

The StableDiffusionImg2ImgPipeline lets you pass a text prompt and an initial image to condition the generation of new images.

import requests
import torch
from PIL import Image
from io import BytesIO

from diffusers import StableDiffusionImg2ImgPipeline

# load the pipeline
device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(
    device
)

# let's download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"

images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images

images[0].save("fantasy_landscape.png")

You can also run this example on colab

Tweak prompts reusing seeds and latents

You can generate your own latents to reproduce results, or tweak your prompt on a specific result you liked. This notebook shows how to do it step by step. You can also run it in Google Colab
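One simple way to reuse a seed, sketched under a few assumptions: the helper `generate_with_seed` is a hypothetical name, `pipe` is an already loaded pipeline whose `__call__` accepts a `generator` argument (as StableDiffusionPipeline does), and a CPU generator is chosen here purely for illustration.

```python
def generate_with_seed(pipe, prompt, seed=0):
    """Run pipe on prompt with a seeded generator so the result can be reproduced."""
    # Imported lazily so that defining this helper does not require PyTorch.
    import torch

    # Assumption: a CPU generator; the same seed then yields the same initial noise.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]
```

Calling the helper twice with the same prompt and seed should start from the same initial noise, and keeping the seed fixed while editing the prompt lets you tweak a result you liked.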

In-painting using Stable Diffusion

The StableDiffusionInpaintPipeline lets you edit specific parts of an image by providing a mask and text prompt.

import PIL.Image
import requests
import torch
from io import BytesIO

from diffusers import StableDiffusionInpaintPipeline


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

You can also run this example on colab
