Pipelines
Pipelines provide a simple way to run state-of-the-art diffusion models in inference. Most diffusion systems consist of multiple independently-trained models and highly adaptable scheduler components - all of which are needed to have a functioning end-to-end diffusion system.
As an example, Stable Diffusion has three independently trained models:
- Autoencoder
- Conditional Unet
- CLIP text encoder
In addition, it uses a scheduler component, a CLIPImageProcessor, and a safety checker. All of these components are necessary to run Stable Diffusion in inference even though they were trained or created independently from each other.
To that end, we strive to offer all open-sourced, state-of-the-art diffusion systems under a unified API. More specifically, we strive to provide pipelines that:
- can load the officially published weights and reproduce one-to-one the outputs of the original implementation according to the corresponding paper (e.g. LDMTextToImagePipeline uses the officially released weights of High-Resolution Image Synthesis with Latent Diffusion Models),
- have a simple user interface to run the model in inference (see the Pipelines API section),
- are easy to understand, with code that is self-explanatory and can be read alongside the official paper (see Pipelines summary),
- can easily be contributed by the community (see the Contribution section).
Note that pipelines do not (and should not) offer any training functionality. If you are looking for official training examples, please have a look at examples.
🧨 Diffusers Summary
The following table summarizes all officially supported pipelines, their corresponding paper, and if available a colab notebook to directly try them out.
Note: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.
However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the Examples below.
Pipelines API
Diffusion models often consist of multiple independently-trained models or other previously existing components.
Each model has been trained independently on a different task and the scheduler can easily be swapped out and replaced with a different one. During inference, however, we want to be able to easily load all components and use them together, even if one component, e.g. CLIP’s text encoder, originates from a different library, such as Transformers. To that end, all pipelines provide the following functionality:
- from_pretrained method that accepts a Hugging Face Hub repository id, e.g. runwayml/stable-diffusion-v1-5, or a path to a local directory, e.g. "./stable-diffusion". To correctly retrieve which models and components should be loaded, one has to provide a model_index.json file, e.g. runwayml/stable-diffusion-v1-5/model_index.json, which defines all components that should be loaded into the pipelines. More specifically, for each model/component one needs to define the format <name>: ["<library>", "<class name>"]. <name> is the attribute name given to the loaded instance of <class name>, which can be found in the library or pipeline folder called "<library>".
- save_pretrained that accepts a local path, e.g. ./stable-diffusion, under which all models/components of the pipeline will be saved. For each component/model a folder is created inside the local path that is named after the given attribute name, e.g. ./stable_diffusion/unet. In addition, a model_index.json file is created at the root of the local path, e.g. ./stable_diffusion/model_index.json, so that the complete pipeline can again be instantiated from the local path.
- to which accepts a string or torch.device to move all models that are of type torch.nn.Module to the passed device. The behavior is fully analogous to PyTorch’s to method.
- __call__ method to use the pipeline in inference. __call__ defines inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing to forwarding tensors to the different models and schedulers, as well as post-processing. The API of the __call__ method can strongly vary from pipeline to pipeline. E.g. a text-to-image pipeline, such as StableDiffusionPipeline, should accept among other things the text prompt to generate the image. A pure image generation pipeline, such as DDPMPipeline, on the other hand can be run without providing any inputs. To better understand what inputs can be adapted for each pipeline, one should look directly into the respective pipeline.
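The model_index.json format and the save_pretrained folder layout described above can be sketched with the standard library alone. This is only an illustration, not the library's loading code: the JSON content is an abbreviated, hand-written example, the helper names parse_components and saved_layout are hypothetical, and treating keys that start with "_" as metadata is an assumption.

```python
import json
from pathlib import PurePosixPath

# Illustrative, abbreviated model_index.json content (not a verbatim copy of
# any published file). Keys starting with "_" are assumed to be metadata.
MODEL_INDEX_TEXT = """
{
  "_class_name": "StableDiffusionPipeline",
  "scheduler": ["diffusers", "PNDMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
"""


def parse_components(index_text):
    """Map each attribute name to its (library, class name) pair."""
    index = json.loads(index_text)
    components = {}
    for name, value in index.items():
        if name.startswith("_"):
            continue  # skip assumed metadata entries
        library, class_name = value
        components[name] = (library, class_name)
    return components


def saved_layout(local_path, components):
    """List the paths that save_pretrained is described as creating."""
    root = PurePosixPath(local_path)
    # One folder per component, named after its attribute name.
    paths = [str(root / name) for name in sorted(components)]
    # Plus the index file at the root of the local path.
    paths.append(str(root / "model_index.json"))
    return paths


components = parse_components(MODEL_INDEX_TEXT)
for name, (library, class_name) in components.items():
    print(f"pipeline.{name} <- {library}.{class_name}")

for path in saved_layout("./stable_diffusion", components):
    print(path)
```

Each entry reads as "load <class name> from <library> and attach it to the pipeline under <name>", which is why the saved folder names match the attribute names.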
Note: All pipelines have PyTorch’s autograd disabled by decorating the __call__ method with a torch.no_grad decorator because pipelines should not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our community-examples.
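The effect of decorating __call__ with torch.no_grad can be seen in a small sketch. TinyPipeline is a hypothetical stand-in written for this example, not a class from the library; it wraps a single linear layer so that the difference between a decorated and an undecorated forward pass is visible.

```python
import torch


class TinyPipeline:
    """Hypothetical stand-in for a pipeline, holding one small model."""

    def __init__(self):
        self.model = torch.nn.Linear(4, 2)

    @torch.no_grad()  # autograd is disabled for everything inside __call__
    def __call__(self, x):
        return self.model(x)


pipe = TinyPipeline()
x = torch.randn(1, 4)

# Calling the pipeline: no graph is recorded, so the output carries no gradient.
out_no_grad = pipe(x)
print(out_no_grad.requires_grad)

# Calling the model directly: autograd tracks the layer's parameters.
out_with_grad = pipe.model(x)
print(out_with_grad.requires_grad)
```

A custom pipeline that needs gradients would simply leave the decorator off its own __call__ method.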