Diffusers documentation

Stable diffusion 2

Diffusers

You are viewing v0.18.2 version. A newer version v0.34.0 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Stable diffusion 2

Stable Diffusion 2 is a text-to-image latent diffusion model built upon the work of Stable Diffusion 1. The project to train Stable Diffusion 2 was led by Robin Rombach and Katherine Crowson from Stability AI and LAION.

The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. These models are trained on an aesthetic subset of the LAION-5B dataset created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using LAION’s NSFW filter.

For more details about how Stable Diffusion 2 works and how it differs from Stable Diffusion 1, please refer to the official launch announcement post.

Tips

Available checkpoints:

Note that the architecture is more or less identical to Stable Diffusion 1 so please refer to this page for API documentation.

Text-to-Image (512x512 resolution): stabilityai/stable-diffusion-2-base with StableDiffusionPipeline
Text-to-Image (768x768 resolution): stabilityai/stable-diffusion-2 with StableDiffusionPipeline
Image Inpainting (512x512 resolution): stabilityai/stable-diffusion-2-inpainting with StableDiffusionInpaintPipeline
Super-Resolution (x4 resolution resolution): stable-diffusion-x4-upscaler StableDiffusionUpscalePipeline
Depth-to-Image (512x512 resolution): stabilityai/stable-diffusion-2-depth with StableDiffusionDepth2ImagePipeline

We recommend using the DPMSolverMultistepScheduler as it’s currently the fastest scheduler there is.

Text-to-Image

Text-to-Image (512x512 resolution): stabilityai/stable-diffusion-2-base with StableDiffusionPipeline

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")

Text-to-Image (768x768 resolution): stabilityai/stable-diffusion-2 with StableDiffusionPipeline

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")

Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed":

The paper Common Diffusion Noise Schedules and Sample Steps are Flawed claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.

The abstract reads as follows:

*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to only generate images with medium brightness and prevents it from generating very bright and dark samples. We propose a few simple fixes:

(1) rescale the noise schedule to enforce zero terminal SNR;
(2) train the model with v prediction;
(3) change the sampler to always start from the last timestep;
(4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure the diffusion process is congruent between training and inference and allow the model to generate samples more faithful to the original data distribution.*

You can apply all of these changes in diffusers when using DDIMScheduler:

(1) rescale the noise schedule to enforce zero terminal SNR;

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)

(2) train the model with v prediction; Continue fine-tuning a checkpoint with train_text_to_image.py or train_text_to_image_lora.py and --prediction_type="v_prediction".
(3) change the sampler to always start from the last timestep;

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

(4) rescale classifier-free guidance to prevent over-exposure.

pipe(..., guidance_rescale=0.7)

An example is to use this checkpoint which has been fine-tuned using the "v_prediction".

The checkpoint can then be run in inference as follows:

from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipe.to("cuda")

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipeline(prompt, guidance_rescale=0.7).images[0]

DDIMScheduler

class diffusers.DDIMScheduler

< source >

( num_train_timesteps: int = 1000 beta_start: float = 0.0001 beta_end: float = 0.02 beta_schedule: str = 'linear' trained_betas: typing.Union[numpy.ndarray, typing.List[float], NoneType] = None clip_sample: bool = True set_alpha_to_one: bool = True steps_offset: int = 0 prediction_type: str = 'epsilon' thresholding: bool = False dynamic_thresholding_ratio: float = 0.995 clip_sample_range: float = 1.0 sample_max_value: float = 1.0 timestep_spacing: str = 'leading' rescale_betas_zero_snr: bool = False )

Parameters

num_train_timesteps (int) — number of diffusion steps used to train the model.
beta_start (float) — the starting beta value of inference.
beta_end (float) — the final beta value.
beta_schedule (str) — the beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear, scaled_linear, or squaredcos_cap_v2.
trained_betas (np.ndarray, optional) — option to pass an array of betas directly to the constructor to bypass beta_start, beta_end etc.
clip_sample (bool, default True) — option to clip predicted sample for numerical stability.
clip_sample_range (float, default 1.0) — the maximum magnitude for sample clipping. Valid only when clip_sample=True.
set_alpha_to_one (bool, default True) — each diffusion step uses the value of alphas product at that step and at the previous one. For the final step there is no previous alpha. When this option is True the previous alpha product is fixed to 1, otherwise it uses the value of alpha at step 0.
steps_offset (int, default 0) — an offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False, to make the last step use step 0 for the previous alpha product, as done in stable diffusion.
prediction_type (str, default epsilon, optional) — prediction type of the scheduler function, one of epsilon (predicting the noise of the diffusion process), sample (directly predicting the noisy sample) or v_prediction` (see section 2.4 https://imagen.research.google/video/paper.pdf)
thresholding (bool, default False) — whether to use the “dynamic thresholding” method (introduced by Imagen, https://arxiv.org/abs/2205.11487). Note that the thresholding method is unsuitable for latent-space diffusion models (such as stable-diffusion).
dynamic_thresholding_ratio (float, default 0.995) — the ratio for the dynamic thresholding method. Default is 0.995, the same as Imagen (https://arxiv.org/abs/2205.11487). Valid only when thresholding=True.
sample_max_value (float, default 1.0) — the threshold value for dynamic thresholding. Valid only when thresholding=True.
timestep_spacing (str, default "leading") — The way the timesteps should be scaled. Refer to Table 2. of Common Diffusion Noise Schedules and Sample Steps are Flawed for more information.
rescale_betas_zero_snr (bool, default False) — whether to rescale the betas to have zero terminal SNR (proposed by https://arxiv.org/pdf/2305.08891.pdf). This can enable the model to generate very bright and dark samples instead of limiting it to samples with medium brightness. Loosely related to --offset_noise.

Denoising diffusion implicit models is a scheduler that extends the denoising procedure introduced in denoising diffusion probabilistic models (DDPMs) with non-Markovian guidance.

~ConfigMixin takes care of storing all config attributes that are passed in the scheduler’s __init__ function, such as num_train_timesteps. They can be accessed via scheduler.config.num_train_timesteps. SchedulerMixin provides general loading and saving functionality via the SchedulerMixin.save_pretrained() and from_pretrained() functions.

For more details, see the original paper: https://arxiv.org/abs/2010.02502

scale_model_input

< source >

( sample: FloatTensor timestep: typing.Optional[int] = None ) → torch.FloatTensor

Parameters

sample (torch.FloatTensor) — input sample
timestep (int, optional) — current timestep

Returns

torch.FloatTensor

scaled input sample

Ensures interchangeability with schedulers that need to scale the denoising model input depending on the current timestep.

set_timesteps

< source >

( num_inference_steps: int device: typing.Union[str, torch.device] = None )

Parameters

num_inference_steps (int) — the number of diffusion steps used when generating samples with a pre-trained model.

Sets the discrete timesteps used for the diffusion chain. Supporting function to be run before inference.

step

< source >

( model_output: FloatTensor timestep: int sample: FloatTensor eta: float = 0.0 use_clipped_model_output: bool = False generator = None variance_noise: typing.Optional[torch.FloatTensor] = None return_dict: bool = True ) → ~schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

Parameters

model_output (torch.FloatTensor) — direct output from learned diffusion model.
timestep (int) — current discrete timestep in the diffusion chain.
sample (torch.FloatTensor) — current instance of sample being created by diffusion process.
eta (float) — weight of noise for added noise in diffusion step.
use_clipped_model_output (bool) — if True, compute “corrected” model_output from the clipped predicted original sample. Necessary because predicted original sample is clipped to [-1, 1] when self.config.clip_sample is True. If no clipping has happened, “corrected” model_output would coincide with the one provided as input and use_clipped_model_output will have not effect. generator — random number generator.
variance_noise (torch.FloatTensor) — instead of generating noise for the variance using generator, we can directly provide the noise for the variance itself. This is useful for methods such as CycleDiffusion. (https://arxiv.org/abs/2210.05559)
return_dict (bool) — option for returning tuple rather than DDIMSchedulerOutput class

Returns

~schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

~schedulers.scheduling_utils.DDIMSchedulerOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

Predict the sample at the previous timestep by reversing the SDE. Core function to propagate the diffusion process from the learned model outputs (most often the predicted noise).

Image Inpainting

Image Inpainting (512x512 resolution): stabilityai/stable-diffusion-2-inpainting with StableDiffusionInpaintPipeline

import PIL
import requests
import torch
from io import BytesIO

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]

image.save("yellow_cat.png")

Super-Resolution

Image Upscaling (x4 resolution resolution): stable-diffusion-x4-upscaler with StableDiffusionUpscalePipeline

import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch

# load model and scheduler
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# let's download an  image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))
prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")

Depth-to-Image

Depth-Guided Text-to-Image: stabilityai/stable-diffusion-2-depth StableDiffusionDepth2ImagePipeline

import torch
import requests
from PIL import Image

from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")


url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
prompt = "two tigers"
n_propmt = "bad, deformed, ugly, bad anotomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0]

How to load and use different schedulers.

The stable diffusion pipeline uses DDIMScheduler scheduler by default. But diffusers provides many other schedulers that can be used with the stable diffusion pipeline such as PNDMScheduler, LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler etc. To use a different scheduler, you can either change it via the ConfigMixin.from_config() method or pass the scheduler argument to the from_pretrained method of the pipeline. For example, to use the EulerDiscreteScheduler, you can do the following:

>>> from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

>>> pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)

>>> # or
>>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-2", subfolder="scheduler")
>>> pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", scheduler=euler_scheduler)

←Safe Stable Diffusion Stable Diffusion XL→