GaudiStableDiffusionPipeline

The GaudiStableDiffusionPipeline class enables text-to-image generation on HPUs. It inherits from GaudiDiffusionPipeline, the parent class of all diffuser pipelines.

For best performance, pair it with a scheduler that is optimized for HPUs, such as GaudiDDIMScheduler.

GaudiStableDiffusionPipeline

class optimum.habana.diffusers.GaudiStableDiffusionPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: Union = None bf16_full_eval: bool = False )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for more details about a model’s potential harms.
feature_extractor (CLIPImageProcessor) — A CLIPImageProcessor to extract features from generated images; used as inputs to the safety_checker.
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub. Or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Adapted from: https://github.com/huggingface/diffusers/blob/v0.23.1/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L73

Generation is performed by batches
Two mark_step() were added to add support for lazy mode
Added support for HPU graphs

call

( prompt: Union = None height: Optional = None width: Optional = None num_inference_steps: int = 50 timesteps: List = None guidance_scale: float = 7.5 negative_prompt: Union = None num_images_per_prompt: Optional = 1 batch_size: int = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None ip_adapter_image: Union = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 clip_skip: Optional = None callback_on_step_end: Optional = None callback_on_step_end_tensor_inputs: List = ['latents'] **kwargs ) → GaudiStableDiffusionPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated images.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated images.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
batch_size (int, optional, defaults to 1) — The number of images in a batch.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
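
As a minimal sketch of the callback contract described above, the function below takes the documented arguments in order and returns callback_kwargs. The names log_latents_callback and seen_steps are hypothetical. Returning the dictionary follows upstream Diffusers behaviour, which is assumed to carry over to the Gaudi pipeline.

```python
from typing import Any, Dict, List

# Hypothetical bookkeeping list used only for this illustration.
seen_steps: List[int] = []


def log_latents_callback(
    pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    # Only the names listed in callback_on_step_end_tensor_inputs are present.
    _latents = callback_kwargs.get("latents")
    # Assumption: the pipeline reads (possibly modified) tensors from the return value.
    return callback_kwargs
```

It would be passed as callback_on_step_end=log_latents_callback together with callback_on_step_end_tensor_inputs=["latents"].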

Returns

GaudiStableDiffusionPipelineOutput or tuple

If return_dict is True, a GaudiStableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.
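
The following is a sketch of a typical call rather than a definitive recipe. The checkpoint name, the "Habana/stable-diffusion" Gaudi configuration and the prompt are assumptions. The helper num_batches is hypothetical and only illustrates how num_images_per_prompt and batch_size are assumed to interact when generation is performed by batches. The library imports are deferred into the function because they need optimum-habana and a Gaudi device.

```python
from math import ceil


def num_batches(num_prompts: int, num_images_per_prompt: int, batch_size: int) -> int:
    """Hypothetical helper: batches needed to produce every requested image.

    Assumption: total images = num_prompts * num_images_per_prompt, rounded up
    to whole batches of batch_size.
    """
    return ceil(num_prompts * num_images_per_prompt / batch_size)


def generate_on_gaudi(prompt: str = "An image of a squirrel in Picasso style"):
    """Sketch only: requires optimum-habana and Gaudi hardware to run."""
    from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

    model_name = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
    scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
    pipeline = GaudiStableDiffusionPipeline.from_pretrained(
        model_name,
        scheduler=scheduler,
        use_habana=True,
        use_hpu_graphs=True,
        gaudi_config="Habana/stable-diffusion",  # assumed Hub configuration
    )
    # 1 prompt x 16 images per prompt, generated in batches of 4.
    outputs = pipeline(prompt=[prompt], num_images_per_prompt=16, batch_size=4)
    return outputs.images
```

Under these assumptions, one prompt with num_images_per_prompt=16 and batch_size=4 is processed as four batches.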

GaudiDiffusionPipeline

class optimum.habana.diffusers.GaudiDiffusionPipeline

( use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: Union = None bf16_full_eval: bool = False )

Parameters

use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub. Or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Extends the DiffusionPipeline class:

The pipeline is initialized on Gaudi if use_habana=True.
The pipeline’s Gaudi configuration is saved and pushed to the hub.

from_pretrained

( pretrained_model_name_or_path: Union **kwargs )

More information here.

save_pretrained

( save_directory: Union safe_serialization: bool = True variant: Optional = None push_to_hub: bool = False **kwargs )

Parameters

save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn’t exist.
safe_serialization (bool, optional, defaults to True) — Whether to save the model using safetensors or the traditional PyTorch way (that uses pickle).
variant (str, optional) — If specified, weights are saved in the format pytorch_model.<variant>.bin.
push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the PushToHubMixin.push_to_hub method.

Save the pipeline and Gaudi configurations. More information here.

GaudiDDIMScheduler

class optimum.habana.diffusers.GaudiDDIMScheduler

( num_train_timesteps: int = 1000 beta_start: float = 0.0001 beta_end: float = 0.02 beta_schedule: str = 'linear' trained_betas: Union = None clip_sample: bool = True set_alpha_to_one: bool = True steps_offset: int = 0 prediction_type: str = 'epsilon' thresholding: bool = False dynamic_thresholding_ratio: float = 0.995 clip_sample_range: float = 1.0 sample_max_value: float = 1.0 timestep_spacing: str = 'leading' rescale_betas_zero_snr: bool = False )

Parameters

num_train_timesteps (int, defaults to 1000) — The number of diffusion steps to train the model.
beta_start (float, defaults to 0.0001) — The starting beta value of inference.
beta_end (float, defaults to 0.02) — The final beta value.
beta_schedule (str, defaults to "linear") — The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear, scaled_linear, or squaredcos_cap_v2.
trained_betas (np.ndarray, optional) — Pass an array of betas directly to the constructor to bypass beta_start and beta_end.
clip_sample (bool, defaults to True) — Clip the predicted sample for numerical stability.
clip_sample_range (float, defaults to 1.0) — The maximum magnitude for sample clipping. Valid only when clip_sample=True.
set_alpha_to_one (bool, defaults to True) — Each diffusion step uses the alphas product value at that step and at the previous one. For the final step there is no previous alpha. When this option is True the previous alpha product is fixed to 1, otherwise it uses the alpha value at step 0.
steps_offset (int, defaults to 0) — An offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False to make the last step use step 0 for the previous alpha product like in Stable Diffusion.
prediction_type (str, optional, defaults to "epsilon") — Prediction type of the scheduler function; can be epsilon (predicts the noise of the diffusion process), sample (directly predicts the noisy sample) or v_prediction (see section 2.4 of the Imagen Video paper).
thresholding (bool, defaults to False) — Whether to use the “dynamic thresholding” method. This is unsuitable for latent-space diffusion models such as Stable Diffusion.
dynamic_thresholding_ratio (float, defaults to 0.995) — The ratio for the dynamic thresholding method. Valid only when thresholding=True.
sample_max_value (float, defaults to 1.0) — The threshold value for dynamic thresholding. Valid only when thresholding=True.
timestep_spacing (str, defaults to "leading") — The way the timesteps should be scaled. Refer to Table 2 of the Common Diffusion Noise Schedules and Sample Steps are Flawed for more information.
rescale_betas_zero_snr (bool, defaults to False) — Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and dark samples instead of limiting it to samples with medium brightness. Loosely related to --offset_noise.

Extends Diffusers’ DDIMScheduler to run optimally on Gaudi:

All time-dependent parameters are generated at the beginning
At each time step, tensors are rolled to update the values of the time-dependent parameters

get_params

( timestep: Optional = None )

Parameters

timestep (int, optional) — The current discrete timestep in the diffusion chain. Optionally used to initialize parameters in cases which start in the middle of the denoising schedule (e.g. for image-to-image).

Initialize the time-dependent parameters, and retrieve the time-dependent parameters at each timestep. The tensors are rolled in a separate function at the end of the scheduler step in case parameters are retrieved multiple times in a timestep, e.g., when scaling model inputs and in the scheduler step.

roll_params

( )

Roll tensors to update the values of the time-dependent parameters at each timestep.
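
The get_params / roll_params split can be illustrated with a pure-Python analogy. This is a sketch under stated assumptions, not the scheduler's implementation: the real scheduler rolls tensors, whereas here a collections.deque is rotated. The class RollingParams and the helper functions are hypothetical, and the sketch assumes a linear beta schedule, "leading" timestep spacing and set_alpha_to_one=True.

```python
from collections import deque
from typing import List, Tuple


def linear_betas(n: int = 1000, beta_start: float = 0.0001, beta_end: float = 0.02) -> List[float]:
    """Evenly spaced betas from beta_start to beta_end."""
    step = (beta_end - beta_start) / (n - 1)
    return [beta_start + i * step for i in range(n)]


def alphas_cumprod(betas: List[float]) -> List[float]:
    """Cumulative product of (1 - beta)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


class RollingParams:
    """Hypothetical stand-in: precompute per-step values once, then rotate."""

    def __init__(self, num_inference_steps: int, num_train_timesteps: int = 1000):
        acp = alphas_cumprod(linear_betas(num_train_timesteps))
        ratio = num_train_timesteps // num_inference_steps
        # "leading" spacing, in descending order.
        self.timesteps = [i * ratio for i in range(num_inference_steps)][::-1]
        pairs: List[Tuple[float, float]] = []
        for t in self.timesteps:
            prev_t = t - ratio
            # With set_alpha_to_one=True the final step uses 1.0 as previous alpha product.
            pairs.append((acp[t], acp[prev_t] if prev_t >= 0 else 1.0))
        self._params = deque(pairs)

    def get_params(self) -> Tuple[float, float]:
        """Values for the current step; safe to call several times per step."""
        return self._params[0]

    def roll_params(self) -> None:
        """Advance one step so the next step's values sit at index 0."""
        self._params.rotate(-1)
```

Because reading and rolling are separate, calling get_params twice within one step returns the same values, which matches the reason given above for keeping the roll in its own function.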

step

( model_output: FloatTensor timestep: int sample: FloatTensor eta: float = 0.0 use_clipped_model_output: bool = False generator = None variance_noise: Optional = None return_dict: bool = True ) → diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

Parameters

model_output (torch.FloatTensor) — The direct output from the learned diffusion model.
sample (torch.FloatTensor) — A current instance of a sample created by the diffusion process.
eta (float) — The weight of the noise added in a diffusion step.
use_clipped_model_output (bool, defaults to False) — If True, computes “corrected” model_output from the clipped predicted original sample. Necessary because predicted original sample is clipped to [-1, 1] when self.config.clip_sample is True. If no clipping has happened, “corrected” model_output would coincide with the one provided as input and use_clipped_model_output has no effect.
generator (torch.Generator, optional) — A random number generator.
variance_noise (torch.FloatTensor) — Alternative to generating noise with generator by directly providing the noise for the variance itself. Useful for methods such as CycleDiffusion.
return_dict (bool, optional, defaults to True) — Whether or not to return a DDIMSchedulerOutput or tuple.

Returns

diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

If return_dict is True, DDIMSchedulerOutput is returned, otherwise a tuple is returned where the first element is the sample tensor.

Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion process from the learned model outputs (most often the predicted noise).

GaudiStableDiffusionXLPipeline

The GaudiStableDiffusionXLPipeline class enables text-to-image generation on HPUs using SDXL models. It inherits from GaudiDiffusionPipeline, the parent class of all diffuser pipelines.

To get the most out of it, it should be associated with a scheduler that is optimized for HPUs like GaudiDDIMScheduler. Recommended schedulers are GaudiEulerDiscreteScheduler for SDXL base and GaudiEulerAncestralDiscreteScheduler for SDXL turbo.

GaudiStableDiffusionXLPipeline

class optimum.habana.diffusers.GaudiStableDiffusionXLPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = True use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: Union = None bf16_full_eval: bool = False )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings are always forced to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1-0.
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub. Or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Pipeline for text-to-image generation using Stable Diffusion XL on Gaudi devices.

Adapted from: https://github.com/huggingface/diffusers/blob/v0.23.1/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L96

Extends the StableDiffusionXLPipeline class:

Generation is performed by batches
Two mark_step() were added to add support for lazy mode
Added support for HPU graphs

call

( prompt: Union = None prompt_2: Union = None height: Optional = None width: Optional = None num_inference_steps: int = 50 timesteps: List = None denoising_end: Optional = None guidance_scale: float = 5.0 negative_prompt: Union = None negative_prompt_2: Union = None num_images_per_prompt: Optional = 1 batch_size: int = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None pooled_prompt_embeds: Optional = None negative_pooled_prompt_embeds: Optional = None ip_adapter_image: Union = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 original_size: Optional = None crops_coords_top_left: Tuple = (0, 0) target_size: Optional = None negative_original_size: Optional = None negative_crops_coords_top_left: Tuple = (0, 0) negative_target_size: Optional = None clip_skip: Optional = None callback_on_step_end: Optional = None callback_on_step_end_tensor_inputs: List = ['latents', 'prompt_embeds', 'negative_prompt_embeds', 'add_text_embeds', 'add_time_ids', 'negative_pooled_prompt_embeds', 'negative_add_time_ids'] **kwargs ) → GaudiStableDiffusionXLPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to the tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text-encoders.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
batch_size (int, optional, defaults to 1) — The number of images in a batch.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to DDIMScheduler and is ignored by other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from prompt input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionXLPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (height, width). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

GaudiStableDiffusionXLPipelineOutput or tuple

If return_dict is True, a GaudiStableDiffusionXLPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:
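
A minimal sketch of SDXL generation on Gaudi, assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint and the "Habana/stable-diffusion" Gaudi configuration from the Hub. The helper sdxl_pipeline_kwargs, the function generate_sdxl and the prompt are hypothetical. Library imports are deferred into the function because they require optimum-habana and Gaudi hardware.

```python
from typing import Any, Dict


def sdxl_pipeline_kwargs(scheduler: Any = None) -> Dict[str, Any]:
    """Hypothetical helper collecting the Gaudi-specific constructor arguments."""
    return {
        "scheduler": scheduler,
        "use_habana": True,       # run on Gaudi rather than CPU
        "use_hpu_graphs": True,   # enable HPU graphs
        "gaudi_config": "Habana/stable-diffusion",  # assumed Hub configuration
    }


def generate_sdxl(prompt: str = "Sailing ship painting by Van Gogh"):
    """Sketch only: requires optimum-habana and Gaudi hardware to run."""
    from optimum.habana.diffusers import (
        GaudiEulerDiscreteScheduler,
        GaudiStableDiffusionXLPipeline,
    )

    model_name = "stabilityai/stable-diffusion-xl-base-1.0"
    # GaudiEulerDiscreteScheduler is the scheduler recommended above for SDXL base.
    scheduler = GaudiEulerDiscreteScheduler.from_pretrained(model_name, subfolder="scheduler")
    pipeline = GaudiStableDiffusionXLPipeline.from_pretrained(
        model_name, **sdxl_pipeline_kwargs(scheduler)
    )
    outputs = pipeline(prompt=[prompt], num_images_per_prompt=4, batch_size=2)
    return outputs.images
```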

GaudiEulerDiscreteScheduler

class optimum.habana.diffusers.GaudiEulerDiscreteScheduler

( num_train_timesteps: int = 1000 beta_start: float = 0.0001 beta_end: float = 0.02 beta_schedule: str = 'linear' trained_betas: Union = None prediction_type: str = 'epsilon' interpolation_type: str = 'linear' use_karras_sigmas: Optional = False sigma_min: Optional = None sigma_max: Optional = None timestep_spacing: str = 'linspace' timestep_type: str = 'discrete' steps_offset: int = 0 rescale_betas_zero_snr: bool = False )

Parameters

num_train_timesteps (int, defaults to 1000) — The number of diffusion steps to train the model.
beta_start (float, defaults to 0.0001) — The starting beta value of inference.
beta_end (float, defaults to 0.02) — The final beta value.
beta_schedule (str, defaults to "linear") — The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear or scaled_linear.
trained_betas (np.ndarray, optional) — Pass an array of betas directly to the constructor to bypass beta_start and beta_end.
prediction_type (str, optional, defaults to "epsilon") — Prediction type of the scheduler function; can be epsilon (predicts the noise of the diffusion process), sample (directly predicts the noisy sample) or v_prediction (see section 2.4 of the Imagen Video paper).
interpolation_type (str, optional, defaults to "linear") — The interpolation type to compute intermediate sigmas for the scheduler denoising steps. Should be one of "linear" or "log_linear".
use_karras_sigmas (bool, optional, defaults to False) — Whether to use Karras sigmas for step sizes in the noise schedule during the sampling process. If True, the sigmas are determined according to a sequence of noise levels {σi}.
timestep_spacing (str, defaults to "linspace") — The way the timesteps should be scaled. Refer to Table 2 of the Common Diffusion Noise Schedules and Sample Steps are Flawed for more information.
steps_offset (int, defaults to 0) — An offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False to make the last step use step 0 for the previous alpha product like in Stable Diffusion.
rescale_betas_zero_snr (bool, defaults to False) — Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and dark samples instead of limiting it to samples with medium brightness. Loosely related to --offset_noise.
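
To make the use_karras_sigmas option more concrete, here is a sketch of the noise-level sequence {σi} as proposed by Karras et al. (2022). The function name karras_sigmas is hypothetical, rho=7 is the value commonly used with this formula, and the sigma_min and sigma_max values in the usage note are made up for illustration. It is not taken from this scheduler's source.

```python
from typing import List


def karras_sigmas(sigma_min: float, sigma_max: float, num_steps: int, rho: float = 7.0) -> List[float]:
    """Interpolate between sigma_max and sigma_min in sigma**(1/rho) space."""
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(num_steps):
        # ramp goes from 0 (first step) to 1 (last step).
        ramp = i / (num_steps - 1) if num_steps > 1 else 0.0
        sigmas.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas
```

For example, karras_sigmas(0.1, 10.0, 5) starts at roughly sigma_max, ends at roughly sigma_min, and decreases monotonically, with steps concentrated at low noise levels.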

Extends Diffusers’ EulerDiscreteScheduler to run optimally on Gaudi:

All time-dependent parameters are generated at the beginning
At each time step, tensors are rolled to update the values of the time-dependent parameters

roll_params

( )

Roll tensors to update the values of the time-dependent parameters at each timestep.

GaudiEulerAncestralDiscreteScheduler

class optimum.habana.diffusers.GaudiEulerAncestralDiscreteScheduler

Parameters

num_train_timesteps (int, defaults to 1000) — The number of diffusion steps to train the model.
beta_start (float, defaults to 0.0001) — The starting beta value of inference.
beta_end (float, defaults to 0.02) — The final beta value.
beta_schedule (str, defaults to "linear") — The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear or scaled_linear.
trained_betas (np.ndarray, optional) — Pass an array of betas directly to the constructor to bypass beta_start and beta_end.
prediction_type (str, defaults to epsilon, optional) — Prediction type of the scheduler function; can be epsilon (predicts the noise of the diffusion process), sample (directly predicts the noisy sample) or v_prediction (see section 2.4 of the Imagen Video paper).
timestep_spacing (str, defaults to "linspace") — The way the timesteps should be scaled. Refer to Table 2 of the paper Common Diffusion Noise Schedules and Sample Steps are Flawed for more information.
steps_offset (int, defaults to 0) — An offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False to make the last step use step 0 for the previous alpha product like in Stable Diffusion.
rescale_betas_zero_snr (bool, defaults to False) — Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and dark samples instead of limiting it to samples with medium brightness. Loosely related to --offset_noise.
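
To make timestep_spacing and steps_offset more concrete, the sketch below builds the inference timesteps for the three strategies named in that table ("linspace", "leading", "trailing") in plain Python. The helper name make_timesteps is hypothetical, and the Gaudi scheduler may differ in dtype and rounding details; treat this as an approximation of the idea rather than the scheduler's code.

```python
from typing import List


def make_timesteps(
    num_train_timesteps: int,
    num_inference_steps: int,
    spacing: str = "linspace",
    steps_offset: int = 0,
) -> List[float]:
    """Hypothetical helper returning timesteps in descending order."""
    n = num_inference_steps
    if spacing == "linspace":
        # Evenly spread over [0, num_train_timesteps - 1], both ends included.
        last = num_train_timesteps - 1
        steps = [0.0] if n == 1 else [last * i / (n - 1) for i in range(n)]
        return steps[::-1]
    if spacing == "leading":
        # Integer stride starting at 0, then shifted by steps_offset.
        ratio = num_train_timesteps // n
        return [float(i * ratio + steps_offset) for i in range(n)][::-1]
    if spacing == "trailing":
        # Stride counted back from the last training timestep.
        ratio = num_train_timesteps / n
        return [float(round(num_train_timesteps - i * ratio) - 1) for i in range(n)]
    raise ValueError(f"unknown spacing: {spacing}")


linspace_steps = make_timesteps(1000, 10, "linspace")
leading_steps = make_timesteps(1000, 10, "leading", steps_offset=1)
trailing_steps = make_timesteps(1000, 10, "trailing")
```

In this sketch only "linspace" and "trailing" start from the final training timestep (999 here), while "leading" starts lower and is the only variant that uses steps_offset.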

Extends Diffusers’ EulerAncestralDiscreteScheduler to run optimally on Gaudi:

All time-dependent parameters are generated at the beginning
At each time step, tensors are rolled to update the values of the time-dependent parameters

get_params

( timestep: Union )

Parameters

timestep (float) — The current discrete timestep in the diffusion chain. Optionally used to initialize parameters in cases that start in the middle of the denoising schedule (e.g. for image-to-image).

roll_params

( )

Roll tensors to update the values of the time-dependent parameters at each timestep.

GaudiStableDiffusionUpscalePipeline

The GaudiStableDiffusionUpscalePipeline enhances the resolution of input images by a factor of 4 on HPUs. It inherits from the GaudiDiffusionPipeline class, which is the parent class of every diffusers pipeline for Gaudi.

class optimum.habana.diffusers.GaudiStableDiffusionUpscalePipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel low_res_scheduler: DDPMScheduler scheduler: KarrasDiffusionSchedulers safety_checker: Optional = None feature_extractor: Optional = None watermarker: Optional = None max_noise_level: int = 350 use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: Union = None bf16_full_eval: bool = False )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
low_res_scheduler (SchedulerMixin) — A scheduler used to add initial noise to the low resolution conditioning image. It must be an instance of DDPMScheduler.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub. Or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Pipeline for text-guided image super-resolution using Stable Diffusion 2.

Adapted from: https://github.com/huggingface/diffusers/blob/v0.23.1/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py#L70

Generation is performed in batches
Two mark_step() calls were added to support lazy mode
Support for HPU graphs was added
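
To picture what batched generation means in terms of the batch_size and num_images_per_prompt arguments of the call, the sketch below splits a list of prompts into batches in plain Python. The helper name split_into_batches is hypothetical, and how the pipeline itself treats a final batch that is smaller than batch_size is not shown here.

```python
import math
from typing import List


def split_into_batches(prompts: List[str], num_images_per_prompt: int, batch_size: int) -> List[List[str]]:
    """Hypothetical helper: one entry per image, grouped into batches."""
    # Repeat each prompt once per image requested for it.
    expanded = [p for p in prompts for _ in range(num_images_per_prompt)]
    num_batches = math.ceil(len(expanded) / batch_size)
    return [expanded[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]


batches = split_into_batches(
    ["a white cat", "a red fox"], num_images_per_prompt=3, batch_size=4
)
```

Here 2 prompts with 3 images each give 6 images in total, so a batch size of 4 yields two batches, the second holding the remaining 2 images.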

call

( prompt: Union = None image: Union = None num_inference_steps: int = 75 guidance_scale: float = 9.0 noise_level: int = 20 negative_prompt: Union = None num_images_per_prompt: Optional = 1 batch_size: int = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None clip_skip: int = None **kwargs ) → GaudiStableDiffusionPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
image (torch.FloatTensor, PIL.Image.Image, np.ndarray, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image or tensor representing an image batch to be upscaled.
num_inference_steps (int, optional, defaults to 75) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 9.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
batch_size (int, optional, defaults to 1) — The number of images in a batch.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated randomly.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.cross_attention.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
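
The guidance_scale description can be made concrete with the usual classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. In this sketch, lists of floats stand in for the noise-prediction tensors and the helper name apply_guidance is hypothetical; it illustrates the formula only and is not code taken from the pipeline.

```python
from typing import List


def apply_guidance(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Hypothetical helper: move from the unconditional prediction toward
    (and, for guidance_scale > 1, past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# With the call's default of 9.0 the text direction is strongly amplified.
guided = apply_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=9.0)

# With a scale of 1.0 the result equals the text-conditioned prediction.
unguided = apply_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=1.0)
```

This is why a scale of 1 adds nothing over the text-conditioned prediction: the unconditional term cancels out, which makes the negative prompt irrelevant at that setting.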

Returns

GaudiStableDiffusionPipelineOutput or tuple

GaudiStableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents “not-safe-for-work” (nsfw) content, according to the safety_checker.

Function invoked when calling the pipeline for generation.

Examples:

>>> import requests
>>> from PIL import Image
>>> from io import BytesIO
>>> from optimum.habana.diffusers import GaudiStableDiffusionUpscalePipeline
>>> import torch

>>> # load model and scheduler
>>> model_id = "stabilityai/stable-diffusion-x4-upscaler"
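>>> pipeline = GaudiStableDiffusionUpscalePipeline.from_pretrained(
...     model_id,
...     revision="fp16",
...     torch_dtype=torch.bfloat16,
...     use_habana=True,  # run on Gaudi (HPU) instead of CPU
...     use_hpu_graphs=True,
...     gaudi_config="Habana/stable-diffusion",  # Gaudi configuration downloaded from the Hub
... )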

>>> # let's download an image
>>> url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
>>> response = requests.get(url)
>>> low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
>>> low_res_img = low_res_img.resize((128, 128))
>>> prompt = "a white cat"

>>> upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
>>> upscaled_image.save("upsampled_cat.png")
