You are viewing version v1.13.1. A newer version, v1.23.3, is available.

GaudiStableDiffusionPipeline

The GaudiStableDiffusionPipeline class performs text-to-image generation on HPUs. It inherits from the GaudiDiffusionPipeline class, which is the parent of every kind of diffuser pipeline.

To get the most out of it, pair it with a scheduler that is optimized for HPUs, such as GaudiDDIMScheduler.
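As a minimal sketch of this pairing, the helpers below load the pipeline together with GaudiDDIMScheduler and run a batched generation. The model ID and the Gaudi configuration name are assumptions chosen for illustration (neither appears on this page), and the import is deferred so that the helpers can be defined on a machine without the Gaudi software stack.

```python
MODEL_NAME = "runwayml/stable-diffusion-v1-5"  # assumption: any Stable Diffusion checkpoint
GAUDI_CONFIG = "Habana/stable-diffusion"       # assumption: a Gaudi config hosted on the Hub


def load_pipeline(model_name=MODEL_NAME, gaudi_config=GAUDI_CONFIG):
    """Load the pipeline on Gaudi with an HPU-optimized scheduler."""
    # Deferred import: needs optimum-habana and a Gaudi device when called.
    from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

    scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
    return GaudiStableDiffusionPipeline.from_pretrained(
        model_name,
        scheduler=scheduler,
        use_habana=True,       # run on Gaudi rather than CPU
        use_hpu_graphs=True,   # enable HPU graphs
        gaudi_config=gaudi_config,
    )


def generate(pipeline, prompts, num_images_per_prompt=4, batch_size=2):
    """Run text-to-image generation and return the generated images."""
    outputs = pipeline(
        prompt=prompts,
        num_images_per_prompt=num_images_per_prompt,
        batch_size=batch_size,
    )
    return outputs.images
```

On a Gaudi machine, a call such as generate(load_pipeline(), ["An image of a squirrel in Picasso style"]) would be expected to return a list of PIL images, given the default output_type of "pil".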

GaudiStableDiffusionPipeline

class optimum.habana.diffusers.GaudiStableDiffusionPipeline


( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: typing.Union[str, optimum.habana.transformers.gaudi_configuration.GaudiConfig] = None bf16_full_eval: bool = False )

Parameters

  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
  • text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
  • tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
  • unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
  • scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
  • safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
  • feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.
  • use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
  • use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
  • gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be either a string, in which case the configuration is downloaded from the Hub, or a previously initialized GaudiConfig.
  • bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Extends the StableDiffusionPipeline class:

  • Generation is performed in batches
  • Two mark_step() calls were added to support lazy mode
  • Support for HPU graphs was added
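The batching behaviour can be sketched with plain arithmetic. The helper below is hypothetical (it is not part of the library) and assumes that the total number of images is the number of prompts times num_images_per_prompt, split into chunks of batch_size. How the pipeline treats an incomplete final batch (for example, by padding it) is not described on this page, so the sketch only reports the sizes.

```python
import math


def plan_batches(num_prompts, num_images_per_prompt=1, batch_size=1):
    """Return the total image count and the size of each batch (hypothetical helper)."""
    total = num_prompts * num_images_per_prompt
    num_batches = math.ceil(total / batch_size)
    # Every batch is full except possibly the last one.
    sizes = [min(batch_size, total - i * batch_size) for i in range(num_batches)]
    return total, sizes


total, sizes = plan_batches(num_prompts=1, num_images_per_prompt=16, batch_size=4)
print(total, sizes)
```

Choosing a batch_size that divides the total evenly avoids a partial final batch altogether.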

__call__


( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 batch_size: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None guidance_rescale: float = 0.0 ) GaudiStableDiffusionPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated images.
  • width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated images.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • batch_size (int, optional, defaults to 1) — The number of images in a batch.
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated randomly.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionPipelineOutput instead of a plain tuple.
  • callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
  • cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.cross_attention.
  • guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR.
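The roles of guidance_scale and guidance_rescale can be illustrated with a small sketch on plain lists of floats. It assumes the usual classifier-free guidance combination of an unconditional and a text-conditioned noise prediction, followed by the standard-deviation rescaling from the paper cited above. The function names are hypothetical, and a real implementation works on tensors and computes the standard deviation per sample.

```python
import statistics


def apply_guidance(noise_uncond, noise_text, guidance_scale=7.5):
    """Classifier-free guidance: move from the unconditional prediction toward the text one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


def rescale_guidance(noise_cfg, noise_text, guidance_rescale=0.0):
    """Blend the guided prediction with a copy rescaled to the text prediction's std."""
    if guidance_rescale == 0.0:
        return list(noise_cfg)  # rescaling disabled
    std_text = statistics.pstdev(noise_text)
    std_cfg = statistics.pstdev(noise_cfg)
    rescaled = [x * std_text / std_cfg for x in noise_cfg]
    return [
        guidance_rescale * r + (1.0 - guidance_rescale) * x
        for r, x in zip(rescaled, noise_cfg)
    ]


guided = apply_guidance([0.0, 0.0], [1.0, 2.0], guidance_scale=7.5)
print(guided, rescale_guidance(guided, [1.0, 2.0], guidance_rescale=0.7))
```

With guidance_scale equal to 1 the guided prediction is simply the text-conditioned one, which is why guidance only takes effect above 1.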

Returns

GaudiStableDiffusionPipelineOutput or tuple

GaudiStableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents “not-safe-for-work” (nsfw) content, according to the safety_checker.

Function invoked when calling the pipeline for generation.
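The callback parameter described above takes a function with the signature callback(step, timestep, latents). The sketch below builds one that records progress. The factory name is hypothetical, and the latents are ignored to keep the example cheap.

```python
def make_step_logger(history):
    """Return a callback that appends (step, timestep) pairs to `history`."""

    def callback(step, timestep, latents):
        # `latents` holds the current latent tensor; it is not inspected here.
        history.append((step, int(timestep)))

    return callback
```

Such a callback could then be passed as pipeline(prompt=..., callback=make_step_logger(history), callback_steps=10) to record progress every 10 steps.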

GaudiDiffusionPipeline

class optimum.habana.diffusers.GaudiDiffusionPipeline


( use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: typing.Union[str, optimum.habana.transformers.gaudi_configuration.GaudiConfig] = None bf16_full_eval: bool = False )

Parameters

  • use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
  • use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
  • gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be either a string, in which case the configuration is downloaded from the Hub, or a previously initialized GaudiConfig.
  • bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision but can harm generated images.

Extends the DiffusionPipeline class:

  • The pipeline is initialized on Gaudi if use_habana=True.
  • The pipeline’s Gaudi configuration is saved and pushed to the hub.

from_pretrained


( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] **kwargs )

More information here.

save_pretrained


( save_directory: typing.Union[str, os.PathLike] safe_serialization: bool = False variant: typing.Optional[str] = None )

Parameters

  • save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn’t exist.
  • safe_serialization (bool, optional, defaults to False) — Whether to save the model using safetensors or the traditional PyTorch way (that uses pickle).
  • variant (str, optional) — If specified, weights are saved in the format pytorch_model.<variant>.bin.

Save the pipeline and Gaudi configurations. More information here.

GaudiDDIMScheduler

class optimum.habana.diffusers.GaudiDDIMScheduler


( num_train_timesteps: int = 1000 beta_start: float = 0.0001 beta_end: float = 0.02 beta_schedule: str = 'linear' trained_betas: typing.Union[numpy.ndarray, typing.List[float], NoneType] = None clip_sample: bool = True set_alpha_to_one: bool = True steps_offset: int = 0 prediction_type: str = 'epsilon' thresholding: bool = False dynamic_thresholding_ratio: float = 0.995 clip_sample_range: float = 1.0 sample_max_value: float = 1.0 timestep_spacing: str = 'leading' rescale_betas_zero_snr: bool = False )

Parameters

  • num_train_timesteps (int) — number of diffusion steps used to train the model.
  • beta_start (float) — the starting beta value of inference.
  • beta_end (float) — the final beta value.
  • beta_schedule (str) — the beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear, scaled_linear, or squaredcos_cap_v2.
  • trained_betas (np.ndarray, optional) — option to pass an array of betas directly to the constructor to bypass beta_start, beta_end etc.
  • clip_sample (bool, default True) — option to clip predicted sample between -1 and 1 for numerical stability.
  • clip_sample_range (float, default 1.0) — the maximum magnitude for sample clipping. Valid only when clip_sample=True.
  • set_alpha_to_one (bool, default True) — each diffusion step uses the value of alphas product at that step and at the previous one. For the final step there is no previous alpha. When this option is True the previous alpha product is fixed to 1, otherwise it uses the value of alpha at step 0.
  • steps_offset (int, default 0) — an offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False, to make the last step use step 0 for the previous alpha product, as done in stable diffusion.
  • prediction_type (str, default epsilon, optional) — prediction type of the scheduler function, one of epsilon (predicting the noise of the diffusion process), sample (directly predicting the noisy sample) or v_prediction (see section 2.4 of https://imagen.research.google/video/paper.pdf).
  • thresholding (bool, default False) — whether to use the “dynamic thresholding” method (introduced by Imagen, https://arxiv.org/abs/2205.11487). Note that the thresholding method is unsuitable for latent-space diffusion models (such as stable-diffusion).
  • dynamic_thresholding_ratio (float, default 0.995) — the ratio for the dynamic thresholding method. Default is 0.995, the same as Imagen (https://arxiv.org/abs/2205.11487). Valid only when thresholding=True.
  • sample_max_value (float, default 1.0) — the threshold value for dynamic thresholding. Valid only when thresholding=True.
  • timestep_spacing (str, default "leading") — The way the timesteps should be scaled. Refer to Table 2. of Common Diffusion Noise Schedules and Sample Steps are Flawed for more information.
  • rescale_betas_zero_snr (bool, default False) — whether to rescale the betas to have zero terminal SNR (proposed by https://arxiv.org/pdf/2305.08891.pdf). This can enable the model to generate very bright and dark samples instead of limiting it to samples with medium brightness. Loosely related to --offset_noise.

Extends Diffusers’ DDIMScheduler to run optimally on Gaudi:

  • All time-dependent parameters are generated at the beginning
  • At each time step, tensors are rolled to update the values of the time-dependent parameters

step


( model_output: FloatTensor sample: FloatTensor eta: float = 0.0 use_clipped_model_output: bool = False generator = None variance_noise: typing.Optional[torch.FloatTensor] = None return_dict: bool = True ) diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

Parameters

  • model_output (torch.FloatTensor) — direct output from learned diffusion model.
  • sample (torch.FloatTensor) — current instance of sample being created by diffusion process.
  • eta (float) — weight of noise for added noise in diffusion step.
  • use_clipped_model_output (bool) — if True, compute “corrected” model_output from the clipped predicted original sample. Necessary because the predicted original sample is clipped to [-1, 1] when self.config.clip_sample is True. If no clipping has happened, the “corrected” model_output coincides with the one provided as input and use_clipped_model_output has no effect.
  • generator — random number generator.
  • variance_noise (torch.FloatTensor) — instead of generating noise for the variance using generator, we can directly provide the noise for the variance itself. This is useful for methods such as CycleDiffusion. (https://arxiv.org/abs/2210.05559)
  • return_dict (bool) — whether to return a DDIMSchedulerOutput instead of a plain tuple.

Returns

diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple

diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

Predict the sample at the previous timestep by reversing the SDE. Core function to propagate the diffusion process from the learned model outputs (most often the predicted noise).
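For intuition, a single DDIM update can be sketched on plain lists of floats. This is a simplified illustration under stated assumptions: the model predicts noise (prediction_type "epsilon"), sample clipping and thresholding are omitted, and the alpha products are passed in directly. The function name and the noise argument are hypothetical.

```python
import math
import random


def ddim_step(model_output, sample, alpha_prod_t, alpha_prod_t_prev, eta=0.0, noise=None):
    """One DDIM update for an epsilon-predicting model (no clipping or thresholding)."""
    # Variance of the added noise; it vanishes entirely when eta is 0.
    variance = (1 - alpha_prod_t_prev) / (1 - alpha_prod_t) * (1 - alpha_prod_t / alpha_prod_t_prev)
    sigma = eta * math.sqrt(variance)
    prev_sample = []
    for i, (eps, x_t) in enumerate(zip(model_output, sample)):
        # Predicted clean sample x_0 from the current sample and predicted noise.
        x0 = (x_t - math.sqrt(1 - alpha_prod_t) * eps) / math.sqrt(alpha_prod_t)
        # Deterministic direction pointing back toward x_t.
        direction = math.sqrt(1 - alpha_prod_t_prev - sigma**2) * eps
        x_prev = math.sqrt(alpha_prod_t_prev) * x0 + direction
        if eta > 0:
            z = noise[i] if noise is not None else random.gauss(0.0, 1.0)
            x_prev += sigma * z
        prev_sample.append(x_prev)
    return prev_sample
```

With eta=0.0 the update is deterministic: if the model output equals the true noise, the step returns the same clean sample re-noised to the previous, lower noise level.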