aMUSEd
aMUSEd is a lightweight text-to-image model based on the MUSE architecture. It is particularly useful in applications that need a lightweight and fast model, such as generating many images quickly at once.
aMUSEd is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Because of its small parameter count and its few-forward-pass generation process, aMUSEd can generate many images quickly. The benefit is most visible at larger batch sizes.
| Model | Params | 
|---|---|
| amused-256 | 603M | 
| amused-512 | 608M | 
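The few-forward-pass generation described above can be pictured with a toy loop. This is only a sketch under stated assumptions, not the library's scheduler: the `MASK` id, the random "model" proposals, and the cosine unmasking schedule are all illustrative stand-ins. It shows the general idea that every position starts masked and several tokens are committed per pass, so all tokens are filled after a small, fixed number of passes.

```python
import math
import random

MASK = -1  # hypothetical id standing in for a masked image token


def toy_parallel_decode(num_tokens=16, num_steps=4, vocab_size=8, seed=0):
    """Fill a fully masked token sequence in `num_steps` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(num_steps):
        # Stand-in for one transformer forward pass: a (token, confidence)
        # proposal for every position. A real model would predict these.
        proposals = [(rng.randrange(vocab_size), rng.random()) for _ in range(num_tokens)]
        # Assumed cosine schedule: fraction of tokens still masked after this pass.
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(math.floor(ratio * num_tokens))
        masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident proposals among the still-masked positions.
        masked_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        num_to_commit = max(0, len(masked_positions) - keep_masked)
        for i in masked_positions[:num_to_commit]:
            tokens[i] = proposals[i][0]
    return tokens


tokens = toy_parallel_decode(num_tokens=16, num_steps=4)
print(sum(t == MASK for t in tokens))  # 0 tokens remain masked after 4 passes
```

Because several tokens are committed per pass, the number of passes stays fixed while the token count grows, which is one way to see why batch generation is cheap.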
AmusedPipeline
class diffusers.AmusedPipeline
( vqvae: VQModel tokenizer: CLIPTokenizer text_encoder: CLIPTextModelWithProjection transformer: UVit2DModel scheduler: AmusedScheduler )
__call__
( prompt: Union = None height: Optional = None width: Optional = None num_inference_steps: int = 12 guidance_scale: float = 10.0 negative_prompt: Union = None num_images_per_prompt: Optional = 1 generator: Optional = None latents: Optional = None prompt_embeds: Optional = None encoder_hidden_states: Optional = None negative_prompt_embeds: Optional = None negative_encoder_hidden_states: Optional = None output_type = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None micro_conditioning_aesthetic_score: int = 6 micro_conditioning_crop_coord: Tuple = (0, 0) temperature: Union = (2, 0) ) → ImagePipelineOutput or tuple
Parameters
-  prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
-  height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
-  width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
-  num_inference_steps (int, optional, defaults to 12) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
-  guidance_scale (float, optional, defaults to 10.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
-  negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
-  num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
-  generator (torch.Generator, optional) — A torch.Generator to make generation deterministic.
-  latents (torch.IntTensor, optional) — Pre-generated tokens representing latent vectors in self.vqvae, to be used as inputs for image generation. If not provided, the starting latents will be completely masked.
-  prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument. A single vector from the pooled and projected final hidden states.
-  encoder_hidden_states (torch.FloatTensor, optional) — Pre-generated penultimate hidden states from the text encoder providing additional text conditioning.
-  negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
-  negative_encoder_hidden_states (torch.FloatTensor, optional) — Analogous to encoder_hidden_states for the positive prompt.
-  output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
-  return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
-  callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
-  callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
-  cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
-  micro_conditioning_aesthetic_score (int, optional, defaults to 6) — The targeted aesthetic score according to the LAION aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  micro_conditioning_crop_coord (Tuple[int], optional, defaults to (0, 0)) — The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  temperature (Union[int, Tuple[int, int], List[int]], optional, defaults to (2, 0)) — Configures the temperature scheduler on self.scheduler; see AmusedScheduler#set_timesteps.
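To make the role of guidance_scale concrete, the following is a minimal sketch of the usual classifier-free guidance blend, assuming the pipeline combines a prompt-conditioned and an unconditional (or negative-prompt) prediction in the standard way. The function name `apply_guidance` and the plain-list "logits" are illustrative and not part of the library.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional predictions (classifier-free guidance)."""
    if guidance_scale <= 1:
        # Guidance disabled: use the prompt-conditioned prediction unchanged.
        return list(cond_logits)
    # Push the prediction away from the unconditional one, toward the prompt.
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.5, -1.0]    # illustrative values only
uncond = [1.0, 0.5, 0.0]
guided = apply_guidance(cond, uncond, guidance_scale=10.0)
print(guided)  # [11.0, 0.5, -10.0]
```

Larger scales exaggerate the difference between the two predictions, which is why very high values can follow the prompt closely while degrading image quality.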
Returns
ImagePipelineOutput or tuple
If return_dict is True, ImagePipelineOutput is returned, otherwise a
tuple is returned where the first element is a list with the generated images.
The call function to the pipeline for generation.
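The interaction between callback and callback_steps can be sketched without loading a model. The loop below is a hypothetical stand-in for the denoising loop (the `run_with_callback` helper, the placeholder latents, and the timestep values are assumptions for illustration); it only demonstrates that the callback receives `(step, timestep, latents)` and fires every callback_steps steps.

```python
def run_with_callback(num_inference_steps, callback=None, callback_steps=1):
    """Stand-in for a denoising loop that reports progress through a callback."""
    latents = [0] * 4  # placeholder for the token latents
    for step in range(num_inference_steps):
        timestep = num_inference_steps - 1 - step  # illustrative timestep value
        # ... a model forward pass and a scheduler step would happen here ...
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)


seen_steps = []


def log_progress(step, timestep, latents):
    # Same argument order as documented: callback(step, timestep, latents).
    seen_steps.append(step)


run_with_callback(12, callback=log_progress, callback_steps=4)
print(seen_steps)  # [0, 4, 8]
```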
Examples:
>>> import torch
>>> from diffusers import AmusedPipeline
>>> pipe = AmusedPipeline.from_pretrained(
...     "amused/amused-512", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
enable_xformers_memory_efficient_attention
( attention_op: Optional = None )
Parameters
-  attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.
Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.
⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.
Examples:
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
disable_xformers_memory_efficient_attention
Disable memory efficient attention from xFormers.
class diffusers.AmusedImg2ImgPipeline
( vqvae: VQModel tokenizer: CLIPTokenizer text_encoder: CLIPTextModelWithProjection transformer: UVit2DModel scheduler: AmusedScheduler )
__call__
( prompt: Union = None image: Union = None strength: float = 0.5 num_inference_steps: int = 12 guidance_scale: float = 10.0 negative_prompt: Union = None num_images_per_prompt: Optional = 1 generator: Optional = None prompt_embeds: Optional = None encoder_hidden_states: Optional = None negative_prompt_embeds: Optional = None negative_encoder_hidden_states: Optional = None output_type = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None micro_conditioning_aesthetic_score: int = 6 micro_conditioning_crop_coord: Tuple = (0, 0) temperature: Union = (2, 0) ) → ImagePipelineOutput or tuple
Parameters
-  prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
-  image (torch.FloatTensor, PIL.Image.Image, np.ndarray, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to be used as the starting point. For both numpy arrays and PyTorch tensors, the expected value range is [0, 1]. If it is a tensor or a list of tensors, the expected shape is (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape is (B, H, W, C) or (H, W, C). It can also accept image latents as image, but latents passed directly are not encoded again.
-  strength (float, optional, defaults to 0.5) — Indicates the extent to transform the reference image. Must be between 0 and 1. image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
-  num_inference_steps (int, optional, defaults to 12) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
-  guidance_scale (float, optional, defaults to 10.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
-  negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
-  num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
-  generator (torch.Generator, optional) — A torch.Generator to make generation deterministic.
-  prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument. A single vector from the pooled and projected final hidden states.
-  encoder_hidden_states (torch.FloatTensor, optional) — Pre-generated penultimate hidden states from the text encoder providing additional text conditioning.
-  negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
-  negative_encoder_hidden_states (torch.FloatTensor, optional) — Analogous to encoder_hidden_states for the positive prompt.
-  output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
-  return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
-  callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
-  callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
-  cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
-  micro_conditioning_aesthetic_score (int, optional, defaults to 6) — The targeted aesthetic score according to the LAION aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  micro_conditioning_crop_coord (Tuple[int], optional, defaults to (0, 0)) — The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  temperature (Union[int, Tuple[int, int], List[int]], optional, defaults to (2, 0)) — Configures the temperature scheduler on self.scheduler; see AmusedScheduler#set_timesteps.
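The relationship between strength and the number of steps actually run can be sketched with a little arithmetic. This assumes the convention common to image-to-image pipelines, where only the last `int(num_inference_steps * strength)` steps of the schedule are executed; the helper name `steps_after_strength` is hypothetical and the exact rounding in the library may differ.

```python
def steps_after_strength(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for a given strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be between 0 and 1, got {strength}")
    # Assumed convention: higher strength means more of the schedule is run.
    steps_to_run = int(num_inference_steps * strength)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run


print(steps_after_strength(12, 0.5))  # (6, 6): skip the first 6 steps, run 6
print(steps_after_strength(12, 1.0))  # (0, 12): full schedule, input image mostly ignored
```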
Returns
ImagePipelineOutput or tuple
If return_dict is True, ImagePipelineOutput is returned, otherwise a
tuple is returned where the first element is a list with the generated images.
The call function to the pipeline for generation.
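The value range and shape conventions for the image argument can be illustrated with NumPy alone. This is a sketch with hypothetical helper names (`to_unit_range`, `hwc_to_bchw`); it shows how 8-bit pixel data maps to the [0, 1] range in (H, W, C) layout expected for arrays, and how the same data would be reordered to the (B, C, H, W) layout expected for tensors.

```python
import numpy as np


def to_unit_range(pixels_uint8):
    """Scale 8-bit pixel values to floats in [0, 1], keeping the (H, W, C) layout."""
    return np.asarray(pixels_uint8, dtype=np.float32) / 255.0


def hwc_to_bchw(image_hwc):
    """Reorder (H, W, C) or (B, H, W, C) data into the (B, C, H, W) tensor layout."""
    if image_hwc.ndim == 3:
        image_hwc = image_hwc[None, ...]  # add a batch dimension
    return np.transpose(image_hwc, (0, 3, 1, 2))


pixels = np.full((64, 48, 3), 128, dtype=np.uint8)  # illustrative gray image
image_hwc = to_unit_range(pixels)
image_bchw = hwc_to_bchw(image_hwc)
print(image_hwc.shape)   # (64, 48, 3)
print(image_bchw.shape)  # (1, 3, 64, 48)
```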
Examples:
>>> import torch
>>> from diffusers import AmusedImg2ImgPipeline
>>> from diffusers.utils import load_image
>>> pipe = AmusedImg2ImgPipeline.from_pretrained(
...     "amused/amused-512", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "winter mountains"
>>> input_image = (
...     load_image(
...         "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains.jpg"
...     )
...     .resize((512, 512))
...     .convert("RGB")
... )
>>> image = pipe(prompt, input_image).images[0]
enable_xformers_memory_efficient_attention
( attention_op: Optional = None )
Parameters
-  attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.
Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.
⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.
Examples:
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
disable_xformers_memory_efficient_attention
Disable memory efficient attention from xFormers.
class diffusers.AmusedInpaintPipeline
( vqvae: VQModel tokenizer: CLIPTokenizer text_encoder: CLIPTextModelWithProjection transformer: UVit2DModel scheduler: AmusedScheduler )
__call__
( prompt: Union = None image: Union = None mask_image: Union = None strength: float = 1.0 num_inference_steps: int = 12 guidance_scale: float = 10.0 negative_prompt: Union = None num_images_per_prompt: Optional = 1 generator: Optional = None prompt_embeds: Optional = None encoder_hidden_states: Optional = None negative_prompt_embeds: Optional = None negative_encoder_hidden_states: Optional = None output_type = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None micro_conditioning_aesthetic_score: int = 6 micro_conditioning_crop_coord: Tuple = (0, 0) temperature: Union = (2, 0) ) → ImagePipelineOutput or tuple
Parameters
-  prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
-  image (torch.FloatTensor, PIL.Image.Image, np.ndarray, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to be used as the starting point. For both numpy arrays and PyTorch tensors, the expected value range is [0, 1]. If it is a tensor or a list of tensors, the expected shape is (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape is (B, H, W, C) or (H, W, C). It can also accept image latents as image, but latents passed directly are not encoded again.
-  mask_image (torch.FloatTensor, PIL.Image.Image, np.ndarray, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to mask image. White pixels in the mask are repainted while black pixels are preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a numpy array or PyTorch tensor, it should contain one color channel (L) instead of 3, so the expected shape for a PyTorch tensor is (B, 1, H, W), (B, H, W), (1, H, W), or (H, W), and for a numpy array (B, H, W, 1), (B, H, W), (H, W, 1), or (H, W).
-  strength (float, optional, defaults to 1.0) — Indicates the extent to transform the reference image. Must be between 0 and 1. image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
-  num_inference_steps (int, optional, defaults to 12) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
-  guidance_scale (float, optional, defaults to 10.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
-  negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
-  num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
-  generator (torch.Generator, optional) — A torch.Generator to make generation deterministic.
-  prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument. A single vector from the pooled and projected final hidden states.
-  encoder_hidden_states (torch.FloatTensor, optional) — Pre-generated penultimate hidden states from the text encoder providing additional text conditioning.
-  negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
-  negative_encoder_hidden_states (torch.FloatTensor, optional) — Analogous to encoder_hidden_states for the positive prompt.
-  output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
-  return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
-  callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
-  callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
-  cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
-  micro_conditioning_aesthetic_score (int, optional, defaults to 6) — The targeted aesthetic score according to the LAION aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  micro_conditioning_crop_coord (Tuple[int], optional, defaults to (0, 0)) — The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
-  temperature (Union[int, Tuple[int, int], List[int]], optional, defaults to (2, 0)) — Configures the temperature scheduler on self.scheduler; see AmusedScheduler#set_timesteps.
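The mask_image conventions (white is repainted, black is preserved, one channel only) can be sketched with NumPy. The helper names and the 0.5 binarization threshold below are assumptions for illustration; the luminance weights are the standard ITU-R 601 luma coefficients used for RGB-to-grayscale conversion.

```python
import numpy as np

# ITU-R 601 luma weights for converting RGB to a single luminance channel.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rectangle_mask(height, width, top, left, bottom, right):
    """Build an (H, W) mask: 1.0 (white) inside the box is repainted, 0.0 is kept."""
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


def to_single_channel(mask_rgb, threshold=0.5):
    """Collapse an (H, W, 3) mask in [0, 1] to a binary (H, W) mask."""
    luminance = mask_rgb @ LUMA_WEIGHTS
    # The threshold is an illustrative choice, not a documented library value.
    return (luminance > threshold).astype(np.float32)


mask = rectangle_mask(512, 512, top=128, left=128, bottom=384, right=384)
print(mask.shape, int(mask.sum()))  # (512, 512) 65536

rgb_mask = np.stack([mask] * 3, axis=-1)  # a 3-channel version of the same mask
single = to_single_channel(rgb_mask)
print(single.shape)  # (512, 512)
```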
Returns
ImagePipelineOutput or tuple
If return_dict is True, ImagePipelineOutput is returned, otherwise a
tuple is returned where the first element is a list with the generated images.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import AmusedInpaintPipeline
>>> from diffusers.utils import load_image
>>> pipe = AmusedInpaintPipeline.from_pretrained(
...     "amused/amused-512", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "fall mountains"
>>> input_image = (
...     load_image(
...         "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1.jpg"
...     )
...     .resize((512, 512))
...     .convert("RGB")
... )
>>> mask = (
...     load_image(
...         "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1_mask.png"
...     )
...     .resize((512, 512))
...     .convert("L")
... )
>>> pipe(prompt, input_image, mask).images[0].save("out.png")
enable_xformers_memory_efficient_attention
( attention_op: Optional = None )
Parameters
-  attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.
Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.
⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.
Examples:
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
disable_xformers_memory_efficient_attention
Disable memory efficient attention from xFormers.