optimum-onnx documentation
ONNX Runtime Diffusion Pipelines
Generic ORT Diffusion Pipeline classes
The following classes are available for instantiating a diffusion pipeline class without needing to specify the task or architecture.
ORTDiffusionPipeline
class optimum.onnxruntime.ORTDiffusionPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
Base class for all ONNX Runtime Pipelines.
ORTDiffusionPipeline stores all components (models, schedulers, and processors) for diffusion pipelines and provides methods for exporting, loading, downloading and saving models. It also includes methods to:
- move all ONNX Runtime sessions to the device of your choice
- enable/disable the progress bar for the denoising iteration
- handle ONNX Runtime io binding if used
Class attributes:
- config_name (str) — The configuration filename that stores the class and module names of all the diffusion pipeline’s components.
- task (str) — A string that identifies the pipeline’s task.
- library (str) — The library the pipeline is compatible with.
- auto_model_class (Type[DiffusionPipeline]) — The corresponding/equivalent Diffusers pipeline class.
from_pretrained
< source >( model_name_or_path: str | Path export: bool | None = None provider: str = 'CPUExecutionProvider' providers: Sequence[str] | None = None provider_options: Sequence[dict[str, Any]] | dict[str, Any] | None = None session_options: SessionOptions | None = None use_io_binding: bool | None = None **kwargs ) → ORTDiffusionPipeline
Parameters
- model_name_or_path (str or os.PathLike) — Path to a folder containing the model files or a hub repository id.
- export (bool, optional, defaults to None) — Whether to export the model from Diffusers to ONNX. If left as None, the model is exported only if no ONNX files are found in the model_name_or_path folder. If set to True, the model is always exported. If set to False, the model is never exported.
- provider (str, optional, defaults to "CPUExecutionProvider") — The execution provider for ONNX Runtime. Can be "CUDAExecutionProvider", "DmlExecutionProvider", etc.
- providers (Sequence[str], optional) — A list of execution providers for ONNX Runtime. Overrides provider.
- provider_options (Union[Sequence[Dict[str, Any]], Dict[str, Any]], optional) — Options for each execution provider. Can be a single dictionary for the first provider or a list of dictionaries for each provider. The order of the dictionaries should match the order of the providers.
- session_options (SessionOptions, optional) — Options for the ONNX Runtime session. Can be used to set optimization levels, graph optimization, etc.
- use_io_binding (bool, optional) — Whether to use IOBinding for the ONNX Runtime session. If set to True, it will use IOBinding for input and output tensors.
- **kwargs — Can include the following:
  - Export arguments (e.g., slim, dtype, device, no_dynamic_axes, etc.).
  - Hugging Face Hub arguments (e.g., revision, cache_dir, force_download, etc.).
  - Preloaded models or sessions for the different components of the pipeline (e.g., vae_encoder_session, vae_decoder_session, unet_session, transformer_session, image_encoder, safety_checker, etc.).
Returns
ORTDiffusionPipeline
The loaded pipeline with ONNX Runtime sessions.
Instantiates an ORTDiffusionPipeline with ONNX Runtime sessions from a pretrained pipeline repo or directory.
This method can be used to export a diffusion pipeline to ONNX and/or load a pipeline with ONNX Runtime from a repo or a directory.
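As a rough sketch of how the provider-related arguments fit together, the helper below assembles keyword arguments for from_pretrained according to the rules stated in the parameter list: providers takes precedence over provider, and a list of provider_options must follow the order of the providers. The names build_load_kwargs and load_pipeline are hypothetical helpers introduced only for this illustration; they are not part of optimum-onnx.

```python
from typing import Any, Optional, Sequence, Union

ProviderOptions = Union[dict, Sequence[dict]]


def build_load_kwargs(
    provider: str = "CPUExecutionProvider",
    providers: Optional[Sequence[str]] = None,
    provider_options: Optional[ProviderOptions] = None,
    export: Optional[bool] = None,
    use_io_binding: Optional[bool] = None,
) -> dict:
    """Hypothetical helper: collect keyword arguments for from_pretrained."""
    kwargs: dict = {"export": export, "use_io_binding": use_io_binding}
    if providers is not None:
        # `providers` overrides `provider`, so only one of the two is forwarded.
        kwargs["providers"] = list(providers)
        n_providers = len(providers)
    else:
        kwargs["provider"] = provider
        n_providers = 1
    if provider_options is not None:
        # A single dict applies to the first provider; a list needs one dict
        # per provider, in the same order as the providers.
        if not isinstance(provider_options, dict) and len(provider_options) != n_providers:
            raise ValueError("expected one options dict per provider")
        kwargs["provider_options"] = provider_options
    return kwargs


def load_pipeline(model_name_or_path: str, **kwargs: Any):
    """Load a pipeline; requires optimum-onnx and access to the model files."""
    from optimum.onnxruntime import ORTDiffusionPipeline  # imported lazily

    return ORTDiffusionPipeline.from_pretrained(
        model_name_or_path, **build_load_kwargs(**kwargs)
    )
```

For instance, load_pipeline("stable-diffusion-v1-5/stable-diffusion-v1-5", providers=["CUDAExecutionProvider", "CPUExecutionProvider"], provider_options=[{"device_id": 0}, {}]) would request CUDA first with a CPU fallback, assuming a GPU build of ONNX Runtime is installed.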
ORTPipelineForText2Image
ORTPipelineForText2Image is a generic pipeline class that instantiates a text-to-image pipeline class. The specific underlying pipeline class is selected automatically by the ORTPipelineForText2Image.from_pretrained or ORTPipelineForText2Image.from_pipe methods.
This class cannot be instantiated using __init__() (it throws an error).
Class attributes:
- config_name (str) — The configuration filename that stores the class and module names of all the diffusion pipeline’s components.
- auto_model_class (Type[DiffusionPipeline]) — The corresponding/equivalent Diffusers pipeline class.
- ort_pipelines_mapping (OrderedDict) — The mapping between the model names/architectures and the corresponding ORT pipeline class.
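The pattern described above, a class that refuses direct construction and instead picks a concrete pipeline class from a mapping, can be illustrated with a toy sketch. Everything in it is hypothetical: the Toy* classes, the architecture keys, and the choice of RuntimeError are stand-ins, and the architecture is passed explicitly here rather than resolved the way the library resolves it.

```python
from collections import OrderedDict


class ToySDPipeline:
    """Stand-in for a concrete pipeline class."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id


class ToySDXLPipeline(ToySDPipeline):
    """Stand-in for a second concrete pipeline class."""


class ToyPipelineForText2Image:
    # Maps an architecture name to the concrete class to build.
    ort_pipelines_mapping = OrderedDict(
        [
            ("stable-diffusion", ToySDPipeline),
            ("stable-diffusion-xl", ToySDXLPipeline),
        ]
    )

    def __init__(self, *args, **kwargs) -> None:
        # Direct construction is refused; callers go through from_pretrained.
        raise RuntimeError("use from_pretrained() instead of __init__()")

    @classmethod
    def from_pretrained(cls, model_id: str, architecture: str):
        try:
            pipeline_cls = cls.ort_pipelines_mapping[architecture]
        except KeyError:
            raise ValueError(f"unsupported architecture: {architecture!r}") from None
        # Returns an instance of the concrete class, not of the generic one.
        return pipeline_cls(model_id)


pipe = ToyPipelineForText2Image.from_pretrained("some/model", "stable-diffusion-xl")
print(type(pipe).__name__)
```

The object returned is an instance of the concrete class, so the generic class acts only as a factory.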
ORTPipelineForImage2Image
ORTPipelineForImage2Image is a generic pipeline class that instantiates an image-to-image pipeline class. The specific underlying pipeline class is selected automatically by the ORTPipelineForImage2Image.from_pretrained or ORTPipelineForImage2Image.from_pipe methods.
This class cannot be instantiated using __init__() (it throws an error).
Class attributes:
- config_name (str) — The configuration filename that stores the class and module names of all the diffusion pipeline’s components.
- auto_model_class (Type[DiffusionPipeline]) — The corresponding/equivalent Diffusers pipeline class.
- ort_pipelines_mapping (OrderedDict) — The mapping between the model names/architectures and the corresponding ORT pipeline class.
ORTPipelineForInpainting
ORTPipelineForInpainting is a generic pipeline class that instantiates an inpainting pipeline class. The specific underlying pipeline class is selected automatically by the ORTPipelineForInpainting.from_pretrained or ORTPipelineForInpainting.from_pipe methods.
This class cannot be instantiated using __init__() (it throws an error).
Class attributes:
- config_name (str) — The configuration filename that stores the class and module names of all the diffusion pipeline’s components.
- auto_model_class (Type[DiffusionPipeline]) — The corresponding/equivalent Diffusers pipeline class.
- ort_pipelines_mapping (OrderedDict) — The mapping between the model names/architectures and the corresponding ORT pipeline class.
Supported ORT Diffusion Pipeline classes
The following classes are available for instantiating a diffusion pipeline class for a specific task and architecture.
ORTStableDiffusionPipeline
class optimum.onnxruntime.ORTStableDiffusionPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-to-image generation using Stable Diffusion and corresponding to [StableDiffusionPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline).
This Pipeline inherits from ORTDiffusionPipeline and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
- height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
- width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
- negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the schedulers.DDIMScheduler, and is ignored in other schedulers.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
- guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
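To make the role of guidance_scale concrete, here is a minimal sketch of the classifier-free guidance combination as it is commonly applied in Diffusers' Stable Diffusion pipelines; that the ONNX Runtime pipeline applies the same formula is an assumption. The function name apply_guidance and the two-element lists standing in for noise predictions are purely illustrative.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # prediction for the empty or negative prompt
text = [1.0, 3.0]    # prediction for the text prompt

print(apply_guidance(uncond, text, 1.0))  # [1.0, 3.0]
print(apply_guidance(uncond, text, 7.5))  # [7.5, 16.0]
```

With a scale of 1 the result equals the text-conditioned prediction, so the unconditional (negative prompt) branch has no effect; larger values push the result further away from the unconditional prediction.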
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple
If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusionPipeline
>>> pipe = ORTStableDiffusionPipeline.from_pretrained(
... "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
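The callback_on_step_end parameter of __call__ takes a callable with the signature given in the parameter list. The sketch below shows one possible shape for such a callback; StepRecorder is a hypothetical name, and returning the callback_kwargs dictionary follows the convention used by Diffusers pipelines, which is assumed to carry over here. The loop at the end only simulates two denoising steps so the snippet runs without a model.

```python
from typing import Any, Dict, List, Tuple


class StepRecorder:
    """Hypothetical callback that records each denoising step it is called for."""

    def __init__(self) -> None:
        self.history: List[Tuple[int, int, List[str]]] = []

    def __call__(
        self, pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
    ) -> Dict[str, Any]:
        # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs.
        self.history.append((step, int(timestep), sorted(callback_kwargs)))
        # Hand the dictionary back unchanged when nothing should be modified.
        return callback_kwargs


recorder = StepRecorder()
# Simulate two calls the way a denoising loop would make them (no model needed).
for step, timestep in enumerate([981, 961]):
    recorder(None, step, timestep, {"latents": object()})
print(recorder.history)
```

With a loaded pipeline, the recorder would be passed as pipe(prompt, callback_on_step_end=recorder, callback_on_step_end_tensor_inputs=["latents"]).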
ORTStableDiffusionImg2ImgPipeline
class optimum.onnxruntime.ORTStableDiffusionImg2ImgPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image-to-image generation using Stable Diffusion and corresponding to [StableDiffusionImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img#diffusers.StableDiffusionImg2ImgPipeline).
This Pipeline inherits from ORTDiffusionPipeline and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.8 num_inference_steps: typing.Optional[int] = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: typing.Optional[float] = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: int = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
- image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to be used as the starting point. For both numpy array and pytorch tensor, the expected value range is between [0, 1]. If it’s a tensor or a list of tensors, the expected shape should be (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape should be (B, H, W, C) or (H, W, C). It can also accept image latents as image, but if passing latents directly it is not encoded again.
- strength (float, optional, defaults to 0.8) — Indicates the extent to which the reference image is transformed. Must be between 0 and 1. image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated by strength.
- timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
- negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the schedulers.DDIMScheduler, and is ignored in other schedulers.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
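The statement that num_inference_steps is modulated by strength can be made concrete with a small calculation. The sketch below mirrors how Diffusers' Stable Diffusion image-to-image pipeline derives the number of steps that actually run; that the ONNX Runtime pipeline uses the same arithmetic is an assumption, and effective_steps is a hypothetical helper name.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Only the last `init_timestep` steps of the schedule are executed.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


print(effective_steps(50, 0.8))   # 40
print(effective_steps(50, 0.75))  # 37
print(effective_steps(50, 1.0))   # 50
```

Lower strength therefore both preserves more of the input image and shortens the denoising loop.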
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple
If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO
>>> from optimum.onnxruntime import ORTStableDiffusionImg2ImgPipeline
>>> device = "cuda"
>>> model_id_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
>>> pipe = ORTStableDiffusionImg2ImgPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)
>>> url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
>>> response = requests.get(url)
>>> init_image = Image.open(BytesIO(response.content)).convert("RGB")
>>> init_image = init_image.resize((768, 512))
>>> prompt = "A fantasy landscape, trending on artstation"
>>> images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images
>>> images[0].save("fantasy_landscape.png")
ORTStableDiffusionInpaintPipeline
class optimum.onnxruntime.ORTStableDiffusionInpaintPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image inpainting using Stable Diffusion and corresponding to [StableDiffusionInpaintPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/inpaint#diffusers.StableDiffusionInpaintPipeline).
This Pipeline inherits from ORTDiffusionPipeline and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: Tensor = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 1.0 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: int = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide image generation. If not defined, you need to passprompt_embeds
. - image (
torch.Tensor
,PIL.Image.Image
,np.ndarray
,List[torch.Tensor]
,List[PIL.Image.Image]
, orList[np.ndarray]
) —Image
, numpy array or tensor representing an image batch to be inpainted (which parts of the image to be masked out withmask_image
and repainted according toprompt
). For both numpy array and pytorch tensor, the expected value range is between[0, 1]
If it’s a tensor or a list or tensors, the expected shape should be(B, C, H, W)
or(C, H, W)
. If it is a numpy array or a list of arrays, the expected shape should be(B, H, W, C)
or(H, W, C)
It can also accept image latents asimage
, but if passing latents directly it is not encoded again. - mask_image (
torch.Tensor
,PIL.Image.Image
,np.ndarray
,List[torch.Tensor]
,List[PIL.Image.Image]
, orList[np.ndarray]
) —Image
, numpy array or tensor representing an image batch to maskimage
. White pixels in the mask are repainted while black pixels are preserved. Ifmask_image
is a PIL image, it is converted to a single channel (luminance) before use. If it’s a numpy array or pytorch tensor, it should contain one color channel (L) instead of 3, so the expected shape for pytorch tensor would be(B, 1, H, W)
,(B, H, W)
,(1, H, W)
,(H, W)
. And for numpy array would be for(B, H, W, 1)
,(B, H, W)
,(H, W, 1)
, or(H, W)
. - height (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The height in pixels of the generated image. - width (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The width in pixels of the generated image. - padding_mask_crop (
int
, optional, defaults toNone
) — The size of margin in the crop to be applied to the image and masking. IfNone
, no crop is applied to image and mask_image. Ifpadding_mask_crop
is notNone
, it will first find a rectangular region with the same aspect ration of the image and contains all masked area, and then expand that area based onpadding_mask_crop
. The image and mask_image will then be cropped based on the expanded area before resizing to the original image size for inpainting. This is useful when the masked area is small while the image is large and contain information irrelevant for inpainting, such as background. - strength (
float
, optional, defaults to 1.0) — Indicates extent to transform the referenceimage
. Must be between 0 and 1.image
is used as a starting point and more noise is added the higher thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise is maximum and the denoising process runs for the full number of iterations specified innum_inference_steps
. A value of 1 essentially ignoresimage
. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated bystrength
. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process with schedulers which support atimesteps
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. Must be in descending order. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - guidance_scale (
float
, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the textprompt
at the expense of lower image quality. Guidance scale is enabled whenguidance_scale > 1
. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to passnegative_prompt_embeds
instead. Ignored when not using guidance (guidance_scale < 1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the~schedulers.DDIMScheduler
, and is ignored in other schedulers. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided,negative_prompt_embeds
are generated from thenegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters. Each element should be a tensor of shape(batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose betweenPIL.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined inself.processor
. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You can only include variables listed in the ._callback_tensor_inputs
attribute of your pipeline class.
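As a minimal sketch of the callback contract described above, the function below follows the documented `callback_on_step_end(self, step, timestep, callback_kwargs)` signature. The names `log_latents` and `seen_steps` are hypothetical, and returning `callback_kwargs` assumes the Diffusers convention that the pipeline reads tensors back from the returned dictionary.

```python
# Hypothetical step-end callback; `seen_steps` exists only for illustration.
seen_steps = []


def log_latents(pipe, step, timestep, callback_kwargs):
    """Record which tensors were passed at each denoising step."""
    seen_steps.append((step, timestep, sorted(callback_kwargs)))
    # Return the dict so the pipeline can read back (possibly modified) tensors.
    return callback_kwargs


# Stand-in invocation; a real pipeline would call this once per denoising step.
result = log_latents(None, 0, 981, {"latents": [0.0, 0.1]})
```

With a loaded pipeline, this would typically be passed as `callback_on_step_end=log_latents` together with `callback_on_step_end_tensor_inputs=["latents"]`.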
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
If return_dict
is True
, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
is returned,
otherwise a tuple
is returned where the first element is a list with the generated images and the
second element is a list of bool
values indicating whether the corresponding generated image contains
“not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> import PIL.Image
>>> import requests
>>> import torch
>>> from io import BytesIO
>>> from optimum.onnxruntime import ORTStableDiffusionInpaintPipeline
>>> def download_image(url):
...     response = requests.get(url)
...     return PIL.Image.open(BytesIO(response.content)).convert("RGB")
>>> img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
>>> mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
>>> init_image = download_image(img_url).resize((512, 512))
>>> mask_image = download_image(mask_url).resize((512, 512))
>>> pipe = ORTStableDiffusionInpaintPipeline.from_pretrained(
...     "stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
>>> image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
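The relationship between `strength` and `num_inference_steps` described in the parameter list can be sketched in plain Python. This assumes the pipeline follows the Diffusers convention for trimming the schedule; the helper name `effective_steps` is hypothetical and is not part of the library.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Steps of noise added to the input image, capped at the full schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first scheduler timestep that is actually executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


full = effective_steps(50, 1.0)     # strength 1: the whole schedule runs
partial = effective_steps(50, 0.5)  # strength 0.5: only the second half runs
```

Under this assumption, lowering `strength` shortens inference and keeps the result closer to the input image.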
ORTStableDiffusionXLPipeline
class optimum.onnxruntime.ORTStableDiffusionXLPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered pipeline for text-to-image generation using Stable Diffusion XL, corresponding to [StableDiffusionXLPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None 
callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds
instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent to thetokenizer_2
andtext_encoder_2
. If not defined,prompt
is used in both text encoders. - height (
int
, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions. - width (
int
, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process with schedulers which support atimesteps
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. Must be in descending order. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - denoising_end (
float
, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to complete before it is intentionally terminated early. As a result, the returned sample still retains a substantial amount of noise, as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter is best used when this pipeline forms part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output. - guidance_scale (
float
, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used in both text encoders. - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies toschedulers.DDIMScheduler
, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s) to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters. Each element should be a tensor of shape(batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose between PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - guidance_rescale (
float
, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale
is defined as φ
in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR. - original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — Iforiginal_size
is not the same astarget_size
the image will appear to be down- or upsampled.original_size
defaults to(height, width)
if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) —crops_coords_top_left
can be used to generate an image that appears to be “cropped” from the positioncrops_coords_top_left
downwards. Favorable, well-centered images are usually achieved by settingcrops_coords_top_left
to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — For most cases,target_size
should be set to the desired height and width of the generated image. If not specified it will default to(height, width)
. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size
for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You can only include variables listed in the ._callback_tensor_inputs
attribute of your pipeline class.
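The SDXL micro-conditioning arguments above (`original_size`, `crops_coords_top_left`, `target_size`) can be pictured as one flat list of six integers. The sketch below assumes the Diffusers convention of concatenating the three tuples in that order; the helper name `build_add_time_ids` is hypothetical, and the ONNX Runtime pipeline's internals may differ.

```python
def build_add_time_ids(original_size, crops_coords_top_left, target_size):
    """Concatenate the three SDXL size-conditioning tuples into one flat list."""
    return list(original_size + crops_coords_top_left + target_size)


# Defaults described above: both sizes fall back to (height, width).
height, width = 1024, 1024
add_time_ids = build_add_time_ids((height, width), (0, 0), (height, width))
```

Passing a smaller `original_size` than `target_size` changes only the first two entries, which is what makes the output appear upsampled.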
Returns
diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
or tuple
diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
if return_dict
is True, otherwise a
tuple
. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusionXLPipeline
>>> pipe = ORTStableDiffusionXLPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
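The effect of `guidance_scale` can be illustrated with the standard classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. This is a sketch only: plain Python lists stand in for tensors, and the helper name `apply_guidance` is hypothetical.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]
text = [1.0, 3.0]
no_guidance = apply_guidance(uncond, text, 1.0)  # equals the text prediction
guided = apply_guidance(uncond, text, 5.0)       # pushed further along (text - uncond)
```

A scale of 1 reproduces the text-conditioned prediction unchanged, which is consistent with guidance being enabled only when `guidance_scale > 1`.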
ORTStableDiffusionXLImg2ImgPipeline
class optimum.onnxruntime.ORTStableDiffusionXLImg2ImgPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered pipeline for text-guided image-to-image generation using Stable Diffusion XL, corresponding to [StableDiffusionXLImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.3 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None 
callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds
instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent to thetokenizer_2
andtext_encoder_2
. If not defined,prompt
is used in both text encoders. - image (
torch.Tensor
orPIL.Image.Image
ornp.ndarray
orList[torch.Tensor]
orList[PIL.Image.Image]
orList[np.ndarray]
) — The image(s) to modify with the pipeline. - strength (
float
, optional, defaults to 0.3) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. Note that when denoising_start
is specified, the value of strength
will be ignored. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process with schedulers which support atimesteps
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. Must be in descending order. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - denoising_start (
float
, optional) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passedimage
is a partly denoised image. Note that when this is specified, strength will be ignored. Thedenoising_start
parameter is particularly beneficial when this pipeline is integrated into a “Mixture of Denoisers” multi-pipeline setup, as detailed in Refine Image Quality. - denoising_end (
float
, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be denoised by a successor pipeline that hasdenoising_start
set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refine Image Quality. - guidance_scale (
float
, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used in both text encoders. - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies toschedulers.DDIMScheduler
, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s) to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters. Each element should be a tensor of shape(batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose between PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - guidance_rescale (
float
, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale
is defined as φ
in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR. - original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — Iforiginal_size
is not the same astarget_size
the image will appear to be down- or upsampled.original_size
defaults to(height, width)
if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) —crops_coords_top_left
can be used to generate an image that appears to be “cropped” from the positioncrops_coords_top_left
downwards. Favorable, well-centered images are usually achieved by settingcrops_coords_top_left
to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — For most cases,target_size
should be set to the desired height and width of the generated image. If not specified it will default to(height, width)
. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size
for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - aesthetic_score (
float
, optional, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_aesthetic_score (
float
, optional, defaults to 2.5) — Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
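The callback contract described above can be sketched as follows. This is a minimal illustration, assuming the ORT pipeline follows the Diffusers convention that the callback returns the `callback_kwargs` dictionary; the names `log_step` and `seen_steps` are hypothetical and not part of the library.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical recorder: collects the (step, timestep) pairs seen by the callback.
seen_steps: List[Tuple[int, int]] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record progress and hand the (possibly modified) tensors back to the pipeline."""
    seen_steps.append((step, int(timestep)))
    # Only the names requested through callback_on_step_end_tensor_inputs are present.
    latents = callback_kwargs.get("latents")
    if latents is not None:
        # This is the place to inspect or replace the latents; here they pass through unchanged.
        callback_kwargs["latents"] = latents
    return callback_kwargs
```

Such a function would be passed as `callback_on_step_end=log_step` together with `callback_on_step_end_tensor_inputs=["latents"]`.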
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
or tuple
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
if return_dict
is True, otherwise a
tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusionXLImg2ImgPipeline
>>> from diffusers.utils import load_image
>>> pipe = ORTStableDiffusionXLImg2ImgPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
>>> init_image = load_image(url).convert("RGB")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt, image=init_image).images[0]
ORTStableDiffusionXLInpaintPipeline
class optimum.onnxruntime.ORTStableDiffusionXLInpaintPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image inpainting using Stable Diffusion XL and corresponding to StableDiffusionXLInpaintPipeline (https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: Tensor = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 0.9999 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = 
None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds
instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent to thetokenizer_2
andtext_encoder_2
. If not defined,prompt
is used in both text encoders. - image (
PIL.Image.Image
) —Image
, or tensor representing an image batch which will be inpainted, i.e. parts of the image will be masked out withmask_image
and repainted according toprompt
. - mask_image (
PIL.Image.Image
) —Image
, or tensor representing an image batch, to maskimage
. White pixels in the mask will be repainted, while black pixels will be preserved. Ifmask_image
is a PIL image, it will be converted to a single channel (luminance) before use. If it’s a tensor, it should contain one color channel (L) instead of 3, so the expected shape would be(B, H, W, 1)
. - height (
int
, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions. - width (
int
, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won’t work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions. - padding_mask_crop (
int
, optional, defaults toNone
) — The size of margin in the crop to be applied to the image and masking. IfNone
, no crop is applied to image and mask_image. Ifpadding_mask_crop
is notNone
, the pipeline first finds a rectangular region with the same aspect ratio as the image that contains the entire masked area, and then expands that region based on padding_mask_crop
. The image and mask_image are then cropped to the expanded region before being resized to the original image size for inpainting. This is useful when the masked area is small while the image is large and contains information irrelevant for inpainting, such as background. - strength (
float
, optional, defaults to 0.9999) — Conceptually, indicates how much to transform the masked portion of the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignores the masked portion of the referenceimage
. Note that in the case ofdenoising_start
being specified, the value of strength
will be ignored. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process with schedulers which support atimesteps
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. Must be in descending order. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - denoising_start (
float
, optional) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passedimage
is a partly denoised image. Note that when this is specified, strength will be ignored. Thedenoising_start
parameter is particularly beneficial when this pipeline is integrated into a “Mixture of Denoisers” multi-pipeline setup, as detailed in Refining the Image Output. - denoising_end (
float
, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be denoised by a successor pipeline that hasdenoising_start
set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output. - guidance_scale (
float
, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1
. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used in both text encoders. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies toschedulers.DDIMScheduler
and is ignored for other schedulers. - generator (
torch.Generator
, optional) — One or a list of torch generator(s) to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator
. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose between PIL: PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — Iforiginal_size
is not the same astarget_size
the image will appear to be down- or upsampled.original_size
defaults to(height, width)
if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) —crops_coords_top_left
can be used to generate an image that appears to be “cropped” from the positioncrops_coords_top_left
downwards. Favorable, well-centered images are usually achieved by settingcrops_coords_top_left
to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — For most cases,target_size
should be set to the desired height and width of the generated image. If not specified it will default to(height, width)
. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size
for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - aesthetic_score (
float
, optional, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_aesthetic_score (
float
, optional, defaults to 2.5) — Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
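The relationship between strength and num_inference_steps described above can be sketched as follows. This mirrors the arithmetic commonly used by Diffusers image-to-image style pipelines and is an assumption about this pipeline's behavior, not a confirmed excerpt; the function name `effective_steps` is hypothetical.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be between 0 and 1, got {strength}")
    # Number of scheduler steps' worth of noise added to the input image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first scheduler step that is executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


steps_default = effective_steps(50, 0.9999)  # default strength of this pipeline
steps_full = effective_steps(50, 1.0)        # image content is essentially ignored
steps_half = effective_steps(50, 0.5)
```

Under this arithmetic the default strength of 0.9999 runs one step fewer than strength 1.0, which is why the masked region still retains a trace of the reference image.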
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
or tuple
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
if return_dict
is True, otherwise a
tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
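The first stage of the padding_mask_crop behavior (finding the masked area and growing it by a margin) can be sketched as follows. This is a simplified illustration on a nested-list mask: the helper name `padded_mask_box` is hypothetical, and the later step that adjusts the box to the image's aspect ratio is deliberately omitted.

```python
from typing import List, Tuple


def padded_mask_box(mask: List[List[int]], padding: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the masked area grown by `padding` pixels.

    Non-zero entries are treated as masked. The box is clamped to the image bounds,
    and right/bottom are exclusive.
    """
    height, width = len(mask), len(mask[0])
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(mask[y][x] for y in range(height))]
    if not rows:
        # Assumption for this sketch: an empty mask falls back to the full image.
        return (0, 0, width, height)
    left = max(min(cols) - padding, 0)
    top = max(min(rows) - padding, 0)
    right = min(max(cols) + 1 + padding, width)
    bottom = min(max(rows) + 1 + padding, height)
    return (left, top, right, bottom)


# A 6x8 mask with a 2x2 masked square.
mask = [[0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        mask[y][x] = 1
box = padded_mask_box(mask, padding=1)
```

Cropping to such a box before inpainting lets the model spend its full resolution on the masked region instead of on irrelevant background.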
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusionXLInpaintPipeline
>>> from diffusers.utils import load_image
>>> pipe = ORTStableDiffusionXLInpaintPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0",
...     torch_dtype=torch.float16,
...     variant="fp16",
...     use_safetensors=True,
... )
>>> pipe.to("cuda")
>>> img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
>>> mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
>>> init_image = load_image(img_url).convert("RGB")
>>> mask_image = load_image(mask_url).convert("RGB")
>>> prompt = "A majestic tiger sitting on a bench"
>>> image = pipe(
...     prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80
... ).images[0]
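The denoising_end and denoising_start parameters documented for this pipeline split one schedule between two pipelines. The sketch below shows one plausible way the split point is derived, assuming a scheduler with 1000 training timesteps and a hypothetical evenly spaced 10-step schedule; `split_timesteps` is an illustrative helper, not a library function.

```python
from typing import List, Tuple


def split_timesteps(
    timesteps: List[int], fraction: float, num_train_timesteps: int = 1000
) -> Tuple[List[int], List[int]]:
    """Split a descending schedule at `fraction` of the denoising process.

    The first list is what a pipeline called with denoising_end=fraction would run,
    the second is what a successor called with denoising_start=fraction would run.
    """
    cutoff = int(round(num_train_timesteps - fraction * num_train_timesteps))
    first_stage = [t for t in timesteps if t >= cutoff]
    second_stage = [t for t in timesteps if t < cutoff]
    return first_stage, second_stage


# Hypothetical 10-step schedule: 900, 800, ..., 0.
timesteps = list(range(900, -1, -100))
base_steps, refiner_steps = split_timesteps(timesteps, 0.8)
```

With a fraction of 0.8 the first pipeline handles the high-noise part of the schedule and the successor only the final 20%, matching the "Mixture of Denoisers" description.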
ORTStableDiffusionXLImg2ImgPipeline
class optimum.onnxruntime.ORTStableDiffusionXLImg2ImgPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image-to-image generation using Stable Diffusion XL and corresponding to StableDiffusionXLImg2ImgPipeline (https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.3 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None 
callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds
instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent to thetokenizer_2
andtext_encoder_2
. If not defined,prompt
is used in both text encoders. - image (
torch.Tensor
orPIL.Image.Image
ornp.ndarray
orList[torch.Tensor]
orList[PIL.Image.Image]
orList[np.ndarray]
) — The image(s) to modify with the pipeline. - strength (
float
, optional, defaults to 0.3) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. Note that in the case ofdenoising_start
being specified, the value of strength
will be ignored. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process with schedulers which support atimesteps
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. Must be in descending order. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - denoising_start (
float
, optional) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passedimage
is a partly denoised image. Note that when this is specified, strength will be ignored. Thedenoising_start
parameter is particularly beneficial when this pipeline is integrated into a “Mixture of Denoisers” multi-pipeline setup, as detailed in Refine Image Quality. - denoising_end (
float
, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be denoised by a successor pipeline that hasdenoising_start
set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refine Image Quality. - guidance_scale (
float
, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale
is defined as w
of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1
. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used in both text encoders. - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies toschedulers.DDIMScheduler
and is ignored for other schedulers. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s) to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_pooled_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose between PIL: PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - guidance_rescale (
float
, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale
is defined as φ
in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR. - original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — Iforiginal_size
is not the same astarget_size
the image will appear to be down- or upsampled.original_size
defaults to(height, width)
if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) —crops_coords_top_left
can be used to generate an image that appears to be “cropped” from the positioncrops_coords_top_left
downwards. Favorable, well-centered images are usually achieved by settingcrops_coords_top_left
to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — For most cases,target_size
should be set to the desired height and width of the generated image. If not specified it will default to(height, width)
. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_original_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_crops_coords_top_left (
Tuple[int]
, optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - negative_target_size (
Tuple[int]
, optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size
for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. - aesthetic_score (
float
, optional, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. - negative_aesthetic_score (
float
, optional, defaults to 2.5) — Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
or tuple
diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput
if return_dict
is True, otherwise a
tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusionXLImg2ImgPipeline
>>> from diffusers.utils import load_image
>>> pipe = ORTStableDiffusionXLImg2ImgPipeline.from_pretrained(
... "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
>>> init_image = load_image(url).convert("RGB")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt, image=init_image).images[0]
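The size, crop, and aesthetic-score arguments described above are plain tuples and floats. As a rough sketch of how SDXL-style micro-conditioning values are commonly flattened into a single conditioning vector, the helper below concatenates them. The name `build_add_time_ids` is hypothetical and not part of the optimum API, and the choice to swap the target size for the aesthetic score when one is given is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def build_add_time_ids(
    original_size: Tuple[int, int],
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    target_size: Optional[Tuple[int, int]] = None,
    aesthetic_score: Optional[float] = None,
) -> List[float]:
    """Hypothetical helper: flatten micro-conditioning values into one vector."""
    if aesthetic_score is not None:
        # Assumption: refiner-style conditioning uses size + crop + aesthetic score.
        values = (*original_size, *crops_coords_top_left, aesthetic_score)
    elif target_size is not None:
        # Base-style conditioning: size + crop + target size.
        values = (*original_size, *crops_coords_top_left, *target_size)
    else:
        raise ValueError("Provide either target_size or aesthetic_score.")
    return [float(v) for v in values]


# Positive and negative conditioning built from the documented default scores.
positive_ids = build_add_time_ids((1024, 1024), (0, 0), aesthetic_score=6.0)
negative_ids = build_add_time_ids((1024, 1024), (0, 0), aesthetic_score=2.5)
base_ids = build_add_time_ids((1024, 1024), (0, 0), target_size=(1024, 1024))
```

The positive and negative vectors differ only in their last entry, which is how the two aesthetic scores steer the positive and negative text conditions differently in this sketch.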
ORTLatentConsistencyModelPipeline
class optimum.onnxruntime.ORTLatentConsistencyModelPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-to-image generation using a Latent Consistency Model and corresponding to LatentConsistencyModelPipeline (https://huggingface.co/docs/diffusers/api/pipelines/latent_consistency_models#diffusers.LatentConsistencyModelPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 4 original_inference_steps: int = None timesteps: typing.List[int] = None guidance_scale: float = 8.5 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide image generation. If not defined, you need to passprompt_embeds
. - height (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The height in pixels of the generated image. - width (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The width in pixels of the generated image. - num_inference_steps (
int
, optional, defaults to 4) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - original_inference_steps (
int
, optional) — The original number of inference steps used to generate a linearly spaced timestep schedule, from which num_inference_steps
evenly spaced timesteps are drawn as the final timestep schedule, following the Skipping-Step method in the paper (see Section 4.3). If not set, this defaults to the scheduler’s original_inference_steps
attribute. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps
timesteps on the original LCM training/distillation timestep schedule are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 8.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt
at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1
. Note that the original latent consistency models paper uses a different CFG formulation where the guidance scales are decreased by 1 (so in the paper formulation CFG is enabled when guidance_scale > 0
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose betweenPIL.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined inself.processor
. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
, optional) — A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
If return_dict
is True
, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
is returned,
otherwise a tuple
is returned where the first element is a list with the generated images and the
second element is a list of bool
s indicating whether the corresponding generated image contains
“not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> from optimum.onnxruntime import ORTDiffusionPipeline
>>> import torch
>>> pipe = ORTDiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
>>> # To save GPU memory, torch.float16 can be used, but it may compromise image quality.
>>> pipe.to(torch_device="cuda", torch_dtype=torch.float32)
>>> prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
>>> # Can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps.
>>> num_inference_steps = 4
>>> images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images
>>> images[0].save("image.png")
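The interaction between original_inference_steps and num_inference_steps described in the parameters above can be sketched in a few lines. This is a simplified illustration of the Skipping-Step idea, not the scheduler's actual code: the function name `lcm_timesteps` is hypothetical, and the values of 1000 training timesteps and 50 original inference steps are assumptions chosen for the example.

```python
from typing import List


def lcm_timesteps(
    num_inference_steps: int,
    original_inference_steps: int = 50,  # assumed value for this sketch
    num_train_timesteps: int = 1000,  # assumed value for this sketch
) -> List[int]:
    """Hypothetical helper: pick evenly spaced timesteps from a coarser schedule."""
    if not 0 < num_inference_steps <= original_inference_steps:
        raise ValueError("num_inference_steps must be in [1, original_inference_steps].")
    # Linearly spaced schedule over the training timesteps, in descending order.
    stride = num_train_timesteps // original_inference_steps
    origin = [i * stride - 1 for i in range(original_inference_steps, 0, -1)]
    # Evenly spaced indices into that schedule (floor of a linear ramp).
    indices = [(i * len(origin)) // num_inference_steps for i in range(num_inference_steps)]
    return [origin[j] for j in indices]


schedule = lcm_timesteps(4)
print(schedule)  # [999, 759, 499, 259]
```

The resulting list is in descending order, which matches the requirement stated for custom timesteps.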
ORTLatentConsistencyModelImg2ImgPipeline
class optimum.onnxruntime.ORTLatentConsistencyModelImg2ImgPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image-to-image generation using a Latent Consistency Model and corresponding to LatentConsistencyModelImg2ImgPipeline (https://huggingface.co/docs/diffusers/api/pipelines/latent_consistency_models#diffusers.LatentConsistencyModelImg2ImgPipeline).
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None num_inference_steps: int = 4 strength: float = 0.8 original_inference_steps: int = None timesteps: typing.List[int] = None guidance_scale: float = 8.5 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide image generation. If not defined, you need to passprompt_embeds
. - height (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The height in pixels of the generated image. - width (
int
, optional, defaults toself.unet.config.sample_size * self.vae_scale_factor
) — The width in pixels of the generated image. - num_inference_steps (
int
, optional, defaults to 4) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - original_inference_steps (
int
, optional) — The original number of inference steps used to generate a linearly spaced timestep schedule, from which num_inference_steps
evenly spaced timesteps are drawn as the final timestep schedule, following the Skipping-Step method in the paper (see Section 4.3). If not set, this defaults to the scheduler’s original_inference_steps
attribute. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps
timesteps on the original LCM training/distillation timestep schedule are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 8.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt
at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1
. Note that the original latent consistency models paper uses a different CFG formulation where the guidance scales are decreased by 1 (so in the paper formulation CFG is enabled when guidance_scale > 0
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
List[torch.Tensor]
, optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose betweenPIL.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
instead of a plain tuple. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined inself.processor
. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
, optional) — A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
Returns
diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
or tuple
If return_dict
is True
, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
is returned,
otherwise a tuple
is returned where the first element is a list with the generated images and the
second element is a list of bool
s indicating whether the corresponding generated image contains
“not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> from optimum.onnxruntime import ORTPipelineForImage2Image
>>> import torch
>>> import PIL
>>> pipe = ORTPipelineForImage2Image.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
>>> # To save GPU memory, torch.float16 can be used, but it may compromise image quality.
>>> pipe.to(torch_device="cuda", torch_dtype=torch.float32)
>>> prompt = "High altitude snowy mountains"
>>> image = PIL.Image.open("./snowy_mountains.png")
>>> # Can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps.
>>> num_inference_steps = 4
>>> images = pipe(
... prompt=prompt, image=image, num_inference_steps=num_inference_steps, guidance_scale=8.0
... ).images
>>> images[0].save("image.png")
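The callback_on_step_end parameter documented above takes a function with the signature (pipeline, step, timestep, callback_kwargs). The sketch below shows one such function together with a stand-in loop that only mimics the call order. The names `record_step` and `fake_denoising_loop` are hypothetical, the loop is not the pipeline's real implementation, and returning the dictionary from the callback follows the usual Diffusers convention, which is assumed to apply here as well.

```python
from typing import Any, Callable, Dict, List, Sequence

seen_steps: List[int] = []


def record_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Callback with the documented signature; records the step and returns the kwargs unchanged."""
    seen_steps.append(step)
    return callback_kwargs


def fake_denoising_loop(
    timesteps: Sequence[int],
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    tensor_inputs: Sequence[str] = ("latents",),
) -> Dict[str, Any]:
    """Hypothetical stand-in for a pipeline loop, used only to show how the callback is invoked."""
    state: Dict[str, Any] = {"latents": [0.0], "prompt_embeds": [1.0]}
    for step, timestep in enumerate(timesteps):
        # Only the names listed in tensor_inputs are handed to the callback.
        callback_kwargs = {name: state[name] for name in tensor_inputs}
        outputs = callback(None, step, timestep, callback_kwargs)
        # Values returned by the callback replace the loop's current values.
        state.update(outputs)
    return state


final_state = fake_denoising_loop([999, 759, 499, 259], record_step)
```

With a real pipeline, the same function would be passed as callback_on_step_end=record_step, and callback_on_step_end_tensor_inputs would control which tensors appear in callback_kwargs.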
ORTStableDiffusion3Pipeline
class optimum.onnxruntime.ORTStableDiffusion3Pipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-to-image generation using Stable Diffusion 3 and corresponding to StableDiffusion3Pipeline.
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 28 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 skip_guidance_layers: typing.List[int] = None skip_layer_guidance_scale: float = 2.8 skip_layer_guidance_stop: float = 0.2 skip_layer_guidance_start: float = 0.01 mu: typing.Optional[float] = None ) → diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds
instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent totokenizer_2
andtext_encoder_2
. If not defined,prompt
will be used instead. - prompt_3 (
str
orList[str]
, optional) — The prompt or prompts to be sent totokenizer_3
andtext_encoder_3
. If not defined,prompt
will be used instead. - height (
int
, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results. - width (
int
, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results. - num_inference_steps (
int
, optional, defaults to 28) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - guidance_scale (
float
, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale
is defined as w
of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1
. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used instead. - negative_prompt_3 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_3
andtext_encoder_3
. If not defined,negative_prompt
is used instead. - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s) to make generation deterministic. - latents (
torch.FloatTensor
, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator
. - prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - pooled_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_pooled_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - ip_adapter_image (
PipelineImageInput
, optional) — Optional image input to work with IP Adapters. - ip_adapter_image_embeds (
torch.Tensor
, optional) — Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape(batch_size, num_images, emb_dim)
. It should contain the negative image embedding ifdo_classifier_free_guidance
is set toTrue
. If not provided, embeddings are computed from theip_adapter_image
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose between PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return adiffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
instead of a plain tuple. - joint_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - callback_on_step_end (
Callable
, optional) — A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class. - max_sequence_length (
int
, defaults to 256) — Maximum sequence length to use with the prompt
. - skip_guidance_layers (
List[int]
, optional) — A list of integers that specify layers to skip during guidance. If not provided, all layers will be used for guidance. If provided, the guidance will only be applied to the layers specified in the list. The value recommended by Stability AI for Stable Diffusion 3.5 Medium is [7, 8, 9]. - skip_layer_guidance_scale (
float
, optional, defaults to 2.8) — The scale of the guidance for the layers specified in skip_guidance_layers
. The guidance will be applied to the layers specified inskip_guidance_layers
with a scale ofskip_layer_guidance_scale
. The guidance will be applied to the rest of the layers with a scale of1
. - skip_layer_guidance_stop (
float
, optional, defaults to 0.2) — The fraction of the denoising steps at which the guidance for the layers specified in skip_guidance_layers
will stop. The guidance will be applied to the layers specified inskip_guidance_layers
until the fraction specified inskip_layer_guidance_stop
. The value recommended by Stability AI for Stable Diffusion 3.5 Medium is 0.2. - skip_layer_guidance_start (
float
, optional, defaults to 0.01) — The fraction of the denoising steps at which the guidance for the layers specified in skip_guidance_layers
will start. The guidance will be applied to the layers specified inskip_guidance_layers
from the fraction specified inskip_layer_guidance_start
. The value recommended by Stability AI for Stable Diffusion 3.5 Medium is 0.01. - mu (
float
, optional) —mu
value used fordynamic_shifting
.
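The skip_layer_guidance_start and skip_layer_guidance_stop values above are fractions of the denoising schedule rather than step indices. A minimal sketch of how such a window can be turned into concrete step indices follows. The helper name `skip_layer_steps` is hypothetical, and the use of strict inequalities at both ends is an assumption of this sketch rather than a documented guarantee.

```python
from typing import List


def skip_layer_steps(
    num_inference_steps: int = 28,
    skip_layer_guidance_start: float = 0.01,
    skip_layer_guidance_stop: float = 0.2,
) -> List[int]:
    """Hypothetical helper: list the step indices that fall inside the skip-layer window."""
    lower = num_inference_steps * skip_layer_guidance_start
    upper = num_inference_steps * skip_layer_guidance_stop
    # Assumption: a step is inside the window when lower < step < upper.
    return [step for step in range(num_inference_steps) if lower < step < upper]


active_steps = skip_layer_steps()
print(active_steps)  # [1, 2, 3, 4, 5]
```

With the signature defaults (28 steps, start 0.01, stop 0.2), only the first few steps after step 0 fall inside the window under these assumptions.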
Returns
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
or tuple
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
if return_dict
is True, otherwise a
tuple
. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusion3Pipeline
>>> pipe = ORTStableDiffusion3Pipeline.from_pretrained(
... "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> image = pipe(prompt).images[0]
>>> image.save("sd3.png")
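The guidance_scale parameter above refers to classifier-free guidance. As an illustration of the usual formulation, the sketch below combines an unconditional and a text-conditioned noise prediction. It operates on plain Python lists for readability, and the function name `apply_cfg` is hypothetical; the pipeline itself works on tensors and may organize the computation differently.

```python
from typing import List, Sequence


def apply_cfg(
    noise_uncond: Sequence[float],
    noise_text: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Hypothetical helper: uncond + guidance_scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# guidance_scale = 1.0 reproduces the text-conditioned prediction (no extra guidance).
unguided = apply_cfg([0.0, 1.0], [1.0, 3.0], guidance_scale=1.0)
# Larger values push the result further away from the unconditional prediction.
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], guidance_scale=7.0)
```

This is why guidance is described as enabled only when guidance_scale > 1: at exactly 1 the unconditional branch has no effect on the result.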
ORTStableDiffusion3Img2ImgPipeline
class optimum.onnxruntime.ORTStableDiffusion3Img2ImgPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image-to-image generation using Stable Diffusion 3 and corresponding to StableDiffusion3Img2ImgPipeline.
This Pipeline inherits from ORTDiffusionPipeline
and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained
method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.6 num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 mu: typing.Optional[float] = None ) → diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
- prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
- prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_3 and text_encoder_3. If not defined, prompt is used instead.
- height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results.
- width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results.
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- guidance_scale (float, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used instead.
- negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_3 and text_encoder_3. If not defined, negative_prompt is used instead.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
- negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- ip_adapter_image_embeds (torch.Tensor, optional) — Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput instead of a plain tuple.
- joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the _callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 256) — Maximum sequence length to use with the prompt.
- mu (float, optional) — mu value used for dynamic_shifting.
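The guidance_scale description above can be illustrated with a small sketch. It assumes the standard classifier-free guidance combination, in which the unconditional prediction is pushed toward the text-conditioned one by a factor of guidance_scale. Plain Python lists stand in for tensors, and apply_guidance is a hypothetical helper written for this illustration, not part of optimum or diffusers.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float], noise_text: List[float], guidance_scale: float
) -> List[float]:
    """Combine unconditional and text-conditioned predictions element-wise.

    Assumed formulation: uncond + guidance_scale * (text - uncond).
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Toy predictions standing in for model outputs (illustrative values only).
uncond = [0.0, 1.0, 2.0]
text = [1.0, 1.0, 0.0]

guided = apply_guidance(uncond, text, guidance_scale=7.0)
unguided = apply_guidance(uncond, text, guidance_scale=1.0)
```

With guidance_scale set to 1.0 the result equals the text-conditioned prediction, which is consistent with guidance only taking effect for values above 1.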
Returns
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
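The default height and width are documented as the product of the transformer sample size and the VAE scale factor. The arithmetic can be sketched as follows; the values 128 and 8 are assumptions chosen so that the product matches the documented default of 1024, and default_resolution is a hypothetical helper, not a library function.

```python
def default_resolution(sample_size: int, vae_scale_factor: int) -> int:
    """Pixel side length implied by a latent sample size and a VAE scale factor."""
    return sample_size * vae_scale_factor


# Assumed example values; the real ones come from the loaded model configuration.
sample_size = 128
vae_scale_factor = 8

side = default_resolution(sample_size, vae_scale_factor)

# Going the other way: the latent side length for a given pixel size.
latent_side = side // vae_scale_factor
```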
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTPipelineForImage2Image
>>> from diffusers.utils import load_image
>>> device = "cuda"
>>> model_id_or_path = "stabilityai/stable-diffusion-3-medium-diffusers"
>>> pipe = ORTPipelineForImage2Image.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)
>>> url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
>>> init_image = load_image(url).resize((1024, 1024))
>>> prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
>>> image = pipe(prompt=prompt, image=init_image, strength=0.95, guidance_scale=7.5).images[0]
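The callback_on_step_end parameter takes a function with the signature (pipeline, step, timestep, callback_kwargs). The sketch below shows one way such a callback could look. Since running a real pipeline needs model weights, fake_denoising_loop is a hypothetical stand-in that only mimics how a pipeline might invoke the callback; the assumption that the callback returns the (possibly modified) callback_kwargs follows the usual Diffusers convention.

```python
from typing import Any, Callable, Dict, List

seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    return callback_kwargs


def fake_denoising_loop(
    num_steps: int,
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    tensor_inputs: List[str],
) -> List[float]:
    """Hypothetical stand-in for a pipeline's denoising loop (not library code)."""
    latents = [0.0]
    for step in range(num_steps):
        timestep = 1000 - step * (1000 // num_steps)
        latents = [x + 1.0 for x in latents]  # pretend denoising update
        available = {"latents": latents}
        # Only the tensors named in tensor_inputs are exposed to the callback.
        callback_kwargs = {name: available[name] for name in tensor_inputs}
        outputs = callback(None, step, timestep, callback_kwargs)
        latents = outputs.get("latents", latents)
    return latents


final_latents = fake_denoising_loop(3, log_step, ["latents"])
```

With a real pipeline, the same function would be passed as pipe(..., callback_on_step_end=log_step, callback_on_step_end_tensor_inputs=["latents"]).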
ORTStableDiffusion3InpaintPipeline
class optimum.onnxruntime.ORTStableDiffusion3InpaintPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-guided image inpainting using Stable Diffusion 3 and corresponding to StableDiffusion3InpaintPipeline.
This Pipeline inherits from ORTDiffusionPipeline and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: int = None width: int = None padding_mask_crop: typing.Optional[int] = None strength: float = 0.6 num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: 
typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 mu: typing.Optional[float] = None ) → diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput
or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
- prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
- prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_3 and text_encoder_3. If not defined, prompt is used instead.
- image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to be used as the starting point. For both numpy arrays and PyTorch tensors, the expected value range is [0, 1]. If it is a tensor or a list of tensors, the expected shape is (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape is (B, H, W, C) or (H, W, C). Image latents can also be passed as image; latents passed directly are not encoded again.
- mask_image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array or tensor representing an image batch to mask image. White pixels in the mask are repainted while black pixels are preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a numpy array or PyTorch tensor, it should contain one color channel (L) instead of 3, so the expected shape for a PyTorch tensor is (B, 1, H, W), (B, H, W), (1, H, W), or (H, W), and for a numpy array (B, H, W, 1), (B, H, W), (H, W, 1), or (H, W).
- masked_image_latents (torch.Tensor, List[torch.Tensor]) — Tensor representing an image batch to mask image generated by VAE. If not provided, the mask latents tensor will be generated from mask_image.
- height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results.
- width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results.
- padding_mask_crop (int, optional, defaults to None) — The size of the margin in the crop to be applied to the image and mask. If None, no crop is applied to image and mask_image. If padding_mask_crop is not None, the pipeline first finds a rectangular region with the same aspect ratio as the image that contains all masked areas, and then expands that area based on padding_mask_crop. The image and mask_image are then cropped based on the expanded area before being resized to the original image size for inpainting. This is useful when the masked area is small while the image is large and contains information irrelevant for inpainting, such as background.
- strength (float, optional, defaults to 0.6) — Indicates the extent to transform the reference image. Must be between 0 and 1. image is used as a starting point and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- guidance_scale (float, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used instead.
- negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_3 and text_encoder_3. If not defined, negative_prompt is used instead.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
- negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- ip_adapter_image_embeds (torch.Tensor, optional) — Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput instead of a plain tuple.
- joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the _callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 256) — Maximum sequence length to use with the prompt.
- mu (float, optional) — mu value used for dynamic_shifting.
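The relationship between strength and the number of denoising steps that actually run can be sketched as below. This assumes the truncation rule commonly used by image-to-image style pipelines in Diffusers, where the first part of the schedule is skipped in proportion to 1 - strength; effective_steps is a hypothetical helper for illustration and the exact rounding in the ONNX Runtime pipeline may differ.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising iterations that run for a given strength (assumed rule)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Portion of the schedule that is kept, capped at the full schedule.
    init_timestep = min(num_inference_steps * strength, num_inference_steps)
    # Index of the first step that is actually executed.
    t_start = int(max(num_inference_steps - init_timestep, 0))
    return num_inference_steps - t_start


full_run = effective_steps(50, 1.0)   # strength 1: the whole schedule runs
half_run = effective_steps(50, 0.5)   # strength 0.5: half of the schedule runs
no_run = effective_steps(50, 0.0)     # strength 0: the input image is kept as is
```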
Returns
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple
diffusers.pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTStableDiffusion3InpaintPipeline
>>> from diffusers.utils import load_image
>>> pipe = ORTStableDiffusion3InpaintPipeline.from_pretrained(
... "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
>>> img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
>>> mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
>>> source = load_image(img_url)
>>> mask = load_image(mask_url)
>>> image = pipe(prompt=prompt, image=source, mask_image=mask).images[0]
>>> image.save("sd3_inpainting.png")
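The mask convention used by mask_image (white pixels are repainted, black pixels are preserved) can be illustrated with a toy blend. This is a pixel-space sketch under the assumption that the mask is binarized at 0.5; the actual pipeline operates on latents, and binarize and blend are hypothetical helpers, not library functions.

```python
from typing import List

Grid = List[List[float]]


def binarize(mask: Grid, threshold: float = 0.5) -> Grid:
    """Map mask values to 1.0 (repaint) or 0.0 (preserve)."""
    return [[1.0 if value >= threshold else 0.0 for value in row] for row in mask]


def blend(original: Grid, generated: Grid, mask: Grid) -> Grid:
    """Take generated pixels where the mask is white, original pixels elsewhere."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Toy 2x2 single-channel images with values in [0, 1].
original = [[0.2, 0.2], [0.2, 0.2]]
generated = [[0.9, 0.9], [0.9, 0.9]]
mask = [[1.0, 0.0], [0.8, 0.1]]

result = blend(original, generated, binarize(mask))
```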
ORTFluxPipeline
class optimum.onnxruntime.ORTFluxPipeline
< source >( unet_session: InferenceSession | None = None transformer_session: InferenceSession | None = None vae_decoder_session: InferenceSession | None = None vae_encoder_session: InferenceSession | None = None text_encoder_session: InferenceSession | None = None text_encoder_2_session: InferenceSession | None = None text_encoder_3_session: InferenceSession | None = None scheduler: SchedulerMixin | None = None tokenizer: CLIPTokenizer | None = None tokenizer_2: CLIPTokenizer | None = None tokenizer_3: CLIPTokenizer | None = None feature_extractor: CLIPFeatureExtractor | None = None force_zeros_for_empty_prompt: bool = True requires_aesthetics_score: bool = False add_watermarker: bool | None = None use_io_binding: bool | None = None model_save_dir: str | Path | TemporaryDirectory | None = None **kwargs )
ONNX Runtime-powered Pipeline for text-to-image generation using Flux and corresponding to FluxPipeline.
This Pipeline inherits from ORTDiffusionPipeline and is used to run inference with the ONNX Runtime.
The pipeline can be loaded from a pretrained pipeline using the generic ORTDiffusionPipeline.from_pretrained method.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str]] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None true_cfg_scale: float = 1.0 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 28 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 3.5 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None negative_ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None negative_ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → diffusers.pipelines.flux.FluxPipelineOutput
or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
- prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if true_cfg_scale is not greater than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in all the text-encoders.
- true_cfg_scale (float, optional, defaults to 1.0) — True classifier-free guidance (guidance scale) is enabled when true_cfg_scale > 1 and negative_prompt is provided.
- height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results.
- width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results.
- num_inference_steps (int, optional, defaults to 28) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- guidance_scale (float, optional, defaults to 3.5) — Embedded guidance scale is enabled by setting guidance_scale > 1. A higher guidance_scale encourages the model to generate images more aligned with prompt at the expense of lower image quality. Guidance-distilled models approximate true classifier-free guidance for guidance_scale > 1. Refer to the paper to learn more.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). If not provided, embeddings are computed from the ip_adapter_image input argument.
- negative_ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- negative_ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). If not provided, embeddings are computed from the ip_adapter_image input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a diffusers.pipelines.flux.FluxPipelineOutput instead of a plain tuple.
- joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the _callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use with the prompt.
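The condition under which true classifier-free guidance applies, as stated for true_cfg_scale and negative_prompt above, can be written as a small predicate. This is a sketch of the documented rule only; uses_true_cfg is a hypothetical helper, and it does not model the case where negative embeddings are passed instead of a negative prompt.

```python
from typing import List, Optional, Union


def uses_true_cfg(
    true_cfg_scale: float, negative_prompt: Optional[Union[str, List[str]]] = None
) -> bool:
    """True CFG applies only if the scale exceeds 1 and a negative prompt is given."""
    return true_cfg_scale > 1 and negative_prompt is not None


enabled = uses_true_cfg(4.0, "blurry, low quality")
default_scale = uses_true_cfg(1.0, "blurry, low quality")  # default scale: disabled
no_negative = uses_true_cfg(4.0)                           # no negative prompt: disabled
```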
Returns
diffusers.pipelines.flux.FluxPipelineOutput or tuple
diffusers.pipelines.flux.FluxPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from optimum.onnxruntime import ORTFluxPipeline
>>> pipe = ORTFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell")
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
>>> image.save("flux.png")
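The sigmas parameter accepts a plain list of floats. One way to build such a list is sketched below, assuming a linearly spaced schedule from 1.0 down to 1 / num_inference_steps. That spacing is an illustrative choice and not necessarily what the scheduler uses by default; linear_sigmas is a hypothetical helper.

```python
from typing import List


def linear_sigmas(num_inference_steps: int) -> List[float]:
    """Linearly spaced sigmas from 1.0 down to 1 / num_inference_steps (assumed schedule)."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    if num_inference_steps == 1:
        return [1.0]
    end = 1.0 / num_inference_steps
    step = (1.0 - end) / (num_inference_steps - 1)
    return [1.0 - i * step for i in range(num_inference_steps)]


sigmas = linear_sigmas(4)
```

The resulting list could then be passed as pipe(prompt, sigmas=sigmas), provided the scheduler's set_timesteps method supports a sigmas argument.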