Kandinsky 3
Kandinsky 3 was created by Vladimir Arkhipkin, Anastasia Maltseva, Igor Pavlov, Andrei Filatov, Arseniy Shakhmatov, Andrey Kuznetsov, Denis Dimitrov, and Zein Shaheen.
The description from its GitHub page:
Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.
Its architecture includes 3 main components:
- FLAN-UL2, an encoder-decoder model based on the T5 architecture.
- A new U-Net architecture featuring BigGAN-deep blocks, which doubles the depth while maintaining the same number of parameters.
- Sber-MoVQGAN, a decoder with proven superior results in image restoration.
The original codebase can be found at ai-forever/Kandinsky-3.
Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
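The quickest way to try the model is through the auto pipelines. Below is a minimal text-to-image sketch; it assumes the `kandinsky-community/kandinsky-3` Hub checkpoint with an fp16 variant is available in your environment.

```py
import torch
from diffusers import AutoPipelineForText2Image

# Load the Kandinsky 3 checkpoint (assumed repository name and fp16 variant).
pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # FLAN-UL2 and the U-Net are large; offloading saves VRAM

prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats."
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("kandinsky3_t2i.png")
```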
Kandinsky3Pipeline
class diffusers.Kandinsky3Pipeline
< source >( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = 1024 width: typing.Optional[int] = 1024 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 latents = None )
Parameters

- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
- num_inference_steps (`int`, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced `num_inference_steps` timesteps are used. Must be in descending order.
- guidance_scale (`float`, optional, defaults to 3.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than 1).
- num_images_per_prompt (`int`, optional, defaults to 1) — The number of images to generate per prompt.
- height (`int`, optional, defaults to 1024) — The height in pixels of the generated image.
- width (`int`, optional, defaults to 1024) — The width in pixels of the generated image.
- eta (`float`, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to `DDIMScheduler`; ignored for other schedulers.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated image. Choose between `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.stable_diffusion.IFPipelineOutput` instead of a plain tuple.
- callback (`Callable`, optional) — A function that will be called every `callback_steps` steps during inference. The function will be called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
- callback_steps (`int`, optional, defaults to 1) — The frequency at which the `callback` function will be called. If not specified, the callback will be called at every step.
- clean_caption (`bool`, optional, defaults to `True`) — Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
- cross_attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in `diffusers.models.attention_processor`.
Function invoked when calling the pipeline for generation.
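A minimal sketch of calling `Kandinsky3Pipeline` directly with the parameters documented above. The checkpoint name and the fp16 variant are assumptions; adjust them to the repository you actually use.

```py
import torch
from diffusers import Kandinsky3Pipeline

# Assumed checkpoint; swap in your own repository if needed.
pipe = Kandinsky3Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# A fixed generator makes the run reproducible.
generator = torch.Generator("cpu").manual_seed(0)
image = pipe(
    prompt="An alien cheeseburger creature eating itself, claymation, cinematic",
    negative_prompt="low quality, blurry",
    num_inference_steps=25,
    guidance_scale=3.0,
    height=1024,
    width=1024,
    generator=generator,
).images[0]
image.save("kandinsky3_sample.png")
```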
encode_prompt
< source >( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None _cut_context = False )
Encodes the prompt into text encoder hidden states.
- device (`torch.device`, optional) — The torch device on which to place the resulting embeddings.
- num_images_per_prompt (`int`, optional, defaults to 1) — The number of images that should be generated per prompt.
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than 1).
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
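As a rough sketch, `encode_prompt` can be called directly to inspect the T5 embeddings the pipeline conditions on. The exact contents of the returned tuple may differ between diffusers versions, so only the first two entries (the prompt and negative prompt embeddings) are unpacked here; treat this as an assumption rather than a guaranteed signature.

```py
import torch
from diffusers import Kandinsky3Pipeline

pipe = Kandinsky3Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# The positive and negative embeddings come first; later entries (e.g. attention
# masks) depend on the diffusers version, so index defensively.
outputs = pipe.encode_prompt(
    "A cute corgi lives in a house made out of sushi",
    do_classifier_free_guidance=True,
    num_images_per_prompt=1,
    device="cuda",
    negative_prompt="low quality",
)
prompt_embeds, negative_prompt_embeds = outputs[0], outputs[1]
print(prompt_embeds.shape)           # (batch, sequence_length, hidden_size)
print(negative_prompt_embeds.shape)
```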
Kandinsky3Img2ImgPipeline
class diffusers.Kandinsky3Img2ImgPipeline
< source >( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.FloatTensor, PIL.Image.Image, typing.List[torch.FloatTensor], typing.List[PIL.Image.Image]] = None strength: float = 0.3 num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 latents = None )
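A minimal image-to-image sketch following the signature above. The checkpoint name is an assumption, and the input image path is a placeholder for your own file.

```py
import torch
from diffusers import Kandinsky3Img2ImgPipeline
from diffusers.utils import load_image

pipe = Kandinsky3Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# Any RGB image works as the starting point; replace with your own file or URL.
init_image = load_image("my_input_image.png")

# strength controls how much the input image is altered: values near 0.0 keep it
# mostly unchanged, values near 1.0 ignore it almost entirely.
image = pipe(
    prompt="A painting of the same scene in the style of Van Gogh",
    image=init_image,
    strength=0.75,
    num_inference_steps=25,
    guidance_scale=3.0,
).images[0]
image.save("kandinsky3_i2i.png")
```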
encode_prompt
< source >( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None _cut_context = False )
Encodes the prompt into text encoder hidden states.
- device (`torch.device`, optional) — The torch device on which to place the resulting embeddings.
- num_images_per_prompt (`int`, optional, defaults to 1) — The number of images that should be generated per prompt.
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than 1).
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.