Diffusers documentation

Kandinsky 3

This page documents Diffusers v0.24.0. A newer version, v0.29.2, is available.

Kandinsky 3 was created by Vladimir Arkhipkin, Anastasia Maltseva, Igor Pavlov, Andrei Filatov, Arseniy Shakhmatov, Andrey Kuznetsov, Denis Dimitrov, and Zein Shaheen.

The description from its GitHub page:

Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.

Its architecture includes 3 main components:

  1. FLAN-UL2, an encoder-decoder model based on the T5 architecture.
  2. A new U-Net architecture featuring BigGAN-deep blocks, which doubles the depth while maintaining the same number of parameters.
  3. Sber-MoVQGAN, a decoder proven to have superior results in image restoration.

The original codebase can be found at ai-forever/Kandinsky-3.

Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
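As a minimal sketch of text-to-image generation with a checkpoint from the Kandinsky Community organization, the example below assumes the `kandinsky-community/kandinsky-3` repository ID, an `fp16` weight variant, and that `torch`, `diffusers`, and `accelerate` are installed. The helper names `build_call_kwargs` and `generate` are made up for illustration; the defaults mirror the call signature documented below.

```python
def build_call_kwargs(prompt, steps=25, guidance_scale=3.0, height=1024, width=1024):
    """Collect call arguments; defaults follow the signature documented on this page."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "height": height,
        "width": width,
    }


def generate(prompt, seed=0, **overrides):
    """Load the pipeline and return one PIL image (downloads weights on first use)."""
    # Imported lazily so the helper above works without torch/diffusers installed.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "kandinsky-community/kandinsky-3",  # assumed Hub repository ID
        variant="fp16",                     # assumed weight variant
        torch_dtype=torch.float16,
    )
    pipe.enable_model_cpu_offload()  # needs `accelerate`; trades speed for lower GPU memory

    # A seeded generator makes the output repeatable.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    kwargs = build_call_kwargs(prompt, **overrides)
    return pipe(generator=generator, **kwargs).images[0]
```

A call such as `generate("A photograph of the inside of a subway train", steps=25).save("out.png")` would then write the first generated image to disk.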


class diffusers.Kandinsky3Pipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )


__call__

( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = 1024 width: typing.Optional[int] = 1024 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 latents = None )


  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds instead.
  • num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order. This parameter is documented here but does not appear in the call signature above.
  • guidance_scale (float, optional, defaults to 3.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale corresponds to w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • height (int, optional, defaults to 1024) — The height in pixels of the generated image.
  • width (int, optional, defaults to 1024) — The width in pixels of the generated image.
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for others. This parameter is documented here but does not appear in the call signature above.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image) or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
  • callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
  • clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt. This parameter is documented here but does not appear in the call signature above.
  • cross_attention_kwargs (dict, optional) — A dictionary of keyword arguments that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor. This parameter is documented here but does not appear in the call signature above.
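
To make the guidance_scale description concrete, here is a small sketch of the classifier-free guidance combination in the form of equation 2 of the Imagen paper, using plain Python lists in place of tensors. The function name `apply_guidance` is hypothetical, and the pipeline's internal implementation may parameterize the weight differently. The sketch only illustrates how a larger w pushes the prediction toward the text-conditioned branch.

```python
def apply_guidance(noise_uncond, noise_text, w):
    """Combine unconditional and text-conditioned noise predictions.

    Imagen eq. 2 form: w * eps_text + (1 - w) * eps_uncond,
    which is the same as eps_uncond + w * (eps_text - eps_uncond).
    """
    return [u + w * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0, 2.0]  # toy unconditional prediction
text = [1.0, 1.0, 0.0]    # toy text-conditioned prediction

print(apply_guidance(uncond, text, 1.0))  # w = 1 reproduces the text-conditioned prediction
print(apply_guidance(uncond, text, 3.0))  # w > 1 extrapolates past it
```

With w = 1 the result equals the text-conditioned prediction, which matches the statement that guidance only takes effect for guidance_scale > 1.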

Function invoked when calling the pipeline for generation.
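
The callback and callback_steps parameters can be illustrated without loading a model. The sketch below uses a stand-in loop (`fake_denoising_loop`, a hypothetical name) that invokes the callback the way the parameter description states: every callback_steps steps, with the arguments (step, timestep, latents). It is not the pipeline's actual loop, and the timestep values are made up.

```python
def fake_denoising_loop(num_inference_steps, callback=None, callback_steps=1):
    """Stand-in for a denoising loop; only the callback cadence is modelled."""
    latents = 0.0  # placeholder for a latent tensor
    # Made-up descending timesteps, one per inference step.
    stride = 1000 // num_inference_steps
    timesteps = [1000 - i * stride for i in range(num_inference_steps)]
    for step, timestep in enumerate(timesteps):
        latents += 1.0  # placeholder for the real denoising update
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)
    return latents


seen = []


def log_progress(step, timestep, latents):
    """Callback with the documented (step, timestep, latents) arguments."""
    seen.append((step, timestep))


fake_denoising_loop(25, callback=log_progress, callback_steps=5)
print(seen)
```

With a real pipeline object, the same function would be passed as `pipe(prompt, callback=log_progress, callback_steps=5)`.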


encode_prompt

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None _cut_context = False )


  • prompt (str or List[str], optional) — prompt to be encoded

Encodes the prompt into text encoder hidden states.

  • device (torch.device, optional) — The torch device to place the resulting embeddings on.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
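
As a rough illustration of how prompt, negative_prompt, num_images_per_prompt, and do_classifier_free_guidance interact, the sketch below works out which text each generated image would be paired with. `expand_prompts` is a hypothetical helper written for this page, not part of diffusers, and the real method returns encoder hidden states rather than strings. Two behaviors are assumptions of the sketch: a missing negative prompt defaults to the empty string, and a single negative string is shared by every prompt.

```python
def expand_prompts(prompt, negative_prompt=None, num_images_per_prompt=1,
                   do_classifier_free_guidance=True):
    """Return (prompts, negative_prompts), each repeated once per generated image."""
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    batch_size = len(prompts)

    negatives = None
    if do_classifier_free_guidance:
        if negative_prompt is None:
            negatives = [""] * batch_size  # assumed default: empty negative prompt
        elif isinstance(negative_prompt, str):
            negatives = [negative_prompt] * batch_size  # assumed: shared by every prompt
        else:
            negatives = list(negative_prompt)
        if len(negatives) != batch_size:
            raise ValueError(
                f"negative_prompt has {len(negatives)} entries but prompt has {batch_size}"
            )

    def repeat(items):
        # Each entry is duplicated so every generated image has its own copy.
        return [item for item in items for _ in range(num_images_per_prompt)]

    return repeat(prompts), (repeat(negatives) if negatives is not None else None)


print(expand_prompts(["a cat", "a dog"], "blurry", num_images_per_prompt=2))
```

When guidance is disabled, the negative prompt is ignored and the second return value is None, mirroring the "ignored when not using guidance" note above.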


class diffusers.Kandinsky3Img2ImgPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )


__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.FloatTensor, PIL.Image.Image, typing.List[torch.FloatTensor], typing.List[PIL.Image.Image]] = None strength: float = 0.3 num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 latents = None )
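
The strength argument controls how much of the noise schedule is applied to the input image. The sketch below shows the step-count arithmetic that diffusers image-to-image pipelines commonly use, followed by a lazily imported usage function. It is an assumption that Kandinsky3Img2ImgPipeline follows exactly this arithmetic; `steps_actually_run` and `restyle` are hypothetical names, and the `kandinsky-community/kandinsky-3` repository ID and `fp16` variant are likewise assumed.

```python
def steps_actually_run(num_inference_steps=25, strength=0.3):
    """Number of denoising steps executed for a given strength (assumed arithmetic)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Only the last `strength` fraction of the schedule is run on the noised input.
    return min(int(num_inference_steps * strength), num_inference_steps)


def restyle(init_image, prompt, strength=0.3, seed=0):
    """Run image-to-image on a PIL image (downloads weights on first use)."""
    # Imported lazily so the helper above works without torch/diffusers installed.
    import torch
    from diffusers import AutoPipelineForImage2Image

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "kandinsky-community/kandinsky-3",  # assumed Hub repository ID
        variant="fp16",                     # assumed weight variant
        torch_dtype=torch.float16,
    )
    pipe.enable_model_cpu_offload()  # needs `accelerate`

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(prompt, image=init_image, strength=strength,
                  num_inference_steps=25, generator=generator)
    return result.images[0]


print(steps_actually_run(25, 0.3))  # default strength from the signature above
print(steps_actually_run(25, 1.0))  # full strength runs the whole schedule
```

Under this arithmetic, a low strength keeps the output close to the input image because few denoising steps are run, while a strength of 1.0 runs every step.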


encode_prompt

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None _cut_context = False )


  • prompt (str or List[str], optional) — prompt to be encoded

Encodes the prompt into text encoder hidden states.

  • device (torch.device, optional) — The torch device to place the resulting embeddings on.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.