unCLIP
Overview
Hierarchical Text-Conditional Image Generation with CLIP Latents by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
The abstract of the paper is the following:
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
The unCLIP model in diffusers comes from kakaobrain’s karlo and the original codebase can be found here. Additionally, lucidrains has a DALL-E 2 recreation here.
Available Pipelines:
Pipeline | Tasks | Colab |
---|---|---|
pipeline_unclip.py | Text-to-Image Generation | - |
pipeline_unclip_image_variation.py | Image-Guided Image Generation | - |
UnCLIPPipeline
class diffusers.UnCLIPPipeline
( prior: PriorTransformer, decoder: UNet2DConditionModel, text_encoder: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, text_proj: UnCLIPTextProjModel, super_res_first: UNet2DModel, super_res_last: UNet2DModel, prior_scheduler: UnCLIPScheduler, decoder_scheduler: UnCLIPScheduler, super_res_scheduler: UnCLIPScheduler )
Parameters
- text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
- tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
- prior (PriorTransformer) — The canonical unCLIP prior, which approximates the image embedding from the text embedding.
- text_proj (UnCLIPTextProjModel) — Utility class to prepare and combine the embeddings before they are passed to the decoder.
- decoder (UNet2DConditionModel) — The decoder that inverts the image embedding into an image.
- super_res_first (UNet2DModel) — Super-resolution UNet. Used in all but the last step of the super-resolution diffusion process.
- super_res_last (UNet2DModel) — Super-resolution UNet. Used in the last step of the super-resolution diffusion process.
- prior_scheduler (UnCLIPScheduler) — Scheduler used in the prior denoising process. A modified DDPMScheduler.
- decoder_scheduler (UnCLIPScheduler) — Scheduler used in the decoder denoising process. A modified DDPMScheduler.
- super_res_scheduler (UnCLIPScheduler) — Scheduler used in the super-resolution denoising process. A modified DDPMScheduler.
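The split between super_res_first and super_res_last can be pictured as a per-step choice. The sketch below is illustrative only: `pick_super_res_unet` is a hypothetical helper, not part of diffusers, and it assumes steps are indexed from 0.

```python
def pick_super_res_unet(step_index: int, num_steps: int) -> str:
    """Return the name of the super-resolution UNet that handles a given step.

    Hypothetical helper mirroring the parameter descriptions: `super_res_first`
    covers every step except the last, and `super_res_last` covers the last one.
    """
    if not 0 <= step_index < num_steps:
        raise ValueError("step_index must be in [0, num_steps)")
    return "super_res_last" if step_index == num_steps - 1 else "super_res_first"


# Assignment of UNets across 7 super-resolution steps (7 is the documented default).
schedule = [pick_super_res_unet(i, 7) for i in range(7)]
```

With 7 steps, the first six entries of `schedule` name `super_res_first` and the final entry names `super_res_last`.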
Pipeline for text-to-image generation using unCLIP
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
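A minimal usage sketch follows. It assumes the `kakaobrain/karlo-v1-alpha` checkpoint on the Hugging Face Hub and a working install of `torch` and `diffusers`; the helper names `unclip_call_kwargs` and `generate` are hypothetical and exist only for this example. Heavy imports are deferred into `generate` so the argument helper can be inspected without downloading any weights.

```python
def unclip_call_kwargs(prompt: str, generator=None) -> dict:
    """Collect keyword arguments for the pipeline call, using the documented defaults."""
    return {
        "prompt": prompt,
        "num_images_per_prompt": 1,
        "prior_num_inference_steps": 25,
        "decoder_num_inference_steps": 25,
        "super_res_num_inference_steps": 7,
        "prior_guidance_scale": 4.0,
        "decoder_guidance_scale": 8.0,
        "generator": generator,
        "output_type": "pil",
    }


def generate(prompt: str, checkpoint: str = "kakaobrain/karlo-v1-alpha", seed: int = 0):
    """Generate one image from a text prompt (checkpoint name is an assumption)."""
    import torch
    from diffusers import UnCLIPPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = UnCLIPPipeline.from_pretrained(checkpoint).to(device)
    # A seeded generator makes the sampled latents reproducible.
    generator = torch.Generator(device=device).manual_seed(seed)
    output = pipe(**unclip_call_kwargs(prompt, generator))
    return output.images[0]
```

Calling `generate("a photo of a red panda")` would download the checkpoint on first use and return a PIL image.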
__call__
( prompt: typing.Union[str, typing.List[str], NoneType] = None, num_images_per_prompt: int = 1, prior_num_inference_steps: int = 25, decoder_num_inference_steps: int = 25, super_res_num_inference_steps: int = 7, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, prior_latents: typing.Optional[torch.FloatTensor] = None, decoder_latents: typing.Optional[torch.FloatTensor] = None, super_res_latents: typing.Optional[torch.FloatTensor] = None, text_model_output: typing.Union[transformers.models.clip.modeling_clip.CLIPTextModelOutput, typing.Tuple, NoneType] = None, text_attention_mask: typing.Optional[torch.Tensor] = None, prior_guidance_scale: float = 4.0, decoder_guidance_scale: float = 8.0, output_type: typing.Optional[str] = 'pil', return_dict: bool = True )
Parameters
- prompt (str or List[str]) — The prompt or prompts to guide the image generation. This can only be left undefined if text_model_output and text_attention_mask are passed.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the prior. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- decoder_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the decoder. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- super_res_num_inference_steps (int, optional, defaults to 7) — The number of denoising steps for super resolution. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- prior_latents (torch.FloatTensor of shape (batch size, embeddings dimension), optional) — Pre-generated noisy latents to be used as inputs for the prior.
- decoder_latents (torch.FloatTensor of shape (batch size, channels, height, width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.
- super_res_latents (torch.FloatTensor of shape (batch size, channels, super res height, super res width), optional) — Pre-generated noisy latents to be used as inputs for the super-resolution stage.
- prior_guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- decoder_guidance_scale (float, optional, defaults to 8.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- text_model_output (CLIPTextModelOutput, optional) — Pre-defined CLIPTextModel outputs that can be derived from the text encoder. Pre-defined text outputs can be passed for tasks like text embedding interpolations. Make sure to also pass text_attention_mask in this case. prompt can then be left as None.
- text_attention_mask (torch.Tensor, optional) — Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention masks are necessary when passing text_model_output.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
Function invoked when calling the pipeline for generation.
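The guidance scale parameters can be illustrated with the usual classifier-free guidance update, where the conditional and unconditional predictions are blended with weight w. The sketch below uses plain Python lists in place of tensors; `apply_guidance` is a hypothetical helper, and the formula is the commonly used formulation, assumed rather than confirmed to match this pipeline's internals.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Blend unconditional and conditional predictions.

    Computes uncond + w * (cond - uncond), which is algebraically the same as
    w * cond + (1 - w) * uncond. With w = 1 the result equals the conditional
    prediction; w > 1 pushes further in the conditional direction.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]
no_guidance = apply_guidance(uncond_pred, cond_pred, 1.0)      # [1.0, 3.0]
prior_default = apply_guidance(uncond_pred, cond_pred, 4.0)    # [4.0, 9.0]
```

The two calls show why guidance only takes effect above 1: at a scale of 1.0 the output is just the conditional prediction.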
Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and are then moved to torch.device('meta'). Each model is loaded to GPU only when its specific submodule has its forward method called.
UnCLIPImageVariationPipeline
class diffusers.UnCLIPImageVariationPipeline
( decoder: UNet2DConditionModel, text_encoder: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, text_proj: UnCLIPTextProjModel, feature_extractor: CLIPImageProcessor, image_encoder: CLIPVisionModelWithProjection, super_res_first: UNet2DModel, super_res_last: UNet2DModel, decoder_scheduler: UnCLIPScheduler, super_res_scheduler: UnCLIPScheduler )
Parameters
- text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
- tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
- feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the image_encoder.
- image_encoder (CLIPVisionModelWithProjection) — Frozen CLIP image encoder. unCLIP Image Variation uses the vision portion of CLIP, specifically the clip-vit-large-patch14 variant.
- text_proj (UnCLIPTextProjModel) — Utility class to prepare and combine the embeddings before they are passed to the decoder.
- decoder (UNet2DConditionModel) — The decoder that inverts the image embedding into an image.
- super_res_first (UNet2DModel) — Super-resolution UNet. Used in all but the last step of the super-resolution diffusion process.
- super_res_last (UNet2DModel) — Super-resolution UNet. Used in the last step of the super-resolution diffusion process.
- decoder_scheduler (UnCLIPScheduler) — Scheduler used in the decoder denoising process. A modified DDPMScheduler.
- super_res_scheduler (UnCLIPScheduler) — Scheduler used in the super-resolution denoising process. A modified DDPMScheduler.
Pipeline to generate variations from an input image using unCLIP
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
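A minimal usage sketch for image variations follows. It assumes the `kakaobrain/karlo-v1-alpha-image-variations` checkpoint on the Hugging Face Hub and installed `torch`, `diffusers`, and `Pillow`; `variation_call_kwargs` and `make_variations` are hypothetical helper names. The argument helper enforces the rule from the parameter documentation that `image` may be omitted only when `image_embeddings` is supplied.

```python
def variation_call_kwargs(image=None, image_embeddings=None, num_images: int = 1) -> dict:
    """Collect keyword arguments for the variation pipeline call."""
    if image is None and image_embeddings is None:
        raise ValueError("Pass either `image` or `image_embeddings`.")
    kwargs = {
        "num_images_per_prompt": num_images,
        "decoder_num_inference_steps": 25,
        "super_res_num_inference_steps": 7,
        "decoder_guidance_scale": 8.0,
        "output_type": "pil",
    }
    if image is not None:
        kwargs["image"] = image
    if image_embeddings is not None:
        kwargs["image_embeddings"] = image_embeddings
    return kwargs


def make_variations(image_path: str,
                    checkpoint: str = "kakaobrain/karlo-v1-alpha-image-variations",
                    num_images: int = 2):
    """Return a list of PIL images that vary the input (checkpoint name is an assumption)."""
    import torch
    from PIL import Image
    from diffusers import UnCLIPImageVariationPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = UnCLIPImageVariationPipeline.from_pretrained(checkpoint).to(device)
    image = Image.open(image_path).convert("RGB")
    return pipe(**variation_call_kwargs(image=image, num_images=num_images)).images
```

Keeping the argument assembly separate from the model call makes it easy to switch between passing a raw image and passing precomputed embeddings.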
__call__
( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image], torch.FloatTensor, NoneType] = None, num_images_per_prompt: int = 1, decoder_num_inference_steps: int = 25, super_res_num_inference_steps: int = 7, generator: typing.Optional[torch._C.Generator] = None, decoder_latents: typing.Optional[torch.FloatTensor] = None, super_res_latents: typing.Optional[torch.FloatTensor] = None, image_embeddings: typing.Optional[torch.Tensor] = None, decoder_guidance_scale: float = 8.0, output_type: typing.Optional[str] = 'pil', return_dict: bool = True )
Parameters
- image (PIL.Image.Image or List[PIL.Image.Image] or torch.FloatTensor) — The image or images to guide the image generation. If you provide a tensor, it needs to comply with the configuration of the CLIPImageProcessor. Can be left as None only when image_embeddings are passed.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- decoder_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the decoder. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- super_res_num_inference_steps (int, optional, defaults to 7) — The number of denoising steps for super resolution. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- generator (torch.Generator, optional) — A torch generator to make generation deterministic.
- decoder_latents (torch.FloatTensor of shape (batch size, channels, height, width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.
- super_res_latents (torch.FloatTensor of shape (batch size, channels, super res height, super res width), optional) — Pre-generated noisy latents to be used as inputs for the super-resolution stage.
- decoder_guidance_scale (float, optional, defaults to 8.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- image_embeddings (torch.Tensor, optional) — Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings can be passed for tasks like image interpolations. image can then be left as None.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
Function invoked when calling the pipeline for generation.
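The image_embeddings parameter mentions image interpolation. One common way to interpolate between two embedding vectors is spherical linear interpolation (slerp). The sketch below is a pure-Python illustration on small lists; the choice of slerp and the `slerp` helper itself are assumptions for this example, not something this page specifies.

```python
import math
from typing import List


def slerp(v0: List[float], v1: List[float], t: float) -> List[float]:
    """Spherically interpolate between two vectors, with t in [0, 1]."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 == 0.0 or norm1 == 0.0:
        raise ValueError("cannot interpolate a zero vector")
    # Angle between the two vectors, clamped to guard against rounding error.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    sin_theta = math.sin(theta)
    if sin_theta < 1e-6:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

In practice the same arithmetic would be applied to the tensors produced by the image encoder, and the interpolated result passed as image_embeddings with image left as None.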
Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and are then moved to torch.device('meta'). Each model is loaded to GPU only when its specific submodule has its forward method called.