<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable unCLIP

Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion_2) checkpoints to condition on CLIP image embeddings.
Stable unCLIP still conditions on text embeddings as well. Given these two separate conditionings, stable unCLIP can be used
for text-guided image variation. When combined with an unCLIP prior, it can also be used for full text-to-image generation.

To learn more about the unCLIP process, check out the paper [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
## Tips

Stable unCLIP takes a `noise_level` as input during inference. `noise_level` determines how much noise is added
to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default,
no additional noise is added to the image embeddings (`noise_level = 0`).
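
As a hedged sketch of how this looks in practice: `noise_level` is passed directly to the pipeline call, and the value `500` below is just an illustrative choice, not a recommendation:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
)

# noise_level=500 is an arbitrary example value; higher values add more
# noise to the image embeddings and produce more varied outputs.
images = pipe(init_image, noise_level=500).images
images[0].save("noisy_variation.png")
```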
### Available checkpoints

* Image variation
  * [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip)
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
* Text-to-image
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
### Text-to-Image Generation

Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open-source DALL-E 2 replication, [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):
```python
import torch
from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline
from diffusers.models import PriorTransformer
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Load the Karlo prior, which maps text embeddings to CLIP image embeddings.
prior_model_id = "kakaobrain/karlo-v1-alpha"
data_type = torch.float16
prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type)

# The prior was trained with the CLIP ViT-L/14 text encoder.
prior_text_model_id = "openai/clip-vit-large-patch14"
prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id)
prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type)
prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)

# Combine the prior components with the stable unCLIP decoder.
stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small"

pipe = StableUnCLIPPipeline.from_pretrained(
    stable_unclip_model_id,
    torch_dtype=data_type,
    variant="fp16",
    prior_tokenizer=prior_tokenizer,
    prior_text_encoder=prior_text_model,
    prior=prior,
    prior_scheduler=prior_scheduler,
)
pipe = pipe.to("cuda")

wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"

images = pipe(prompt=wave_prompt).images
images[0].save("waves.png")
```
<Tip warning={true}>

For text-to-image, we use `stabilityai/stable-diffusion-2-1-unclip-small` because it was trained on CLIP ViT-L/14 image embeddings, the same encoder used by the Karlo prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use here.

</Tip>
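
If you want to double-check which image-embedding space a checkpoint expects, one hedged way to do so (assuming, as is the case for `StableUnCLIPImg2ImgPipeline`, that the pipeline exposes an `image_encoder` component) is to inspect the encoder's projection dimension:

```python
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip-small")

# CLIP ViT-L/14 projects image embeddings to 768 dimensions, while
# OpenCLIP ViT-H uses 1024, so this distinguishes the two checkpoints.
print(pipe.image_encoder.config.projection_dim)
```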
### Text-Guided Image-to-Image Variation
```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```
Optionally, you can also pass a prompt to `pipe` such as:
```python
prompt = "A fantasy landscape, trending on artstation"

images = pipe(init_image, prompt=prompt).images
images[0].save("variation_image_two.png")
```
### Memory optimization

If you are short on GPU memory, you can enable smart CPU offloading so that models that are not needed
immediately for a computation are offloaded to the CPU:
```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)

# Offload models to the CPU when they are not in use.
pipe.enable_model_cpu_offload()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0]
```
Further memory optimizations are possible by enabling VAE slicing on the pipeline:
```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Decode the VAE in slices to reduce peak memory usage.
pipe.enable_vae_slicing()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0]
```
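
Attention slicing is another memory lever both pipelines expose (see the method lists below). A minimal sketch, assuming the same image-variation setup as above:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# Compute attention in slices, trading some speed for lower peak memory.
pipe.enable_attention_slicing()

init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
)
images = pipe(init_image).images
images[0]
```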
### StableUnCLIPPipeline

[[autodoc]] StableUnCLIPPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention

### StableUnCLIPImg2ImgPipeline

[[autodoc]] StableUnCLIPImg2ImgPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention