Narsil (HF staff) committed
Commit 647c28b
1 Parent(s): 621cfdb

Fp16 version.

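This commit swaps the repository weights to half precision; every LFS object below shrinks to roughly half its previous size. As a quick orientation, here is a minimal sketch of loading these fp16 weights with 🧨 Diffusers, mirroring the float16 note in the README diff below; `use_auth_token=True` assumes the gated-access setup described there:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the half-precision weights added by this commit from the "fp16" revision.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,  # the repository is gated behind the license prompt
)
pipe = pipe.to("cuda")
```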
README.md CHANGED
@@ -6,16 +6,14 @@ tags:
6
  - stable-diffusion-diffusers
7
  - text-to-image
8
  extra_gated_prompt: |-
9
- One more step before getting this model.
10
- This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage.
11
- The CreativeML OpenRAIL License specifies:
12
-
13
  1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
14
- 2. CompVis claims no rights on the outputs you generate, you are free to use them and are accountable for their use which must not go against the provisions set in the license
15
  3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully)
16
  Please read the full license here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
17
-
18
- By clicking on "Access repository" below, you accept that your *contact information* (email address and username) can be shared with the model authors as well.
19
 
20
  extra_gated_fields:
21
  I have read the License and agree with its terms: checkbox
@@ -24,12 +22,10 @@ extra_gated_fields:
24
  # Stable Diffusion v1-4 Model Card
25
 
26
  Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
27
- For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion with 🧨 Diffusers blog](https://huggingface.co/blog/stable_diffusion).
28
 
29
- The **Stable-Diffusion-v1-4** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2)
30
- checkpoint and subsequently fine-tuned for 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" with 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
31
-
32
- These weights are intended to be used with the 🧨 Diffusers library. If you are looking for the weights to be loaded into the CompVis Stable Diffusion codebase, [come here](https://huggingface.co/CompVis/stable-diffusion-v-1-4-original).
33
 
34
  ## Model Details
35
  - **Developed by:** Robin Rombach, Patrick Esser
@@ -53,8 +49,6 @@ This weights here are intended to be used with the 🧨 Diffusers library. If yo
53
 
54
  We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.
55
 
56
- ### PyTorch
57
-
58
  ```bash
59
  pip install --upgrade diffusers transformers scipy
60
  ```
@@ -65,8 +59,7 @@ Run this command to log in with your HF Hub token if you haven't before:
65
  huggingface-cli login
66
  ```
67
 
68
- Running the pipeline with the default PNDM scheduler:
69
-
70
  ```python
71
  import torch
72
  from torch import autocast
@@ -75,32 +68,15 @@ from diffusers import StableDiffusionPipeline
75
  model_id = "CompVis/stable-diffusion-v1-4"
76
  device = "cuda"
77
 
78
-
79
  pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
80
  pipe = pipe.to(device)
81
 
82
- prompt = "a photo of an astronaut riding a horse on mars"
83
  with autocast("cuda"):
84
-     image = pipe(prompt, guidance_scale=7.5).images[0]
85
 
86
- image.save("astronaut_rides_horse.png")
87
- ```
88
-
89
- **Note**:
90
- If you are limited by GPU memory and have less than 10GB of GPU RAM available, please make sure to load the StableDiffusionPipeline in float16 precision instead of the default float32 precision as done above. You can do so by telling diffusers to expect the weights to be in float16 precision:
91
-
92
-
93
- ```py
94
- import torch
95
-
96
- pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, revision="fp16", use_auth_token=True)
97
- pipe = pipe.to(device)
98
-
99
- prompt = "a photo of an astronaut riding a horse on mars"
100
- with autocast("cuda"):
101
-     image = pipe(prompt, guidance_scale=7.5).images[0]
102
-
103
- image.save("astronaut_rides_horse.png")
104
  ```
105
 
106
  To swap out the noise scheduler, pass it to `from_pretrained`:
@@ -113,81 +89,6 @@ model_id = "CompVis/stable-diffusion-v1-4"
113
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
114
  pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
115
  pipe = pipe.to("cuda")
116
-
117
- prompt = "a photo of an astronaut riding a horse on mars"
118
- with autocast("cuda"):
119
-     image = pipe(prompt, guidance_scale=7.5).images[0]
120
-
121
- image.save("astronaut_rides_horse.png")
122
- ```
123
-
124
- ### JAX/Flax
125
-
126
- To use Stable Diffusion on TPUs and GPUs for faster inference, you can leverage JAX/Flax.
127
-
128
- Running the pipeline with the default PNDMScheduler:
129
-
130
- ```python
131
- import jax
132
- import numpy as np
133
- from flax.jax_utils import replicate
134
- from flax.training.common_utils import shard
135
-
136
- from diffusers import FlaxStableDiffusionPipeline
137
-
138
- pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
139
- "CompVis/stable-diffusion-v1-4", revision="flax", dtype=jax.numpy.bfloat16
140
- )
141
-
142
- prompt = "a photo of an astronaut riding a horse on mars"
143
-
144
- prng_seed = jax.random.PRNGKey(0)
145
- num_inference_steps = 50
146
-
147
- num_samples = jax.device_count()
148
- prompt = num_samples * [prompt]
149
- prompt_ids = pipeline.prepare_inputs(prompt)
150
-
151
- # shard inputs and rng
152
- params = replicate(params)
153
- prng_seed = jax.random.split(prng_seed, 8)
154
- prompt_ids = shard(prompt_ids)
155
-
156
- images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
157
- images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
158
- ```
159
-
160
- **Note**:
161
- If you are limited by TPU memory, please make sure to load the `FlaxStableDiffusionPipeline` in `bfloat16` precision instead of the default `float32` precision as done above. You can do so by telling diffusers to load the weights from the "bf16" branch.
162
-
163
- ```python
164
- import jax
165
- import numpy as np
166
- from flax.jax_utils import replicate
167
- from flax.training.common_utils import shard
168
-
169
- from diffusers import FlaxStableDiffusionPipeline
170
-
171
- pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
172
- "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jax.numpy.bfloat16
173
- )
174
-
175
- prompt = "a photo of an astronaut riding a horse on mars"
176
-
177
- prng_seed = jax.random.PRNGKey(0)
178
- num_inference_steps = 50
179
-
180
- num_samples = jax.device_count()
181
- prompt = num_samples * [prompt]
182
- prompt_ids = pipeline.prepare_inputs(prompt)
183
-
184
- # shard inputs and rng
185
- params = replicate(params)
186
- prng_seed = jax.random.split(prng_seed, 8)
187
- prompt_ids = shard(prompt_ids)
188
-
189
- images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
190
- images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
191
  ```
192
 
193
  # Uses
@@ -239,8 +140,6 @@ Using the model to generate content that is cruel to individuals is a misuse of
239
  [LAION-5B](https://laion.ai/blog/laion-5b/) which contains adult material
240
  and is not fit for product use without additional safety mechanisms and
241
  considerations.
242
- - No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data.
243
- The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
244
 
245
  ### Bias
246
 
@@ -251,14 +150,6 @@ Texts and images from communities and cultures that use other languages are like
251
  This affects the overall output of the model, as white and western cultures are often set as the default. Further, the
252
  ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
253
 
254
- ### Safety Module
255
-
256
- The intended use of this model is with the [Safety Checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) in Diffusers.
257
- This checker works by checking model outputs against known hard-coded NSFW concepts.
258
- The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter.
259
- Specifically, the checker compares the class probability of harmful concepts in the embedding space of the `CLIPTextModel` *after generation* of the images.
260
- The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.
261
-
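A rough sketch of how this check is wired up around a loaded pipeline; attribute and argument names follow `StableDiffusionPipeline`'s internals, and the helper `run_safety_check` below is hypothetical:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token=True
)

# The pipeline carries both pieces the check needs.
feature_extractor = pipe.feature_extractor  # CLIPFeatureExtractor
safety_checker = pipe.safety_checker        # StableDiffusionSafetyChecker

def run_safety_check(np_images, pil_images):
    # CLIP-encode the generated images (np_images: the decoded batch as a
    # numpy array, pil_images: the same batch as PIL images)...
    clip_input = feature_extractor(pil_images, return_tensors="pt").pixel_values
    # ...and compare them against the hidden NSFW concept embeddings; flagged
    # images come back blacked out, together with a per-image boolean flag.
    checked_images, has_nsfw = safety_checker(images=np_images, clip_input=clip_input)
    return checked_images, has_nsfw
```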
262
 
263
  ## Training
264
 
@@ -281,8 +172,8 @@ We currently provide four checkpoints, which were trained as follows.
281
  - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
282
  515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
283
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
284
- - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2`. 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
285
- - [`stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4): Resumed from `stable-diffusion-v1-2`. 225,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
286
 
287
  - **Hardware:** 32 x 8 x A100 GPUs
288
  - **Optimizer:** AdamW
@@ -295,7 +186,7 @@ Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
295
  5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling
296
  steps show the relative improvements of the checkpoints:
297
 
298
- ![pareto](https://huggingface.co/CompVis/stable-diffusion/resolve/main/v1-variants-scores.jpg)
299
 
300
  Evaluated using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.
301
  ## Environmental Impact
@@ -323,4 +214,4 @@ Based on that information, we estimate the following CO2 emissions using the [Ma
323
  }
324
  ```
325
 
326
- *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
 
6
  - stable-diffusion-diffusers
7
  - text-to-image
8
  extra_gated_prompt: |-
9
+ One more step before getting this model
10
+ This model is open access and available to all, but it has a CreativeML OpenRAIL-M license that you have to be aware of before using it; don't worry, you are just one click away!
11
+ By clicking on "Access repository" below, you accept that your *contact information* (email address and username) can be shared with the model authors as well.
12
+ Summary of the CreativeML OpenRAIL License:
13
  1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
14
+ 2. We claim no rights on the outputs you generate, you are free to use them and are accountable for their use which should not go against the provisions set in the license
15
  3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully)
16
  Please read the full license here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
 
 
17
 
18
  extra_gated_fields:
19
  I have read the License and agree with its terms: checkbox
 
22
  # Stable Diffusion v1-4 Model Card
23
 
24
  Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
25
+ For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion with 🧨 Diffusers blog](https://huggingface.co/blog/stable_diffusion).
26
 
27
+ The **Stable-Diffusion-v1-4** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-3](https://huggingface.co/CompVis/stable-diffusion-v1-3)
28
+ checkpoint and subsequently fine-tuned on X steps on Y with Z.
 
 
29
 
30
  ## Model Details
31
  - **Developed by:** Robin Rombach, Patrick Esser
 
49
 
50
  We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.
51
 
 
 
52
  ```bash
53
  pip install --upgrade diffusers transformers scipy
54
  ```
 
59
  huggingface-cli login
60
  ```
61
 
62
+ Running the pipeline with the default PLMS scheduler:
 
63
  ```python
64
  import torch
65
  from torch import autocast
 
68
  model_id = "CompVis/stable-diffusion-v1-4"
69
  device = "cuda"
70
 
71
+ generator = torch.Generator(device=device).manual_seed(0)
72
  pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
73
  pipe = pipe.to(device)
74
 
75
+ prompt = "a photograph of an astronaut riding a horse"
76
  with autocast("cuda"):
77
+     image = pipe(prompt, generator=generator)["sample"][0]  # image here is in PIL format
78
 
79
+ image.save(f"astronaut_rides_horse.png")
80
  ```
81
 
82
  To swap out the noise scheduler, pass it to `from_pretrained`:
 
89
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
90
  pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
91
  pipe = pipe.to("cuda")
92
  ```
93
 
94
  # Uses
 
140
  [LAION-5B](https://laion.ai/blog/laion-5b/) which contains adult material
141
  and is not fit for product use without additional safety mechanisms and
142
  considerations.
 
 
143
 
144
  ### Bias
145
 
 
150
  This affects the overall output of the model, as white and western cultures are often set as the default. Further, the
151
  ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
152
153
 
154
  ## Training
155
 
 
172
  - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
173
  515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
174
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
175
+ - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2`. 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
176
+ - [**`stable-diffusion-v1-4`**](https://huggingface.co/CompVis/stable-diffusion-v1-4) *To-fill-here*
177
 
178
  - **Hardware:** 32 x 8 x A100 GPUs
179
  - **Optimizer:** AdamW
 
186
  5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling
187
  steps show the relative improvements of the checkpoints:
188
 
189
+ ![pareto](v1-variants-scores.jpg)
190
 
191
  Evaluated using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.
192
  ## Environmental Impact
 
214
  }
215
  ```
216
 
217
+ *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
model_index.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "_class_name": "StableDiffusionPipeline",
3
- "_diffusers_version": "0.2.2",
4
  "feature_extractor": [
5
  "transformers",
6
  "CLIPFeatureExtractor"
 
1
  {
2
  "_class_name": "StableDiffusionPipeline",
3
+ "_diffusers_version": "0.2.3",
4
  "feature_extractor": [
5
  "transformers",
6
  "CLIPFeatureExtractor"
safety_checker/config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "./safety_module",
3
  "architectures": [
4
  "StableDiffusionSafetyChecker"
5
  ],
@@ -68,6 +68,7 @@
68
  "sep_token_id": null,
69
  "task_specific_params": null,
70
  "temperature": 1.0,
 
71
  "tie_encoder_decoder": false,
72
  "tie_word_embeddings": true,
73
  "tokenizer_class": null,
@@ -75,7 +76,7 @@
75
  "top_p": 1.0,
76
  "torch_dtype": null,
77
  "torchscript": false,
78
- "transformers_version": "4.21.0.dev0",
79
  "typical_p": 1.0,
80
  "use_bfloat16": false,
81
  "vocab_size": 49408
@@ -86,7 +87,7 @@
86
  "num_attention_heads": 12,
87
  "num_hidden_layers": 12
88
  },
89
- "torch_dtype": "float32",
90
  "transformers_version": null,
91
  "vision_config": {
92
  "_name_or_path": "",
@@ -133,6 +134,7 @@
133
  "num_attention_heads": 16,
134
  "num_beam_groups": 1,
135
  "num_beams": 1,
 
136
  "num_hidden_layers": 24,
137
  "num_return_sequences": 1,
138
  "output_attentions": false,
@@ -150,6 +152,7 @@
150
  "sep_token_id": null,
151
  "task_specific_params": null,
152
  "temperature": 1.0,
 
153
  "tie_encoder_decoder": false,
154
  "tie_word_embeddings": true,
155
  "tokenizer_class": null,
@@ -157,7 +160,7 @@
157
  "top_p": 1.0,
158
  "torch_dtype": null,
159
  "torchscript": false,
160
- "transformers_version": "4.21.0.dev0",
161
  "typical_p": 1.0,
162
  "use_bfloat16": false
163
  },
 
1
  {
2
+ "_name_or_path": "./safety_checker",
3
  "architectures": [
4
  "StableDiffusionSafetyChecker"
5
  ],
 
68
  "sep_token_id": null,
69
  "task_specific_params": null,
70
  "temperature": 1.0,
71
+ "tf_legacy_loss": false,
72
  "tie_encoder_decoder": false,
73
  "tie_word_embeddings": true,
74
  "tokenizer_class": null,
 
76
  "top_p": 1.0,
77
  "torch_dtype": null,
78
  "torchscript": false,
79
+ "transformers_version": "4.21.1",
80
  "typical_p": 1.0,
81
  "use_bfloat16": false,
82
  "vocab_size": 49408
 
87
  "num_attention_heads": 12,
88
  "num_hidden_layers": 12
89
  },
90
+ "torch_dtype": "float16",
91
  "transformers_version": null,
92
  "vision_config": {
93
  "_name_or_path": "",
 
134
  "num_attention_heads": 16,
135
  "num_beam_groups": 1,
136
  "num_beams": 1,
137
+ "num_channels": 3,
138
  "num_hidden_layers": 24,
139
  "num_return_sequences": 1,
140
  "output_attentions": false,
 
152
  "sep_token_id": null,
153
  "task_specific_params": null,
154
  "temperature": 1.0,
155
+ "tf_legacy_loss": false,
156
  "tie_encoder_decoder": false,
157
  "tie_word_embeddings": true,
158
  "tokenizer_class": null,
 
160
  "top_p": 1.0,
161
  "torch_dtype": null,
162
  "torchscript": false,
163
+ "transformers_version": "4.21.1",
164
  "typical_p": 1.0,
165
  "use_bfloat16": false
166
  },
safety_checker/pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:193490b58ef62739077262e833bf091c66c29488058681ac25cf7df3d8190974
3
- size 1216061799
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1d37ca6e57ace94e4c2f03ed0f67b6dc83e1ef1160892074917aa68b28e2afc1
3
+ size 608098599
scheduler/scheduler_config.json CHANGED
@@ -1,13 +1,9 @@
1
  {
2
  "_class_name": "PNDMScheduler",
3
- "_diffusers_version": "0.7.0.dev0",
4
  "beta_end": 0.012,
5
  "beta_schedule": "scaled_linear",
6
  "beta_start": 0.00085,
7
  "num_train_timesteps": 1000,
8
- "set_alpha_to_one": false,
9
- "skip_prk_steps": true,
10
- "steps_offset": 1,
11
- "trained_betas": null,
12
- "clip_sample": false
13
  }
 
1
  {
2
  "_class_name": "PNDMScheduler",
3
+ "_diffusers_version": "0.2.3",
4
  "beta_end": 0.012,
5
  "beta_schedule": "scaled_linear",
6
  "beta_start": 0.00085,
7
  "num_train_timesteps": 1000,
8
+ "skip_prk_steps": true
 
 
 
 
9
  }
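The trimmed config keeps only the fields needed to rebuild the sampler. A minimal sketch of constructing the same scheduler by hand from these values; argument names follow the diffusers `PNDMScheduler` signature:

```python
from diffusers import PNDMScheduler

# Values taken directly from scheduler_config.json above.
scheduler = PNDMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
    skip_prk_steps=True,  # skip the Runge-Kutta warm-up steps, i.e. plain PLMS sampling
)
```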
text_encoder/config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "openai/clip-vit-large-patch14",
3
  "architectures": [
4
  "CLIPTextModel"
5
  ],
@@ -18,7 +18,7 @@
18
  "num_attention_heads": 12,
19
  "num_hidden_layers": 12,
20
  "pad_token_id": 1,
21
- "torch_dtype": "float32",
22
- "transformers_version": "4.21.0.dev0",
23
  "vocab_size": 49408
24
  }
 
1
  {
2
+ "_name_or_path": "./text_encoder",
3
  "architectures": [
4
  "CLIPTextModel"
5
  ],
 
18
  "num_attention_heads": 12,
19
  "num_hidden_layers": 12,
20
  "pad_token_id": 1,
21
+ "torch_dtype": "float16",
22
+ "transformers_version": "4.21.1",
23
  "vocab_size": 49408
24
  }
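With the text encoder now stored in float16, it can also be loaded on its own in half precision. A sketch assuming a transformers version whose `from_pretrained` accepts the `subfolder`, `revision`, and `torch_dtype` arguments:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load the fp16 text encoder and its tokenizer from this repository's subfolders.
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    subfolder="text_encoder",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,  # gated repository
)
tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    subfolder="tokenizer",
    revision="fp16",
    use_auth_token=True,
)
```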
text_encoder/pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:770a47a9ffdcfda0b05506a7888ed714d06131d60267e6cf52765d61cf59fd67
3
- size 492305335
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88bd85efb0f84e70521633f578715afb2873db4f2615fdfb1f66e99934715865
3
+ size 246184375
tokenizer/tokenizer_config.json CHANGED
@@ -19,7 +19,7 @@
19
  },
20
  "errors": "replace",
21
  "model_max_length": 77,
22
- "name_or_path": "openai/clip-vit-large-patch14",
23
  "pad_token": "<|endoftext|>",
24
  "special_tokens_map_file": "./special_tokens_map.json",
25
  "tokenizer_class": "CLIPTokenizer",
 
19
  },
20
  "errors": "replace",
21
  "model_max_length": 77,
22
+ "name_or_path": "./tokenizer",
23
  "pad_token": "<|endoftext|>",
24
  "special_tokens_map_file": "./special_tokens_map.json",
25
  "tokenizer_class": "CLIPTokenizer",
unet/config.json CHANGED
@@ -1,6 +1,7 @@
1
  {
2
  "_class_name": "UNet2DConditionModel",
3
- "_diffusers_version": "0.2.2",
 
4
  "act_fn": "silu",
5
  "attention_head_dim": 8,
6
  "block_out_channels": [
 
1
  {
2
  "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.2.3",
4
+ "_name_or_path": "./unet",
5
  "act_fn": "silu",
6
  "attention_head_dim": 8,
7
  "block_out_channels": [
unet/diffusion_pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:62d48b4d841a3178511fa453df0dae59b22089ace64609cc9d5353d0a7f37c26
3
- size 3438354725
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d98edd280d5e040ee77f5802b8e3be3513de757335d1dedc4e495647e7c2d573
3
+ size 1719312805
vae/config.json CHANGED
@@ -1,6 +1,7 @@
1
  {
2
  "_class_name": "AutoencoderKL",
3
- "_diffusers_version": "0.2.2",
 
4
  "act_fn": "silu",
5
  "block_out_channels": [
6
  128,
 
1
  {
2
  "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.2.3",
4
+ "_name_or_path": "./vae",
5
  "act_fn": "silu",
6
  "block_out_channels": [
7
  128,
vae/diffusion_pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1b134cded8eb78b184aefb8805b6b572f36fa77b255c483665dda931fa0130c5
3
- size 334707217
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:51c8904bc921e1e6f354b5fa8e99a1c82ead2f0540114de21557b8abfbb24ad0
3
+ size 167399505
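The halved LFS object sizes in this commit (for example, the UNet going from ~3.4 GB to ~1.7 GB) are consistent with a plain cast of the original float32 weights to float16. A sketch of how such a variant could be produced; this is an assumption about the process, not a record of the exact commands used here:

```python
from diffusers import StableDiffusionPipeline

# Load the full-precision pipeline, cast its models to float16, and save a
# local copy that could be pushed as the "fp16" revision of the repository.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token=True
)
for module in (pipe.unet, pipe.vae, pipe.text_encoder, pipe.safety_checker):
    module.half()  # torch.nn.Module.half(): cast parameters and buffers to fp16
pipe.save_pretrained("./stable-diffusion-v1-4-fp16")
```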