patrickvonplaten committed on
Commit
6539942
1 Parent(s): e36e180

Update README.md

Files changed (1)
  1. README.md +40 -46
README.md CHANGED
@@ -6,8 +6,13 @@ tags:
 inference: false
 ---

- # Stable Diffusion v1 Model Card
- This model card focuses on the model associated with the Stable Diffusion model, available [here](https://github.com/CompVis/stable-diffusion).

 ## Model Details
 - **Developed by:** Robin Rombach, Patrick Esser
@@ -27,28 +32,36 @@ This model card focuses on the model associated with the Stable Diffusion model,
 pages = {10684-10695}
 }

- ## Usage examples

 ```bash
 pip install --upgrade diffusers transformers scipy
 ```

 Run this command to log in with your HF Hub token if you haven't before:
 ```bash
 huggingface-cli login
 ```

 Running the pipeline with the default PLMS scheduler:
 ```python
 from torch import autocast
 from diffusers import StableDiffusionPipeline

- model_id = "CompVis/stable-diffusion-v1-1-diffusers"
- pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True).to("cuda")

 prompt = "a photograph of an astronaut riding a horse"
 with autocast("cuda"):
-     image = pipe(prompt, guidance_scale=7)["sample"][0]  # image here is in PIL format

 image.save(f"astronaut_rides_horse.png")
 ```
@@ -58,10 +71,11 @@ To swap out the noise scheduler, pass it to `from_pretrained`:
 ```python
 from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

- model_id = "CompVis/stable-diffusion-v1-1-diffusers"
 # Use the K-LMS scheduler here instead
 scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
- pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True).to("cuda")
 ```

 # Uses
@@ -83,8 +97,10 @@ _Note: This section is taken from the [DALLE-MINI model card](https://huggingfac


 The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
 #### Out-of-Scope Use
 The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
 #### Misuse and Malicious Use
 Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

@@ -113,6 +129,7 @@ Using the model to generate content that is cruel to individuals is a misuse of
 considerations.

 ### Bias
 While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
 Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
 which consists of images that are primarily limited to English descriptions.
@@ -129,23 +146,27 @@ The model developers used the following dataset for training the model:
 - LAION-2B (en) and subsets thereof (see next section)

 **Training Procedure**
- Stable Diffusion v1 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

 - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
 - Text prompts are encoded through a ViT-L/14 text-encoder.
 - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
 - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.

- We currently provide three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt` and `sd-v1-3.ckpt`,
- which were trained as follows,
-
- - `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
- 194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- - `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
- 515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
 filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- - `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
-

 - **Hardware:** 32 x 8 x A100 GPUs
 - **Optimizer:** AdamW
@@ -172,33 +193,6 @@ Based on that information, we estimate the following CO2 emissions using the [Ma
 - **Compute Region:** US-east
 - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.

- ## Usage
-
- ### Setup
-
- - Install `diffusers` with
-
- `pip install -U git+https://github.com/huggingface/diffusers.git`
- - Install `transformers` with
-
- `pip install transformers`
-
- ```python
- import torch
- from diffusers import StableDiffusionPipeline
-
- pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1-diffusers")
-
- prompt = "19th Century wooden engraving of Elon musk"
-
- seed = torch.manual_seed(1024)
- images = pipe([prompt], num_inference_steps=50, guidance_scale=7.5, generator=seed)["sample"]
-
- # save images
- for idx, image in enumerate(images):
-     image.save(f"image-{idx}.png")
- ```
-

 ## Citation

@@ -213,4 +207,4 @@ for idx, image in enumerate(images):
 }
 ```

- *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*

 inference: false
 ---

+ # Stable Diffusion v1-1 Model Card
+
+ Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
+ For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion with D🧨iffusers blog](https://huggingface.co/blog/stable_diffusion).
+
+ The **Stable-Diffusion-v1-1** checkpoint was trained for 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en), followed by
+ 194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`). For more information, please refer to [Training](#training).

 ## Model Details
 - **Developed by:** Robin Rombach, Patrick Esser

 pages = {10684-10695}
 }

+ ## Examples
+
+ We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.

 ```bash
 pip install --upgrade diffusers transformers scipy
 ```

 Run this command to log in with your HF Hub token if you haven't before:
+
 ```bash
 huggingface-cli login
 ```

 Running the pipeline with the default PLMS scheduler:
 ```python
+ import torch
 from torch import autocast
 from diffusers import StableDiffusionPipeline

+ model_id = "CompVis/stable-diffusion-v1-1"
+ device = "cuda"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
+ pipe = pipe.to(device)

 prompt = "a photograph of an astronaut riding a horse"
 with autocast("cuda"):
+     image = pipe(prompt, generator=generator)["sample"][0]  # image here is in PIL format

 image.save(f"astronaut_rides_horse.png")
 ```

 ```python
 from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

+ model_id = "CompVis/stable-diffusion-v1-1"
 # Use the K-LMS scheduler here instead
 scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
+ pipe = pipe.to("cuda")
 ```
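Generation with the K-LMS pipeline follows the same call pattern as the PLMS example above. A minimal sketch for completeness; the prompt and output filename are illustrative and not part of the original card:

```python
from torch import autocast

# Run the K-LMS pipeline created above; prompt and filename are placeholders.
prompt = "a photograph of an astronaut riding a horse"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]  # PIL image

image.save("astronaut_rides_horse_klms.png")
```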

 # Uses


 The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
+
 #### Out-of-Scope Use
 The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
+
 #### Misuse and Malicious Use
 Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:


 considerations.

 ### Bias
+
 While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
 Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
 which consists of images that are primarily limited to English descriptions.

 - LAION-2B (en) and subsets thereof (see next section)

 **Training Procedure**
+ Stable Diffusion v1-1 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

 - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
 - Text prompts are encoded through a ViT-L/14 text-encoder.
 - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
 - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
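As a quick illustration of the latent shapes described above, a minimal sketch; the `512x512` resolution is taken from the training description, and the code itself is not from the model card:

```python
# Illustrative only: the latent shape implied by the downsampling factor f = 8.
H, W = 512, 512                     # image resolution used for the 512x512 checkpoints
f = 8                               # relative downsampling factor of the autoencoder
latent_shape = (H // f, W // f, 4)  # H/f x W/f x 4
print(latent_shape)                 # -> (64, 64, 4)
```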

+ We currently provide four checkpoints,
+ - [`stable-diffusion-v1-1`](https://huggingface.co/CompVis/stable-diffusion-v1-1),
+ - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2),
+ - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3), and
+ - [`stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4).
+
+ The checkpoints were trained as follows:
+ - `stable-diffusion-v1-1`: 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
+ 194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
+ - `stable-diffusion-v1-2`: Resumed from `stable-diffusion-v1-1`.
+ 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
 filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
+ - `stable-diffusion-v1-3`: Resumed from `stable-diffusion-v1-2`. 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+ - *`stable-diffusion-v1-4`*: ...

 - **Hardware:** 32 x 8 x A100 GPUs
 - **Optimizer:** AdamW

 - **Compute Region:** US-east
 - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
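The figure above is the plain product spelled out in parentheses. A minimal sketch of that calculation with placeholder values only; the actual power draw, training time, and grid intensity are not listed in this excerpt:

```python
# Placeholder values only -- NOT the actual Stable Diffusion training figures.
avg_power_kw = 100.0     # average power draw of the training hardware, in kW
hours = 1_000.0          # total training time, in hours
kg_co2_per_kwh = 0.4     # carbon intensity of the compute region's grid

carbon_emitted_kg = avg_power_kw * hours * kg_co2_per_kwh
print(f"{carbon_emitted_kg:,.0f} kg CO2 eq.")
```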


 ## Citation

 }
 ```

+ *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*