patrickvonplaten committed on
Commit
c15a211
1 Parent(s): 56bf7e3

Update README.md

Files changed (1)
  1. README.md +38 -48
README.md CHANGED
@@ -6,8 +6,14 @@ tags:
  inference: false
  ---
 
- # Stable Diffusion v1 Model Card
- This model card focuses on the model associated with the Stable Diffusion model, available [here](https://github.com/CompVis/stable-diffusion).
 
  ## Model Details
  - **Developed by:** Robin Rombach, Patrick Esser
@@ -27,28 +33,36 @@ This model card focuses on the model associated with the Stable Diffusion model,
  pages = {10684-10695}
  }
 
- ## Usage examples
 
  ```bash
  pip install --upgrade diffusers transformers scipy
  ```
 
  Run this command to log in with your HF Hub token if you haven't before:
  ```bash
  huggingface-cli login
  ```
 
  Running the pipeline with the default PLMS scheduler:
  ```python
  from torch import autocast
  from diffusers import StableDiffusionPipeline
 
- model_id = "CompVis/stable-diffusion-v1-2-diffusers"
- pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True).to("cuda")
 
  prompt = "a photograph of an astronaut riding a horse"
  with autocast("cuda"):
- image = pipe(prompt, guidance_scale=7)["sample"][0]  # image here is in PIL format
 
  image.save(f"astronaut_rides_horse.png")
  ```
@@ -58,13 +72,13 @@ To swap out the noise scheduler, pass it to `from_pretrained`:
  ```python
  from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler
 
- model_id = "CompVis/stable-diffusion-v1-2-diffusers"
  # Use the K-LMS scheduler here instead
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
- pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True).to("cuda")
  ```
 
-
  # Uses
 
  ## Direct Use
@@ -84,8 +98,10 @@ _Note: This section is taken from the [DALLE-MINI model card](https://huggingfac
 
 
  The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
  #### Out-of-Scope Use
  The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
  #### Misuse and Malicious Use
  Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
 
@@ -114,6 +130,7 @@ Using the model to generate content that is cruel to individuals is a misuse of
  considerations.
 
  ### Bias
  While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
  Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
  which consists of images that are primarily limited to English descriptions.
@@ -124,29 +141,29 @@ ability of the model to generate content with non-English prompts is significant
 
  ## Training
 
- **Training Data**
  The model developers used the following dataset for training the model:
 
  - LAION-2B (en) and subsets thereof (see next section)
 
- **Training Procedure**
- Stable Diffusion v1 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
 
  - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
  - Text prompts are encoded through a ViT-L/14 text-encoder.
  - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
 
- We currently provide three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt` and `sd-v1-3.ckpt`,
- which were trained as follows,
-
- - `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
- 194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- - `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
- 515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- - `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
 
  - **Hardware:** 32 x 8 x A100 GPUs
  - **Optimizer:** AdamW
@@ -173,33 +190,6 @@ Based on that information, we estimate the following CO2 emissions using the [Ma
  - **Compute Region:** US-east
  - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
 
- ## Usage
-
- ### Setup
-
- - Install `diffusers` with
-
- `pip install -U git+https://github.com/huggingface/diffusers.git`
- - Install `transformers` with
-
- `pip install transformers`
-
- ```python
- import torch
- from diffusers import StableDiffusionPipeline
-
- pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-2-diffusers")
-
- prompt = "19th Century wooden engraving of Elon musk"
-
- seed = torch.manual_seed(1024)
- images = pipe([prompt], num_inference_steps=50, guidance_scale=7.5, generator=seed)["sample"]
-
- # save images
- for idx, image in enumerate(images):
- image.save(f"image-{idx}.png")
- ```
-
 
  ## Citation
 
@@ -214,4 +204,4 @@ for idx, image in enumerate(images):
  }
  ```
 
- *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
  inference: false
  ---
 
+ # Stable Diffusion v1-2 Model Card
+
+ Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
+ For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion with D🧨iffusers blog](https://huggingface.co/blog/stable_diffusion).
+
+ The **Stable-Diffusion-v1-2** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-1](https://huggingface.co/CompVis/stable-diffusion-v1-1)
+ checkpoint and subsequently fine-tuned for 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
+ filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`).
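As a rough illustration of that filtering step, here is a hypothetical sketch; the column names stand in for the LAION metadata fields and this is not the actual preprocessing code:

```python
import pandas as pd

# Hypothetical sketch of the "laion-improved-aesthetics" filtering criteria described above.
metadata = pd.read_parquet("laion2B-en-metadata.parquet")  # placeholder path

subset = metadata[
    (metadata["original_width"] >= 512)
    & (metadata["original_height"] >= 512)
    & (metadata["aesthetic_score"] > 5.0)        # from the improved aesthetics estimator
    & (metadata["watermark_probability"] < 0.5)  # from the LAION-5B watermark estimate
]
```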
 
  ## Model Details
  - **Developed by:** Robin Rombach, Patrick Esser
 
  pages = {10684-10695}
  }
 
+ ## Examples
+
+ We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.
 
  ```bash
  pip install --upgrade diffusers transformers scipy
  ```
 
  Run this command to log in with your HF Hub token if you haven't before:
+
  ```bash
  huggingface-cli login
  ```
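If you are working in a notebook rather than a shell, you can log in from Python instead. A minimal sketch, assuming the `huggingface_hub` client library (installed alongside `transformers` and `diffusers`) is available:

```python
from huggingface_hub import notebook_login

# Prompts for the same HF Hub token as `huggingface-cli login`.
notebook_login()
```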
 
  Running the pipeline with the default PLMS scheduler:
  ```python
+ import torch
  from torch import autocast
  from diffusers import StableDiffusionPipeline
 
+ model_id = "CompVis/stable-diffusion-v1-2"
+ device = "cuda"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
+ pipe = pipe.to(device)
 
  prompt = "a photograph of an astronaut riding a horse"
  with autocast("cuda"):
+     image = pipe(prompt, generator=generator)["sample"][0]  # pass the seeded generator for reproducible output; image here is in PIL format
 
  image.save(f"astronaut_rides_horse.png")
  ```
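If you are constrained by GPU memory, you can load the weights in half precision. A hedged sketch of the same example, assuming a `diffusers` version that accepts the `torch_dtype` argument in `from_pretrained`:

```python
import torch
from diffusers import StableDiffusionPipeline

# Variant of the example above: fp16 weights roughly halve GPU memory use.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-2",
    torch_dtype=torch.float16,
    use_auth_token=True,
)
pipe = pipe.to("cuda")

image = pipe("a photograph of an astronaut riding a horse")["sample"][0]
image.save("astronaut_rides_horse_fp16.png")
```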
 
  ```python
  from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler
 
+ model_id = "CompVis/stable-diffusion-v1-2"
  # Use the K-LMS scheduler here instead
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
+ pipe = pipe.to("cuda")
  ```
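Swapping the scheduler changes only how the latents are denoised; the pipeline is then called exactly as in the PLMS example above. A minimal sketch, reusing the `pipe` built in the previous block:

```python
from torch import autocast

# Same generation call as before, now running with the K-LMS scheduler.
prompt = "a photograph of an astronaut riding a horse"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]

image.save("astronaut_rides_horse_klms.png")
```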
 
  # Uses
 
  ## Direct Use
 
 
  The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
+
  #### Out-of-Scope Use
  The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
+
  #### Misuse and Malicious Use
  Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
 
  considerations.
 
  ### Bias
+
  While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
  Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
  which consists of images that are primarily limited to English descriptions.
 
  ## Training
 
+ ### Training Data
  The model developers used the following dataset for training the model:
 
  - LAION-2B (en) and subsets thereof (see next section)
 
+ ### Training Procedure
+ Stable Diffusion v1-2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training (see the schematic sketch after this list),
 
  - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
  - Text prompts are encoded through a ViT-L/14 text-encoder.
  - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
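To make these steps concrete, here is a heavily simplified, hypothetical sketch of a single training step. The objects `vae`, `text_encoder`, `unet`, and `noise_scheduler` are placeholders; this is not the model's actual training code:

```python
import torch
import torch.nn.functional as F

def training_step(vae, text_encoder, unet, noise_scheduler, images, input_ids):
    # 1. Encode images into latents (downsampling factor 8: H x W x 3 -> H/8 x W/8 x 4);
    #    0.18215 is the latent scaling factor used by Stable Diffusion.
    latents = vae.encode(images).latent_dist.sample() * 0.18215

    # 2. Encode the text prompts; the non-pooled hidden states condition the UNet.
    text_embeddings = text_encoder(input_ids)[0]

    # 3. Add noise to the latents at randomly drawn timesteps.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 4. Predict the added noise with the UNet (text conditioning enters via cross-attention)
    #    and regress the prediction against the noise that was actually added.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    return F.mse_loss(noise_pred, noise)
```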
 
+ We currently provide four checkpoints, which were trained as follows:
+ - [`stable-diffusion-v1-1`](https://huggingface.co/CompVis/stable-diffusion-v1-1): 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
+ 194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
+ - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
+ 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
+ - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2`. 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598) (see the short sketch below the list).
+ - [**`stable-diffusion-v1-4`**](https://huggingface.co/CompVis/stable-diffusion-v1-4) *To-fill-here*
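For reference, classifier-free guidance combines an unconditional and a text-conditioned noise prediction at sampling time; dropping the text conditioning for roughly 10% of the training steps is what gives the model a usable unconditional prediction. A minimal, illustrative sketch, not the pipeline's actual implementation:

```python
import torch

def classifier_free_guidance(noise_pred_uncond: torch.Tensor,
                             noise_pred_text: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    # Push the unconditional prediction towards the text-conditioned one;
    # guidance_scale > 1 strengthens prompt adherence at the cost of diversity.
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```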
 
+ ### Training details
 
  - **Hardware:** 32 x 8 x A100 GPUs
  - **Optimizer:** AdamW
 
  - **Compute Region:** US-east
  - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
 
 
  ## Citation
 
  }
  ```
 
+ *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*