dn6 HF staff committed on
Commit
143869a
1 Parent(s): 3a1aa38

update README

Files changed (1)
  1. README.md +161 -42
README.md CHANGED
@@ -10,13 +10,13 @@ license_link: LICENSE
10
  <!-- Provide a quick summary of what the model is/does. -->
11
  <img src="figures/collage_1.jpg" width="800">
12
 
13
- This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main
14
- difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this
15
- important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
16
- How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
17
- encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
18
- 1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
19
- highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
20
  Diffusion 1.5. <br> <br>
21
  Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions
22
  like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.
@@ -41,60 +41,179 @@ For research purposes, we recommend our `StableCascade` Github repository (https
41
  ### Model Overview
42
  Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
43
  hence the name "Stable Cascade".
44
- Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
45
- However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
46
- spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
47
- a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
48
- image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
49
  for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
50
 
51
  <img src="figures/model-overview.jpg" width="600">
52
 
53
- For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
54
- a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
55
- put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
56
- great results; however, the 1.5 billion version excels at reconstructing small and fine details. Therefore, you will achieve the
57
- best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
58
  its small size.
59
 
60
  ## Evaluation
61
  <img height="300" src="figures/comparison.png"/>
62
- According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
63
- comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
64
- aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
65
  steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
66
 
67
  ## Code Example
68
 
69
  ```python
70
  import torch
71
  from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
72
 
73
- device = "cuda"
74
- num_images_per_prompt = 2
75
 
76
- prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
77
- decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)
78
 
79
- prompt = "Anthropomorphic cat dressed as a pilot"
80
  negative_prompt = ""
81
 
82
- with torch.cuda.amp.autocast(dtype=dtype):
83
-     prior_output = prior(
84
-         prompt=prompt,
85
-         height=1024,
86
-         width=1024,
87
-         negative_prompt=negative_prompt,
88
-         guidance_scale=4.0,
89
-         num_images_per_prompt=num_images_per_prompt,
90
-     )
91
-     decoder_output = decoder(
92
-         image_embeddings=prior_output.image_embeddings,
93
-         prompt=prompt,
94
-         negative_prompt=negative_prompt,
95
-         guidance_scale=0.0,
96
-         output_type="pil",
97
-     ).images
98
  ```
99
 
100
  ## Uses
@@ -113,7 +232,7 @@ Excluded uses are described below.
113
 
114
  ### Out-of-Scope Use
115
 
116
- The model was not trained to produce factual or true representations of people or events,
117
  and therefore using the model to generate such content is out-of-scope for the abilities of this model.
118
  The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
119
 
 
10
  <!-- Provide a quick summary of what the model is/does. -->
11
  <img src="figures/collage_1.jpg" width="800">
12
 
13
+ This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main
14
+ difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this
15
+ important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
16
+ How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
17
+ encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
18
+ 1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
19
+ highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
20
  Diffusion 1.5. <br> <br>
21
  Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions
22
  like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.
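
To make the compression figures above concrete, the following sketch works through the arithmetic. It assumes the latent side length is the image side divided by the compression factor and rounded down (the README only states the resulting 24x24 size); `latent_side` is a hypothetical helper, not part of any library.

```python
def latent_side(image_side: int, compression_factor: float) -> int:
    """Side length of the latent grid, rounded down (rounding rule is an assumption)."""
    return int(image_side / compression_factor)


sd_side = latent_side(1024, 8)        # Stable Diffusion: 1024 / 8 = 128
cascade_side = latent_side(1024, 42)  # Stable Cascade: 1024 / 42 is about 24.4, rounded down to 24

# How many times more spatial positions the Stable Diffusion latent has.
position_ratio = sd_side**2 / cascade_side**2

print(sd_side, cascade_side, round(position_ratio, 1))  # prints: 128 24 28.4
```

The ratio counts spatial positions only and ignores the number of latent channels, so it should not be read as a direct speed-up figure.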
 
41
  ### Model Overview
42
  Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
43
  hence the name "Stable Cascade".
44
+ Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
45
+ However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
46
+ spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
47
+ a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
48
+ image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
49
  for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
50
 
51
  <img src="figures/model-overview.jpg" width="600">
52
 
53
+ For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
54
+ a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
55
+ put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
56
+ great results; however, the 1.5 billion version excels at reconstructing small and fine details. Therefore, you will achieve the
57
+ best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
58
  its small size.
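
As a rough illustration of what the checkpoint sizes above imply, the sketch below sums the stated parameter counts and estimates the storage needed for the weights alone at 16-bit precision. The grouping into `full` and `lite` configurations, the helper names, and the figure of 2 bytes per parameter are assumptions made for illustration; text encoders, activations, and other runtime memory are not counted.

```python
# Parameter counts as stated in the paragraph above.
STAGE_PARAMS = {
    "full": {"stage_c": 3_600_000_000, "stage_b": 1_500_000_000, "stage_a": 20_000_000},
    "lite": {"stage_c": 1_000_000_000, "stage_b": 700_000_000, "stage_a": 20_000_000},
}

BYTES_PER_PARAM_16BIT = 2  # assumption: weights stored as float16 or bfloat16


def total_params(config: str) -> int:
    """Total parameters across Stages A, B and C for the given configuration."""
    return sum(STAGE_PARAMS[config].values())


def weight_gigabytes(config: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return total_params(config) * BYTES_PER_PARAM_16BIT / 1e9


for name in STAGE_PARAMS:
    print(f"{name}: {total_params(name) / 1e9:.2f}B parameters, ~{weight_gigabytes(name):.2f} GB")
# prints:
# full: 5.12B parameters, ~10.24 GB
# lite: 1.72B parameters, ~3.44 GB
```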
59
 
60
  ## Evaluation
61
  <img height="300" src="figures/comparison.png"/>
62
+ According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
63
+ comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
64
+ aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
65
  steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
66
 
67
  ## Code Example
68
 
69
+ ```shell
70
+ pip install diffusers
71
+ ```
72
+
73
  ```python
74
  import torch
75
  from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
76
 
77
+ prompt = "an image of a shiba inu, donning a spacesuit and helmet"
78
+ negative_prompt = ""
79
+
80
+ prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
81
+ decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)
82
+
83
+ prior.enable_model_cpu_offload()
84
+ prior_output = prior(
85
+     prompt=prompt,
86
+     height=1024,
87
+     width=1024,
88
+     negative_prompt=negative_prompt,
89
+     guidance_scale=4.0,
90
+     num_images_per_prompt=1,
91
+     num_inference_steps=20
92
+ )
93
+
94
+ decoder.enable_model_cpu_offload()
95
+ decoder_output = decoder(
96
+     image_embeddings=prior_output.image_embeddings.to(torch.float16),
97
+     prompt=prompt,
98
+     negative_prompt=negative_prompt,
99
+     guidance_scale=0.0,
100
+     output_type="pil",
101
+     num_inference_steps=10
102
+ ).images[0]
103
+ decoder_output.save("cascade.png")
104
+ ```
105
+
106
+ ### Using the Lite Version of the Stage B and Stage C models
107
+
108
+ ```python
109
+ import torch
110
+ from diffusers import (
111
+     StableCascadeDecoderPipeline,
112
+     StableCascadePriorPipeline,
113
+     StableCascadeUNet,
114
+ )
115
+
116
+ prompt = "an image of a shiba inu, donning a spacesuit and helmet"
117
+ negative_prompt = ""
118
+
119
+ prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
120
+ decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")
121
+
122
+ prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
123
+ decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)
124
+
125
+ prior.enable_model_cpu_offload()
126
+ prior_output = prior(
127
+     prompt=prompt,
128
+     height=1024,
129
+     width=1024,
130
+     negative_prompt=negative_prompt,
131
+     guidance_scale=4.0,
132
+     num_images_per_prompt=1,
133
+     num_inference_steps=20
134
+ )
135
+
136
+ decoder.enable_model_cpu_offload()
137
+ decoder_output = decoder(
138
+     image_embeddings=prior_output.image_embeddings,
139
+     prompt=prompt,
140
+     negative_prompt=negative_prompt,
141
+     guidance_scale=0.0,
142
+     output_type="pil",
143
+     num_inference_steps=10
144
+ ).images[0]
145
+ decoder_output.save("cascade.png")
146
+ ```
147
+
148
+ ### Loading original checkpoints with `from_single_file`
149
+
150
+ The original-format checkpoints can be loaded via the `from_single_file` method of `StableCascadeUNet`.
151
+
152
+ ```python
153
+ import torch
154
+ from diffusers import (
155
+     StableCascadeDecoderPipeline,
156
+     StableCascadePriorPipeline,
157
+     StableCascadeUNet,
158
+ )
159
+
160
+ prompt = "an image of a shiba inu, donning a spacesuit and helmet"
161
+ negative_prompt = ""
162
+
163
+ prior_unet = StableCascadeUNet.from_single_file(
164
+     "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
165
+     torch_dtype=torch.bfloat16
166
+ )
167
+ decoder_unet = StableCascadeUNet.from_single_file(
168
+     "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
169
+     torch_dtype=torch.bfloat16
170
+ )
171
+
172
+ prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
173
+ decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)
174
+
175
+ prior.enable_model_cpu_offload()
176
+ prior_output = prior(
177
+     prompt=prompt,
178
+     height=1024,
179
+     width=1024,
180
+     negative_prompt=negative_prompt,
181
+     guidance_scale=4.0,
182
+     num_images_per_prompt=1,
183
+     num_inference_steps=20
184
+ )
185
+
186
+ decoder.enable_model_cpu_offload()
187
+ decoder_output = decoder(
188
+     image_embeddings=prior_output.image_embeddings,
189
+     prompt=prompt,
190
+     negative_prompt=negative_prompt,
191
+     guidance_scale=0.0,
192
+     output_type="pil",
193
+     num_inference_steps=10
194
+ ).images[0]
195
+ decoder_output.save("cascade-single-file.png")
196
+ ```
197
+
198
+ ### Using the `StableCascadeCombinedPipeline`
199
+
200
+ ```python
201
+ from diffusers import StableCascadeCombinedPipeline
202
 
203
+ pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)
 
204
 
205
+ prompt = "an image of a shiba inu, donning a spacesuit and helmet"
206
  negative_prompt = ""
207
 
208
+ pipe(
209
+ prompt="photorealistic portrait artwork of an floral robot with a dark night cyberpunk city background",
210
+ negative_prompt="",
211
+ num_inference_steps=10,
212
+ prior_num_inference_steps=20,
213
+ prior_guidance_scale=3.0,
214
+ width=1024,
215
+ height=1024,
216
+ ).images[0].save("cascade-combined.png")
 
 
 
 
 
 
 
217
  ```
218
 
219
  ## Uses
 
232
 
233
  ### Out-of-Scope Use
234
 
235
+ The model was not trained to produce factual or true representations of people or events,
236
  and therefore using the model to generate such content is out-of-scope for the abilities of this model.
237
  The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
238