valhalla dn6 (HF staff) committed on
Commit edf1afb
1 Parent(s): e3aee2f

Update diffusers format weights (#44)


- update diffusers weights (c6453e584d2a95f191eaeb5518e59818b04c4202)
- update diffusers weights (1b70b878d78ddfbae8fabb00f9df2a04e4b936a9)
- update model card (6dea633bdf22f1757317287de347f3e7d8511e0e)
- update (74c134989b2833b16cc5224046b94015de60cef0)
- update README (e6b0429a990ee8ed0f45aaa01c38967dcbb4204a)
- update (abc818bb0df76e3b603c6a8e1edfffd0cfdd7649)
- update README (306bac684746c9b91d6f065f6e9bc15716cdc2e4)
- update README (6c9811380ee0f0c5e4846b26ae72a6aecefd1303)
- update (c241f065d767cf1507439f2421c398e184abb8b8)


Co-authored-by: Dhruv Nair <dn6@users.noreply.huggingface.co>

README.md CHANGED
@@ -3,6 +3,8 @@ pipeline_tag: text-to-image
 license: other
 license_name: stable-cascade-nc-community
 license_link: LICENSE
 ---
 
 # Stable Cascade
@@ -10,13 +12,13 @@ license_link: LICENSE
 <!-- Provide a quick summary of what the model is/does. -->
 <img src="figures/collage_1.jpg" width="800">
 
-This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture; its main
-difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this
-important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
-How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
-encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
-1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
-highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
 Diffusion 1.5. <br> <br>
 Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions
 like finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are possible with this method as well.
@@ -41,69 +43,181 @@ For research purposes, we recommend our `StableCascade` Github repository (https
 ### Model Overview
 Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
 hence the name "Stable Cascade".
-Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
-However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
-spatial compression factor of 8, encoding an image with a resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
-a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
-image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
 for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
 
 <img src="figures/model-overview.jpg" width="600">
 
-For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in
-a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
-put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
-great results; however, the 1.5 billion parameter version excels at reconstructing small and fine details. Therefore, you will achieve the
-best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
 its small size.
 
 ## Evaluation
 <img height="300" src="figures/comparison.png"/>
-According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
-comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
-aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
 steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
 
 ## Code Example
 
-**⚠️ Important**: For the code below to work, you have to install `diffusers` from this branch while the PR is WIP.
 
 ```shell
-pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
 ```
 
 ```python
 import torch
 from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
 
-device = "cuda"
-num_images_per_prompt = 2
 
-prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
-decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)
 
-prompt = "Anthropomorphic cat dressed as a pilot"
 negative_prompt = ""
 
 prior_output = prior(
     prompt=prompt,
     height=1024,
     width=1024,
     negative_prompt=negative_prompt,
     guidance_scale=4.0,
-    num_images_per_prompt=num_images_per_prompt,
     num_inference_steps=20
 )
 decoder_output = decoder(
-    image_embeddings=prior_output.image_embeddings.half(),
     prompt=prompt,
     negative_prompt=negative_prompt,
     guidance_scale=0.0,
     output_type="pil",
     num_inference_steps=10
-).images
 
-# Now decoder_output is a list with your PIL images
 ```
 
 ## Uses
@@ -122,7 +236,7 @@ Excluded uses are described below.
 
 ### Out-of-Scope Use
 
-The model was not trained to be factual or true representations of people or events,
 and therefore using the model to generate such content is out-of-scope for the abilities of this model.
 The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
 
@@ -139,4 +253,4 @@ The model is intended for research purposes only.
 
 ## How to Get Started with the Model
 
-Check out https://github.com/Stability-AI/StableCascade
 
 license: other
 license_name: stable-cascade-nc-community
 license_link: LICENSE
+prior:
+- stabilityai/stable-cascade-prior
 ---
 
 # Stable Cascade
 
 <!-- Provide a quick summary of what the model is/does. -->
 <img src="figures/collage_1.jpg" width="800">
 
+This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture; its main
+difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this
+important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
+encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
+1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
+highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
 Diffusion 1.5. <br> <br>
 Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions
 like finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are possible with this method as well. As a rough sanity check of the compression factors above, see the sketch below.
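
A minimal sketch of the spatial arithmetic behind those numbers (illustrative only; the actual encoders also change the channel count):

```python
# Spatial side length of the latent for a 1024x1024 input image.
image_size = 1024

sd_latent = image_size // 8        # Stable Diffusion, factor 8  -> 128x128
cascade_latent = image_size // 42  # Stable Cascade, factor 42  -> ~24x24

print(sd_latent, cascade_latent)   # 128 24
```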
 
 ### Model Overview
 Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
 hence the name "Stable Cascade".
+Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
+However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+spatial compression factor of 8, encoding an image with a resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
+a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
+image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
 for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
 
 <img src="figures/model-overview.jpg" width="600">
 
+For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in
+a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
+put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
+great results; however, the 1.5 billion parameter version excels at reconstructing small and fine details. Therefore, you will achieve the
+best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
 its small size.
 
 ## Evaluation
 <img height="300" src="figures/comparison.png"/>
+According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
+comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
+aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
 steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
 
 ## Code Example
 
+**Note:** In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline`, you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the `StableCascadeDecoderPipeline` internally.
+
+If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the `torch.float16` data type. You can download the full precision or bf16 variant weights for the pipeline and cast the weights to `torch.float16`.
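
A minimal sketch of that fallback, assuming nothing beyond the `from_pretrained` and `to` calls already used in this card:

```python
import torch
from diffusers import StableCascadeDecoderPipeline

# Load the full-precision decoder weights, then cast the whole pipeline to
# float16; this avoids bfloat16 and so works on PyTorch versions below 2.2.0.
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade")
decoder = decoder.to(torch.float16)
```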
 
 ```shell
+pip install diffusers
 ```
 
 ```python
 import torch
 from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
 
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+negative_prompt = ""
+
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)
+
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=1,
+    num_inference_steps=20
+)
+
+decoder.enable_model_cpu_offload()
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings.to(torch.float16),
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+    num_inference_steps=10
+).images[0]
+decoder_output.save("cascade.png")
+```
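
To see the compressed latent discussed earlier, you can inspect the prior output before decoding; the shape below is a hedged expectation implied by this commit's configs (16 `effnet` channels at 24 x 24 for a 1024 x 1024 image):

```python
# Continuing from the example above: Stage C's output lives in the
# highly compressed latent space (expected: torch.Size([1, 16, 24, 24])).
print(prior_output.image_embeddings.shape)
```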
+
+### Using the Lite Version of the Stage B and Stage C models
+
+```python
+import torch
+from diffusers import (
+    StableCascadeDecoderPipeline,
+    StableCascadePriorPipeline,
+    StableCascadeUNet,
+)
+
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+negative_prompt = ""
+
+prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
+decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")
+
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)
+
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=1,
+    num_inference_steps=20
+)
+
+decoder.enable_model_cpu_offload()
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings,
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+    num_inference_steps=10
+).images[0]
+decoder_output.save("cascade.png")
+```
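
The lite UNets above load in full precision; if memory is a concern, the same `torch_dtype` argument used in the other examples should apply here as well. A hedged sketch for the Stage C lite prior (PyTorch 2.2.0 or higher applies, as noted above):

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeUNet

# Sketch: load the lite Stage C UNet directly in bfloat16 to reduce memory.
prior_unet = StableCascadeUNet.from_pretrained(
    "stabilityai/stable-cascade-prior", subfolder="prior_lite", torch_dtype=torch.bfloat16
)
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16
)
```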
+
+### Loading original checkpoints with `from_single_file`
+
+Loading the original format checkpoints is supported via the `from_single_file` method in the `StableCascadeUNet`.
 
+```python
+import torch
+from diffusers import (
+    StableCascadeDecoderPipeline,
+    StableCascadePriorPipeline,
+    StableCascadeUNet,
+)
+
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
 negative_prompt = ""
 
+prior_unet = StableCascadeUNet.from_single_file(
+    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
+    torch_dtype=torch.bfloat16
+)
+decoder_unet = StableCascadeUNet.from_single_file(
+    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
+    torch_dtype=torch.bfloat16
+)
+
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)
+
+prior.enable_model_cpu_offload()
 prior_output = prior(
     prompt=prompt,
     height=1024,
     width=1024,
     negative_prompt=negative_prompt,
     guidance_scale=4.0,
+    num_images_per_prompt=1,
     num_inference_steps=20
 )
+
+decoder.enable_model_cpu_offload()
 decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings,
     prompt=prompt,
     negative_prompt=negative_prompt,
     guidance_scale=0.0,
     output_type="pil",
     num_inference_steps=10
+).images[0]
+decoder_output.save("cascade-single-file.png")
+```
+
+### Using the `StableCascadeCombinedPipeline`
+
+```python
+import torch
+from diffusers import StableCascadeCombinedPipeline
 
+pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)
+
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+pipe(
+    prompt=prompt,
+    negative_prompt="",
+    num_inference_steps=10,
+    prior_num_inference_steps=20,
+    prior_guidance_scale=3.0,
+    width=1024,
+    height=1024,
+).images[0].save("cascade-combined.png")
 ```
 
 ## Uses
 
 ### Out-of-Scope Use
 
+The model was not trained to be factual or true representations of people or events,
 and therefore using the model to generate such content is out-of-scope for the abilities of this model.
 The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
 
 ## How to Get Started with the Model
 
+Check out https://github.com/Stability-AI/StableCascade
decoder/config.json CHANGED
@@ -1,74 +1,83 @@
 {
-  "_class_name": "StableCascadeUnet",
-  "_diffusers_version": "0.26.0.dev0",
-  "_name_or_path": "StableCascade/decoder",
-  "block_repeat": [
     [
-      1,
-      1,
-      1,
-      1
     ],
     [
-      3,
-      3,
-      2,
-      2
-    ]
-  ],
-  "blocks": [
     [
-      2,
-      6,
-      28,
-      6
     ],
     [
-      6,
-      28,
-      6,
-      2
     ]
   ],
-  "c_clip_img": null,
-  "c_clip_seq": 4,
-  "c_clip_text": null,
-  "c_clip_text_pooled": 1280,
-  "c_cond": 1280,
-  "c_effnet": 16,
-  "c_hidden": [
-    320,
-    640,
-    1280,
-    1280
   ],
-  "c_in": 4,
-  "c_out": 4,
-  "c_pixels": 3,
-  "c_r": 64,
   "dropout": [
     0,
     0,
     0.1,
     0.1
   ],
   "kernel_size": 3,
-  "level_config": [
-    "CT",
-    "CT",
-    "CTA",
-    "CTA"
-  ],
-  "nhead": [
-    -1,
-    -1,
     20,
     20
   ],
   "patch_size": 2,
   "self_attn": true,
   "switch_level": null,
-  "t_conds": [
     "sca"
   ]
 }
 
 {
+  "_class_name": "StableCascadeUNet",
+  "_diffusers_version": "0.27.0.dev0",
+  "block_out_channels": [
+    320,
+    640,
+    1280,
+    1280
+  ],
+  "block_types_per_layer": [
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock"
     ],
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock"
+    ],
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
     ],
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
     ]
   ],
+  "clip_image_in_channels": null,
+  "clip_seq": 4,
+  "clip_text_in_channels": null,
+  "clip_text_pooled_in_channels": 1280,
+  "conditioning_dim": 1280,
+  "down_blocks_repeat_mappers": [
+    1,
+    1,
+    1,
+    1
+  ],
+  "down_num_layers_per_block": [
+    2,
+    6,
+    28,
+    6
   ],
   "dropout": [
     0,
     0,
     0.1,
     0.1
   ],
+  "effnet_in_channels": 16,
+  "in_channels": 4,
   "kernel_size": 3,
+  "num_attention_heads": [
+    0,
+    0,
     20,
     20
   ],
+  "out_channels": 4,
   "patch_size": 2,
+  "pixel_mapper_in_channels": 3,
   "self_attn": true,
   "switch_level": null,
+  "timestep_conditioning_type": [
     "sca"
+  ],
+  "timestep_ratio_embedding_dim": 64,
+  "up_blocks_repeat_mappers": [
+    3,
+    3,
+    2,
+    2
+  ],
+  "up_num_layers_per_block": [
+    6,
+    28,
+    6,
+    2
   ]
 }
decoder/diffusion_pytorch_model.bf16.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eaa05a862e2d50008426d7c5ffd08fb462cbf06f50098eea951d5a1e97ba0350
+size 3126071088
decoder/diffusion_pytorch_model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1f9575dfa6c2535ad65733d6257d17a7b1e1b54b7eafb251ce9556595f3bc0c9
-size 3126071088
 
 version https://git-lfs.github.com/spec/v1
+oid sha256:e428508e8c3654e778a1062836253a1794c775a68b75b3679743e9f20dbcef52
+size 6251952232
decoder_lite/config.json ADDED
@@ -0,0 +1,83 @@
+{
+  "_class_name": "StableCascadeUNet",
+  "_diffusers_version": "0.27.0.dev0",
+  "block_out_channels": [
+    320,
+    576,
+    1152,
+    1152
+  ],
+  "block_types_per_layer": [
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock"
+    ],
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock"
+    ],
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
+    ],
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
+    ]
+  ],
+  "clip_image_in_channels": null,
+  "clip_seq": 4,
+  "clip_text_in_channels": null,
+  "clip_text_pooled_in_channels": 1280,
+  "conditioning_dim": 1280,
+  "down_blocks_repeat_mappers": [
+    1,
+    1,
+    1,
+    1
+  ],
+  "down_num_layers_per_block": [
+    2,
+    4,
+    14,
+    4
+  ],
+  "dropout": [
+    0,
+    0,
+    0.1,
+    0.1
+  ],
+  "effnet_in_channels": 16,
+  "in_channels": 4,
+  "kernel_size": 3,
+  "num_attention_heads": [
+    0,
+    9,
+    18,
+    18
+  ],
+  "out_channels": 4,
+  "patch_size": 2,
+  "pixel_mapper_in_channels": 3,
+  "self_attn": true,
+  "switch_level": null,
+  "timestep_conditioning_type": [
+    "sca"
+  ],
+  "timestep_ratio_embedding_dim": 64,
+  "up_blocks_repeat_mappers": [
+    2,
+    2,
+    2,
+    2
+  ],
+  "up_num_layers_per_block": [
+    4,
+    14,
+    4,
+    2
+  ]
+}
decoder_lite/diffusion_pytorch_model.bf16.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cdf99f9972e67abfbffcd9146be83ae0fd7307789619cefa4cf4a54cd62181e6
+size 1399047416
decoder_lite/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7dcb3fc8b1e3f2a1756503472043a7d6810003a418b60b08674633b20b452ffb
+size 2797989648
model_index.json CHANGED
@@ -1,10 +1,9 @@
 {
   "_class_name": "StableCascadeDecoderPipeline",
-  "_diffusers_version": "0.26.0.dev0",
-  "_name_or_path": "StableCascade/",
   "decoder": [
-    "stable_cascade",
-    "StableCascadeUnet"
   ],
   "latent_dim_scale": 10.67,
   "scheduler": [
 
 {
   "_class_name": "StableCascadeDecoderPipeline",
+  "_diffusers_version": "0.27.0.dev0",
   "decoder": [
+    "diffusers",
+    "StableCascadeUNet"
   ],
   "latent_dim_scale": 10.67,
   "scheduler": [
scheduler/scheduler_config.json CHANGED
@@ -1,6 +1,6 @@
 {
   "_class_name": "DDPMWuerstchenScheduler",
-  "_diffusers_version": "0.26.0.dev0",
   "s": 0.008,
   "scaler": 1.0
 }
 
 {
   "_class_name": "DDPMWuerstchenScheduler",
+  "_diffusers_version": "0.27.0.dev0",
   "s": 0.008,
   "scaler": 1.0
 }
text_encoder/config.json CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "StableCascade/text_encoder",
   "architectures": [
     "CLIPTextModelWithProjection"
   ],
@@ -20,6 +20,6 @@
   "pad_token_id": 1,
   "projection_dim": 1280,
   "torch_dtype": "bfloat16",
-  "transformers_version": "4.38.0.dev0",
   "vocab_size": 49408
 }
 
 {
+  "_name_or_path": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
   "architectures": [
     "CLIPTextModelWithProjection"
   ],
   "pad_token_id": 1,
   "projection_dim": 1280,
   "torch_dtype": "bfloat16",
+  "transformers_version": "4.38.2",
   "vocab_size": 49408
 }
text_encoder/model.bf16.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:260e0127aca3c89db813637ae659ebb822cb07af71fedc16cbd980e9518dfdcd
+size 1389382688
tokenizer/tokenizer.json CHANGED
@@ -1,14 +1,7 @@
 {
   "version": "1.0",
   "truncation": null,
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 49407,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
   "added_tokens": [
     {
       "id": 49406,
 
 {
   "version": "1.0",
   "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 49406,
vqgan/config.json CHANGED
@@ -1,7 +1,7 @@
 {
   "_class_name": "PaellaVQModel",
-  "_diffusers_version": "0.26.0.dev0",
-  "_name_or_path": "StableCascade/vqgan",
   "bottleneck_blocks": 12,
   "embed_dim": 384,
   "in_channels": 3,
 
 {
   "_class_name": "PaellaVQModel",
+  "_diffusers_version": "0.27.0.dev0",
+  "_name_or_path": "warp-ai/wuerstchen",
   "bottleneck_blocks": 12,
   "embed_dim": 384,
   "in_channels": 3,
vqgan/diffusion_pytorch_model.bf16.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3ac32fab5177329dac907b2480c8c00aeefc712dfd92c2d52263a9c64b426b26
+size 36825828
vqgan/diffusion_pytorch_model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3ac32fab5177329dac907b2480c8c00aeefc712dfd92c2d52263a9c64b426b26
-size 36825828
 
 version https://git-lfs.github.com/spec/v1
+oid sha256:052db8852c0d8b117e6d2a59ae3e0c7d7aaae3d00f247e392ef8e9837e11d6c4
+size 73639568