---
pipeline_tag: text-to-image
license: other
license_name: stable-cascade-nc-community
license_link: LICENSE
language:
- fa
metrics:
- bleu
library_name: bertopic
tags:
- art
---

# Stable Cascade

<!-- Provide a quick summary of what the model is/does. -->
<img src="figures/collage_1.jpg" width="800">

This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main
difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this
important? The smaller the latent space, the **faster** you can run inference and the **cheaper** training becomes.
How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is
encoded to 128x128. Stable Cascade achieves a compression factor of 42, which makes it possible to encode a
1024x1024 image to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this
highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
Diffusion 1.5. <br> <br>
This kind of model is therefore well suited to use cases where efficiency is important. Furthermore, all known
extensions such as finetuning, LoRA, ControlNet, IP-Adapter and LCM are possible with this method as well.

## Model Details

### Model Description

Stable Cascade is a diffusion model trained to generate images given a text prompt.

- **Developed by:** Stability AI
- **Funded by:** Stability AI
- **Model type:** Generative text-to-image model

### Model Sources

For research purposes, we recommend our `StableCascade` GitHub repository (https://github.com/Stability-AI/StableCascade).

- **Repository:** https://github.com/Stability-AI/StableCascade
- **Paper:** https://openreview.net/forum?id=gU58d5QeGv

### Model Overview
Stable Cascade consists of three models, Stage A, Stage B and Stage C, which form a cascade for generating images,
hence the name "Stable Cascade".
Stages A and B are used to compress images, much as the VAE does in Stable Diffusion.
However, this setup achieves a much higher compression of images. While the Stable Diffusion models use a
spatial compression factor of 8, encoding a 1024 x 1024 image to 128 x 128, Stable Cascade achieves
a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24 while still allowing the image to be decoded
accurately. The great benefit is cheaper training and inference. Stage C, in turn, is responsible
for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.

<img src="figures/model-overview.jpg" width="600">
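The compression factors described above can be checked with a little arithmetic. The snippet below is only an illustrative sketch: the function name `latent_side` is hypothetical, and rounding the latent side down to a whole number is an assumption made here to reproduce the 24 x 24 figure. It compares spatial positions only and ignores the number of latent channels.

```python
# Illustrative arithmetic only; the factors 8 and 42 come from the text above.
def latent_side(image_side: int, compression_factor: int) -> int:
    """Approximate latent side length, rounded down (an assumption of this sketch)."""
    return image_side // compression_factor


sd_side = latent_side(1024, 8)        # Stable Diffusion: 128
cascade_side = latent_side(1024, 42)  # Stable Cascade: 24

# Ratio of spatial positions in the two latent grids (channels not considered).
position_ratio = (sd_side ** 2) / (cascade_side ** 2)

print(f"Stable Diffusion latent: {sd_side} x {sd_side}")
print(f"Stable Cascade latent:   {cascade_side} x {cascade_side}")
print(f"Spatial positions ratio: {position_ratio:.1f}x")
```

Under these assumptions, the Stable Cascade latent grid has roughly 28 times fewer spatial positions than the Stable Diffusion one for a 1024 x 1024 image.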

For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in
a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
put into its finetuning. The two versions of Stage B have 700 million and 1.5 billion parameters. Both achieve
great results; however, the 1.5 billion version excels at reconstructing small and fine details. Therefore, you will
achieve the best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is
fixed due to its small size.
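To get a feel for what the checkpoint choices mean in practice, the following sketch adds up the parameter counts quoted above and estimates the size of the weights alone. It assumes 16-bit weights (2 bytes per parameter) and leaves out the text encoder, activations and any framework overhead, so treat the numbers as a rough lower bound rather than a measured memory requirement. All names in the snippet are hypothetical.

```python
# Parameter counts as stated in the text above.
STAGE_C = {"small": 1_000_000_000, "large": 3_600_000_000}
STAGE_B = {"small": 700_000_000, "large": 1_500_000_000}
STAGE_A = 20_000_000


def total_params(variant: str) -> int:
    """Sum the parameters of Stage C, Stage B and Stage A for one variant."""
    return STAGE_C[variant] + STAGE_B[variant] + STAGE_A


def weight_gigabytes(params: int, bytes_per_param: int = 2) -> float:
    """Weights-only size in GB, assuming 16-bit (2-byte) parameters."""
    return params * bytes_per_param / 1e9


for variant in ("small", "large"):
    params = total_params(variant)
    print(f"{variant}: {params / 1e9:.2f}B parameters, "
          f"about {weight_gigabytes(params):.2f} GB of weights")
```

With the larger variants, this comes to about 5.12 billion parameters in total, compared with about 1.72 billion for the smaller ones.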

## Evaluation
<img height="300" src="figures/comparison.png"/>

According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all 
comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and 
aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference 
steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).

## Code Example

**⚠️ Important**: For the code below to work, you have to install `diffusers` from the following branch while the PR is still a work in progress.

```shell
pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
```

```python
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

# Run on a CUDA GPU and generate two images for the prompt.
device = "cuda"
num_images_per_prompt = 2

# The prior pipeline (Stage C) is loaded in bfloat16, the decoder pipeline (Stages B and A) in float16.
prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

prompt = "Anthropomorphic cat dressed as a pilot"
negative_prompt = ""

# Stage C: generate the compressed image embeddings from the text prompt.
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=20
)
# Decoder: turn the embeddings into images. The embeddings are cast to
# float16 with .half() to match the decoder's dtype.
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images

# Now decoder_output is a list with your PIL images
```
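
Since the example above ends with a list of PIL images, a small helper for writing them to disk may be useful. This is a minimal sketch, not part of the official example: the function name `save_images`, the `outputs` directory and the file name prefix are all hypothetical choices. It only relies on each image having a `save(path)` method, which PIL images provide.

```python
from pathlib import Path


def save_images(images, out_dir="outputs", prefix="stable_cascade"):
    """Save each image via its .save(path) method and return the written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved = []
    for index, image in enumerate(images):
        target = out_path / f"{prefix}_{index:02d}.png"
        image.save(target)
        saved.append(target)
    return saved
```

With the variables from the example above, calling `save_images(decoder_output)` would write one PNG file per generated image into the `outputs` folder.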

## Uses

### Direct Use

The model is intended for research purposes for now. Possible research areas and tasks include:

- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.

Excluded uses are described below.

### Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events,
so using it to generate such content is out of scope for its abilities.
The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

## Limitations and Bias

### Limitations
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.


### Recommendations

The model is intended for research purposes only.

## How to Get Started with the Model

Check out https://github.com/Stability-AI/StableCascade