---
license: mit
library_name: diffusers
---

# Stage-A-ft-HQ

`stage-a-ft-hq` is a version of [Würstchen](https://huggingface.co/warp-ai/wuerstchen)'s **Stage A** that was finetuned to have slightly-nicer-looking textures.

`stage-a-ft-hq` works with any Würstchen-derived model (including [Stable Cascade](https://huggingface.co/stabilityai/stable-cascade)).

## Example comparison

| Stable Cascade                    | Stable Cascade + `stage-a-ft-hq`   |
| --------------------------------- | ---------------------------------- |
| ![](example_baseline.png)         | ![](example_finetuned.png)         |
| ![](example_baseline_closeup.png) | ![](example_finetuned_closeup.png) |

## Explanation

Image generators like Würstchen and Stable Cascade create images via a multi-stage process.
Stage A is the final stage, responsible for rendering full-resolution, human-interpretable images from the output of the prior stages.

The original Stage A tends to render slightly-smoothed-out images with a distinctive noise pattern on top.

`stage-a-ft-hq` was finetuned briefly on a high-quality dataset in order to reduce these artifacts.

## Suggested Settings

To generate highly detailed images, you probably want to use `stage-a-ft-hq` (which improves very fine detail) in combination with a large Stage B step count (which [improves mid-level detail](https://old.reddit.com/r/StableDiffusion/comments/1ar359h/cascade_can_generate_directly_at_1536x1536_and/kqhjtk5/)).

## ComfyUI Usage

Download the file [`stage_a_ft_hq.safetensors`](https://huggingface.co/madebyollin/stage-a-ft-hq/resolve/main/stage_a_ft_hq.safetensors?download=true), put it in `ComfyUI/models/vae`, and make sure your VAE Loader node is loading this file.

(`stage_a_ft_hq.safetensors` includes the [special key](https://github.com/comfyanonymous/ComfyUI/blob/d91f45ef280a5acbdc22f3cc757f8fdbb254261b/comfy/sd.py#L181) that ComfyUI uses to auto-identify Stage A model files.)
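
If you'd rather script the download, here is a minimal sketch using only the Python standard library. The helper names (`vae_destination`, `download_stage_a`) are made up for illustration, and the sketch assumes your ComfyUI checkout uses the default `models/vae` layout described above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Download URL and filename taken from the instructions above
MODEL_URL = "https://huggingface.co/madebyollin/stage-a-ft-hq/resolve/main/stage_a_ft_hq.safetensors?download=true"
FILENAME = "stage_a_ft_hq.safetensors"


def vae_destination(comfyui_root):
    """Return the path where ComfyUI's VAE Loader is expected to find the file."""
    return Path(comfyui_root) / "models" / "vae" / FILENAME


def download_stage_a(comfyui_root):
    """Download the checkpoint into ComfyUI/models/vae, skipping if already present."""
    dest = vae_destination(comfyui_root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(MODEL_URL, str(dest))
    return dest


# Example (hypothetical path; point this at your own ComfyUI checkout):
# download_stage_a("ComfyUI")
```

After the download finishes, you still need to select the file in your VAE Loader node.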

## 🧨 Diffusers Usage

⚠️ As of 2024-02-17, the [PR](https://github.com/huggingface/diffusers/pull/6487) adding Stable Cascade support to Diffusers is still under review.
I've only tested Stable Cascade with this particular version of the PR:

```bash
pip install --upgrade --force-reinstall https://github.com/kashif/diffusers/archive/a3dc21385b7386beb3dab3a9845962ede6765887.zip
```

```py
import torch
device = "cuda"

# Load the Stage-A-ft-HQ model
from diffusers.pipelines.wuerstchen import PaellaVQModel
stage_a_ft_hq = PaellaVQModel.from_pretrained("madebyollin/stage-a-ft-hq", torch_dtype=torch.float16).to(device)

# Load the normal Stable Cascade pipeline
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

num_images_per_prompt = 1

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

# Swap in the Stage-A-ft-HQ model
decoder.vqgan = stage_a_ft_hq

prompt = "Photograph of Seattle streets on a snowy winter morning"
negative_prompt = ""

prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=20
)
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=20
).images

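# `display` is only available in Jupyter/IPython notebooks;
# in a plain script, use decoder_output[0].save("output.png") instead
display(decoder_output[0])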
```
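
To follow the suggested settings, you can raise the step count passed to the decoder pipeline. The sketch below assumes that the decoder pipeline's `num_inference_steps` is what controls the Stage B step count, and that `decoder`, `prior_output`, and `prompt` are the objects from the example above. The helper name and the default of 40 steps are illustrative choices, not tested recommendations.

```python
def decode_with_more_stage_b_steps(decoder, prior_output, prompt,
                                   negative_prompt="", num_inference_steps=40):
    """Run the decoder with a larger step count than the 20 used above.

    The default of 40 is an arbitrary illustrative value, not a tuned setting.
    """
    return decoder(
        image_embeddings=prior_output.image_embeddings.half(),
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        output_type="pil",
        num_inference_steps=num_inference_steps,
    ).images


# Example, reusing the objects from the pipeline code above:
# images = decode_with_more_stage_b_steps(decoder, prior_output, prompt)
```

More steps means slower decoding, so it may be worth comparing a few values on your own prompts.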