kashif (HF staff) committed
Commit 9f41c55
Parent(s): af81976

Update README.md

Files changed (1): README.md (+8 / -9)
README.md CHANGED
@@ -10,20 +10,19 @@ tags:
 <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
 
 ## Würstchen - Overview
-Würstchen is diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
+Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
 computational costs for both training and inference by magnitudes. Training on 1024x1024 images, is way more expensive than training at 32x32. Usually, other works make
-use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through it's novel design, we achieve a 42x spatial
-compression. This was unseen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression already. Würstchen employs a
-two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
-A third model, Stage C, is learnt in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
+use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial
+compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a
+two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
+A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
 also cheaper and faster inference.
 
 ## Würstchen - Decoder
-The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image
-and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
+The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
 decodes the latents into pixel space. Together, they achieve a spatial compression of 42.
 
-**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, that are especially noticable to
+**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, which are especially noticeable to
 us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!
 
 ### Image Sizes
@@ -32,7 +31,7 @@ We also observed that the Prior (Stage C) adapts extremely fast to new resolutio
 <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/IfVsUDcP15OY-5wyLYKnQ.jpeg" width=1000>
 
 ## How to run
-This pipeline should be run together with a prior https://huggingface.co/warp-diffusion/wuerstchen-prior:
+This pipeline should be run together with a prior https://huggingface.co/warp-ai/wuerstchen-prior:
 
 ```py
 import torch
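```

The code block above is cut off in this commit view. For context, here is a minimal sketch of how the prior and decoder are typically chained, assuming the `WuerstchenPriorPipeline` and `WuerstchenDecoderPipeline` classes from `diffusers` and the `warp-ai/wuerstchen` checkpoints; treat the prompt and arguments as illustrative, not as the README's verbatim example:

```py
import torch
from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

device = "cuda"
dtype = torch.float16

# Stage C (the Prior): turns a text prompt into highly compressed image embeddings.
# At 1024x1024, the 42x spatial compression means the prior works on latents of
# roughly 24x24 (1024 / 42 ≈ 24).
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", torch_dtype=dtype
).to(device)

# Stages A + B (this Decoder): turns those embeddings back into pixel space.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=dtype
).to(device)

caption = "Anthropomorphic cat dressed as a firefighter"

# Generate image embeddings with the prior, then decode them to an image.
prior_output = prior_pipeline(prompt=caption, height=1024, width=1024)
images = decoder_pipeline(
    image_embeddings=prior_output.image_embeddings, prompt=caption
).images
images[0].save("wuerstchen_sample.png")
```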