kashif (HF staff) committed
Commit 9f41c55
Parent(s): af81976

Update README.md

Files changed (1): README.md (+8 / -9)
README.md CHANGED
@@ -10,20 +10,19 @@ tags:
 <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
 
 ## Würstchen - Overview
-Würstchen is diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
+Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
 computational costs for both training and inference by magnitudes. Training on 1024x1024 images, is way more expensive than training at 32x32. Usually, other works make
-use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through it's novel design, we achieve a 42x spatial
-compression. This was unseen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression already. Würstchen employs a
-two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
-A third model, Stage C, is learnt in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
+use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial
+compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a
+two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
+A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
 also cheaper and faster inference.
 
 ## Würstchen - Decoder
-The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image
-and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
+The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
 decodes the latents into pixel space. Together, they achieve a spatial compression of 42.
 
-**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, that are especially noticable to
+**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, which are especially noticeable to
 us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!
 
 ### Image Sizes
@@ -32,7 +31,7 @@ We also observed that the Prior (Stage C) adapts extremely fast to new resolutio
 <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/IfVsUDcP15OY-5wyLYKnQ.jpeg" width=1000>
 
 ## How to run
-This pipeline should be run together with a prior https://huggingface.co/warp-diffusion/wuerstchen-prior:
+This pipeline should be run together with a prior https://huggingface.co/warp-ai/wuerstchen-prior:
 
 ```py
 import torch
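```

The code block above is cut off in this commit view. For context, here is a minimal sketch of how the prior and decoder are typically chained, assuming the `WuerstchenPriorPipeline` and `WuerstchenDecoderPipeline` classes from `diffusers` and the `warp-ai/wuerstchen` checkpoints; treat the prompt and arguments as illustrative, not as the README's verbatim example:

```py
import torch
from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

device = "cuda"
dtype = torch.float16

# Stage C (the Prior): turns a text prompt into highly compressed image embeddings.
# At 1024x1024, the 42x spatial compression means the prior works on latents of
# roughly 24x24 (1024 / 42 ≈ 24).
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", torch_dtype=dtype
).to(device)

# Stages A + B (this Decoder): turns those embeddings back into pixel space.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=dtype
).to(device)

caption = "Anthropomorphic cat dressed as a firefighter"

# Generate image embeddings with the prior, then decode them to an image.
prior_output = prior_pipeline(prompt=caption, height=1024, width=1024)
images = decoder_pipeline(
    image_embeddings=prior_output.image_embeddings, prompt=caption
).images
images[0].save("wuerstchen_sample.png")
```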