Commit a02a1d6 by patrickvonplaten · 1 Parent(s): bc59aa9

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -21,7 +21,20 @@ tags:
  Amused is a lightweight text to image model based off of the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.

- Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder clip instead of t5. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes.

  ## 1. Usage
@@ -220,6 +233,7 @@ Additionally, amused uses the smaller CLIP as its text encoder instead of T5 com
  Flash attention is enabled by default in the diffusers codebase through torch `F.scaled_dot_product_attention`

  ### torch.compile
  To use torch.compile, simply wrap the transformer in torch.compile i.e.

  ```python
 
 
  Amused is a lightweight text to image model based off of the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.

+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
+
+ *The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists
+ of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a
+ U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The
+ proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. $\cos\left(r \cdot \frac{\pi}{2}\right)$
+ with $r \sim \text{Uniform}(0, 1)$. The model is trained via cross-entropy loss to predict the masked tokens.
+ After the model is trained on 256x256 images, downsampling and upsampling layers are added, and
+ training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text
+ encoder’s hidden states and iteratively predicts values for all masked tokens. The cosine masking
+ schedule determines a percentage of the most confident token predictions to be fixed after every
+ iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into
+ image pixels.*
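
The cosine masking schedule described in the caption can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the diffusers scheduler: the function names `training_mask_ratio` and `masked_tokens_per_step` are hypothetical, the inference schedule is assumed to evaluate the same cosine at evenly spaced steps, and rounding down to whole tokens is an assumption. The token count assumes "16x smaller" applies to each spatial dimension, so a 256x256 image becomes a 16x16 grid of latent tokens.

```python
import math
import random


def training_mask_ratio(rng: random.Random) -> float:
    """Sample the fraction of latent tokens to mask for one training example."""
    r = rng.random()  # r ~ Uniform(0, 1)
    return math.cos(r * math.pi / 2)


def masked_tokens_per_step(num_tokens: int, num_steps: int = 12) -> list[int]:
    """Number of tokens still masked after each inference iteration.

    Assumption: step k of num_steps uses r = k / num_steps in the cosine
    schedule, and the masked count is rounded down to a whole token.
    """
    counts = []
    for step in range(1, num_steps + 1):
        ratio = math.cos(step / num_steps * math.pi / 2)
        counts.append(math.floor(ratio * num_tokens))
    return counts


if __name__ == "__main__":
    # Assumed: 256x256 image at 16x downsampling per side -> 16 * 16 tokens.
    num_tokens = (256 // 16) ** 2
    print(masked_tokens_per_step(num_tokens))
```

Under these assumptions the masked count shrinks slowly at first and quickly near the end, reaching zero at the final iteration, which matches the caption's statement that all tokens have been predicted after 12 iterations.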
 
  ## 1. Usage
 
 
  Flash attention is enabled by default in the diffusers codebase through torch `F.scaled_dot_product_attention`
 
  ### torch.compile
+
  To use torch.compile, simply wrap the transformer in torch.compile i.e.
 
  ```python