diff --git a/.gitattributes b/.gitattributes
index a6344aac8c09253b3b630fb776ae94478aa0275b..4888a0050d431ce5f2cc27800d3d58bf6945b2db 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+training/A[[:space:]]woman[[:space:]]working[[:space:]]on[[:space:]]a[[:space:]]laptop[[:space:]]in[[:space:]]\[V\][[:space:]]style.jpg filter=lfs diff=lfs merge=lfs -text
+assets/collage_small.png filter=lfs diff=lfs merge=lfs -text
+assets/collage_full.png filter=lfs diff=lfs merge=lfs -text
diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000000000000000000000000000000000000..75623052260f3afaeb6fa684a1db74582d98b3dd
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,24 @@
+cff-version: 1.2.0
+title: 'Amused: An open MUSE model'
+message: >-
+ If you use this software, please cite it using the
+ metadata from this file.
+type: software
+authors:
+ - given-names: Suraj
+ family-names: Patil
+ - given-names: William
+ family-names: Berman
+ - given-names: Patrick
+ family-names: von Platen
+repository-code: 'https://github.com/huggingface/amused'
+keywords:
+ - deep-learning
+ - pytorch
+ - image-generation
+ - text2image
+ - image2image
+ - language-modeling
+ - masked-language-modeling
+license: Apache-2.0
+version: 0.12.1
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c3408e53c2fa2ba9f71adbf546dfcf33164e7cbc
--- /dev/null
+++ b/README.md
@@ -0,0 +1,577 @@
+# amused
+
+![collage](./assets/collage_small.png)
+Images cherry-picked from the 512 and 256 models. Images are degraded to load faster. See ./assets/collage_full.png for the originals.
+
+[[Paper - TODO]]()
+
+| Model | Params |
+|-------|--------|
+| [amused-256](https://huggingface.co/huggingface/amused-256) | 603M |
+| [amused-512](https://huggingface.co/huggingface/amused-512) | 608M |
+
+Amused is a lightweight text-to-image model based on the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. It is particularly useful in applications that require a lightweight and fast model, such as generating many images at once.
+
+Amused is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller CLIP text encoder instead of T5. Because of its small parameter count and its generation process with few forward passes, amused can generate many images quickly. This benefit is most visible at larger batch sizes.
+
+## 1. Usage
+
+### Text to image
+
+#### 256x256 model
+
+```python
+import torch
+from diffusers import AmusedPipeline
+
+pipe = AmusedPipeline.from_pretrained(
+ "huggingface/amused-256", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "cowboy"
+image = pipe(prompt, generator=torch.Generator('cuda').manual_seed(8)).images[0]
+image.save('text2image_256.png')
+```
+
+![text2image_256](./assets/text2image_256.png)
+
+#### 512x512 model
+
+```python
+import torch
+from diffusers import AmusedPipeline
+
+pipe = AmusedPipeline.from_pretrained(
+ "huggingface/amused-512", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "summer in the mountains"
+image = pipe(prompt, generator=torch.Generator('cuda').manual_seed(2)).images[0]
+image.save('text2image_512.png')
+```
+
+![text2image_512](./assets/text2image_512.png)
+
+### Image to image
+
+#### 256x256 model
+
+```python
+import torch
+from diffusers import AmusedImg2ImgPipeline
+from diffusers.utils import load_image
+
+pipe = AmusedImg2ImgPipeline.from_pretrained(
+ "huggingface/amused-256", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "apple watercolor"
+input_image = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/image2image_256_orig.png"
+ )
+ .resize((256, 256))
+ .convert("RGB")
+)
+
+image = pipe(prompt, input_image, strength=0.7, generator=torch.Generator('cuda').manual_seed(3)).images[0]
+image.save('image2image_256.png')
+```
+
+![image2image_256_orig](./assets/image2image_256_orig.png) ![image2image_256](./assets/image2image_256.png)
+
+#### 512x512 model
+
+```python
+import torch
+from diffusers import AmusedImg2ImgPipeline
+from diffusers.utils import load_image
+
+pipe = AmusedImg2ImgPipeline.from_pretrained(
+ "huggingface/amused-512", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "winter mountains"
+input_image = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/image2image_512_orig.png"
+ )
+ .resize((512, 512))
+ .convert("RGB")
+)
+
+image = pipe(prompt, input_image, generator=torch.Generator('cuda').manual_seed(15)).images[0]
+image.save('image2image_512.png')
+```
+
+![image2image_512_orig](./assets/image2image_512_orig.png) ![image2image_512](./assets/image2image_512.png)
+
+### Inpainting
+
+#### 256x256 model
+
+```python
+import torch
+from diffusers import AmusedInpaintPipeline
+from diffusers.utils import load_image
+
+pipe = AmusedInpaintPipeline.from_pretrained(
+ "huggingface/amused-256", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "a man with glasses"
+input_image = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/inpainting_256_orig.png"
+ )
+ .resize((256, 256))
+ .convert("RGB")
+)
+mask = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/inpainting_256_mask.png"
+ )
+ .resize((256, 256))
+ .convert("L")
+)
+
+for seed in range(20):
+ image = pipe(prompt, input_image, mask, generator=torch.Generator('cuda').manual_seed(seed)).images[0]
+ image.save(f'inpainting_256_{seed}.png')
+
+```
+
+![inpainting_256_orig](./assets/inpainting_256_orig.png) ![inpainting_256_mask](./assets/inpainting_256_mask.png) ![inpainting_256](./assets/inpainting_256.png)
+
+#### 512x512 model
+
+```python
+import torch
+from diffusers import AmusedInpaintPipeline
+from diffusers.utils import load_image
+
+pipe = AmusedInpaintPipeline.from_pretrained(
+ "huggingface/amused-512", variant="fp16", torch_dtype=torch.float16
+)
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "fall mountains"
+input_image = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/inpainting_512_orig.jpeg"
+ )
+ .resize((512, 512))
+ .convert("RGB")
+)
+mask = (
+ load_image(
+ "https://raw.githubusercontent.com/huggingface/amused/main/assets/inpainting_512_mask.png"
+ )
+ .resize((512, 512))
+ .convert("L")
+)
+image = pipe(prompt, input_image, mask, generator=torch.Generator('cuda').manual_seed(0)).images[0]
+image.save('inpainting_512.png')
+```
+
+![inpainting_512_orig](./assets/inpainting_512_orig.jpeg)
+![inpainting_512_mask](./assets/inpainting_512_mask.png)
+![inpainting_512](./assets/inpainting_512.png)
+
+## 2. Performance
+
+Amused inherits performance benefits from the original [muse](https://arxiv.org/pdf/2301.00704.pdf).
+
+1. Parallel decoding: The model follows a denoising schedule that aims to unmask some percent of tokens at each denoising step. At each step, all masked tokens are predicted, and the tokens that the network is most confident about are unmasked. Because multiple tokens are predicted at once, we can generate a full 256x256 or 512x512 image in around 12 steps. In comparison, an autoregressive model must predict a single token at a time. Note that with the 16x downsampling VAE that muse uses, a 256x256 image is represented by a 16x16 grid, i.e. 256 tokens.
+
+2. Fewer sampling steps: Compared to many diffusion models, muse requires fewer sampling steps.
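The parallel decoding loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the amused implementation: `fake_model` is a hypothetical stand-in for the transformer that returns random tokens and confidences, the cosine masking schedule is an assumption, and `MASK` and the `vocab_size` default are placeholder values.

```python
import math
import random

MASK = -1  # placeholder id for a masked token (hypothetical)


def num_tokens(image_size: int, downsample: int = 16) -> int:
    """Number of latent tokens for a square image and a VAE downsampling factor."""
    side = image_size // downsample
    return side * side


def fake_model(tokens, vocab_size, rng):
    """Stand-in for the transformer: one (token_id, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(n_tokens, steps=12, vocab_size=8192, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * n_tokens  # start from a fully masked image
    for step in range(1, steps + 1):
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = math.floor(n_tokens * math.cos(math.pi / 2 * step / steps))
        predictions = fake_model(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Unmask the positions the "model" is most confident about.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = predictions[i][0]
    return tokens


decoded = parallel_decode(num_tokens(256))
```

Every step predicts all masked positions at once, so the loop runs 12 times regardless of whether the image has 256 tokens (256x256) or 1024 tokens (512x512).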
+
+Additionally, in contrast with muse, amused uses the smaller CLIP as its text encoder instead of T5. Amused is also smaller, with ~600M params compared to the largest 3B param muse model. Note that, being smaller, amused produces comparatively lower quality results.
+
+![a100_bs_1](./assets/a100_bs_1.png)
+![a100_bs_8](./assets/a100_bs_8.png)
+![4090_bs_1](./assets/4090_bs_1.png)
+![4090_bs_8](./assets/4090_bs_8.png)
+
+### Muse performance knobs
+
+| | Uncompiled Transformer + regular attention (ms) | Uncompiled Transformer + flash attention (ms) | Compiled Transformer (ms) | Speed Up (compiled vs. flash attention) |
+|---------------------|--------------------------------------------|-------------------------|----------------------|----------|
+| 256 Batch Size 1 | 594.7 | 507.7 | 212.1 | 58% |
+| 512 Batch Size 1 | 637 | 547 | 249.9 | 54% |
+| 256 Batch Size 8 | 719 | 628.6 | 427.8 | 32% |
+| 512 Batch Size 8 | 1000 | 917.7 | 703.6 | 23% |
+
+Flash attention is enabled by default in the diffusers codebase through `torch.nn.functional.scaled_dot_product_attention`.
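The speed-up percentages in the table above are consistent with the latency reduction of the compiled transformer relative to the uncompiled transformer with flash attention. That reading is inferred from the numbers rather than stated in the README; the short check below reproduces the column under that assumption.

```python
# Latencies in milliseconds, copied from the table above:
# (uncompiled transformer + flash attention, compiled transformer)
latencies_ms = {
    "256 Batch Size 1": (507.7, 212.1),
    "512 Batch Size 1": (547.0, 249.9),
    "256 Batch Size 8": (628.6, 427.8),
    "512 Batch Size 8": (917.7, 703.6),
}


def latency_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage of the baseline latency removed by the optimization."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


reductions = {
    name: round(latency_reduction(flash, compiled))
    for name, (flash, compiled) in latencies_ms.items()
}
print(reductions)
```

The rounded values come out to 58, 54, 32 and 23 percent, matching the table.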
+
+### torch.compile
+To use torch.compile, wrap the transformer in `torch.compile`, i.e.
+
+```python
+pipe.transformer = torch.compile(pipe.transformer)
+```
+
+Full snippet:
+
+```python
+import torch
+from diffusers import AmusedPipeline
+
+pipe = AmusedPipeline.from_pretrained(
+ "huggingface/amused-256", variant="fp16", torch_dtype=torch.float16
+)
+
+# HERE use torch.compile
+pipe.transformer = torch.compile(pipe.transformer)
+
+pipe.vqvae.to(torch.float32) # vqvae is producing nans in fp16
+pipe = pipe.to("cuda")
+
+prompt = "cowboy"
+image = pipe(prompt, generator=torch.Generator('cuda').manual_seed(8)).images[0]
+image.save('text2image_256.png')
+```
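When comparing a compiled pipeline against an uncompiled one, the first calls of a compiled module include compilation time, so they should be excluded from the measurement. The helper below is a generic sketch of that idea and is not part of this repository; `benchmark`, `warmup` and `runs` are illustrative names.

```python
import statistics
import time


def benchmark(fn, warmup=2, runs=5):
    """Return the median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-time costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example with a cheap CPU-only workload standing in for a pipeline call.
elapsed_ms = benchmark(lambda: sum(range(10_000)))
```

With the snippet above, one might call `benchmark(lambda: pipe(prompt))`. For code that only launches GPU kernels without copying results back, `torch.cuda.synchronize()` should be called inside the timed function so that asynchronous work is included.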
+
+## 3. Training
+
+Amused can be finetuned on simple datasets relatively cheaply and quickly. Using 8-bit optimizers, LoRA, and gradient accumulation, amused can be finetuned with as little as 5.5 GB of memory. Here is a set of examples for finetuning amused on some relatively simple datasets. These training recipes are aggressively oriented towards minimal resources and fast verification, i.e. the batch sizes are quite low and the learning rates are quite high. For optimal quality, you will probably want to increase the batch sizes and decrease the learning rates.
+
+All training examples use fp16 mixed precision and gradient checkpointing. We don't show 8-bit Adam + LoRA, as its memory use is about the same as with LoRA alone (bitsandbytes uses full precision optimizer states for weights below a minimum size).
+
+### Finetuning the 256 checkpoint
+
+These examples finetune on this [nouns](https://huggingface.co/datasets/m1guelpf/nouns) dataset.
+
+Example results:
+
+![noun1](./assets/noun1.png) ![noun2](./assets/noun2.png) ![noun3](./assets/noun3.png)
+
+#### Full finetuning
+
+Batch size: 8, Learning rate: 1e-4. Gives decent results in 750-1000 steps.
+
+| Batch Size | Gradient Accumulation Steps | Effective Total Batch Size | Memory Used |
+|------------|-----------------------------|------------------|-------------|
+| 8 | 1 | 8 | 19.7 GB |
+| 4 | 2 | 8 | 18.3 GB |
+| 1 | 8 | 8 | 17.9 GB |
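All three rows of the table above reach the same effective batch size of 8 (batch size times gradient accumulation steps) while using less memory as the per-step batch shrinks. The toy example below illustrates why this works for a mean loss: averaging the gradients of equally sized micro-batches gives the full-batch gradient. It uses a hypothetical one-parameter linear model in plain Python and ignores effects such as mixed precision rounding.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch_size):
    """Average the gradients of equally sized micro-batches (gradient accumulation)."""
    steps = len(xs) // batch_size
    total = 0.0
    for s in range(steps):
        lo, hi = s * batch_size, (s + 1) * batch_size
        # Each micro-batch gradient is scaled by 1 / steps before being summed.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)                            # batch size 8, 1 accumulation step
acc_4x2 = accumulated_grad(w, xs, ys, batch_size=4)   # batch size 4, 2 accumulation steps
acc_1x8 = accumulated_grad(w, xs, ys, batch_size=1)   # batch size 1, 8 accumulation steps
```

Up to floating point error, `full`, `acc_4x2` and `acc_1x8` are equal, so the choice between the rows is a trade-off between memory and the number of forward and backward passes per optimizer step.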
+
+```sh
+accelerate launch training/training.py \
+ --output_dir