---
license: mit
---

![](https://user-images.githubusercontent.com/61938694/231021615-38df0a0a-d97e-4f7a-99d9-99952357b4b1.png)
## Paella
We are releasing a new Paella model that builds on top of our initial paper https://arxiv.org/abs/2211.07292.
Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and diffusion models.
Since the paper release we have worked intensively to bring Paella to a level similar to other state-of-the-art models,
and with this release we are coming a step closer to that goal. However, our main intention is not to build the greatest
text-to-image model out there (at least for now); it is to make text-to-image models technically approachable for people
outside the field. Many models, for example, have codebases with many thousands of lines of code, which makes it pretty
hard to dive into the code and understand it easily. That is the contribution we are most proud of with Paella: the
training and sampling code is minimalistic and can be understood in a few minutes, making further extensions, quick
tests, idea testing etc. extremely fast. For instance, the entire sampling code can be written in just **12 lines** of code.

### How does Paella work?
Paella works in a quantized latent space, just like Stable Diffusion, to reduce the computational power needed.
Images are encoded into a smaller latent space and converted into visual tokens of shape *h x w*. During training,
these visual tokens are noised by replacing a random fraction of them with other tokens randomly selected from the
codebook of the VQGAN. The noised image is given to the model, along with a timestep and the conditioning
information, which in our case is text. The model is tasked to predict the un-noised version of the tokens.
And that's it: the model is optimized with the cross-entropy loss between the original tokens and the predicted tokens.
The noise schedule during training is simply linear, meaning that we uniformly sample a percentage between 0% and 100%
and noise that fraction of the tokens.<br><br>

<figure>
<img src="https://user-images.githubusercontent.com/61938694/231248435-d21170c1-57b4-4a8f-90a6-62cf3e7effcd.png" width="400">
<figcaption>Images are noised and then fed to the model during training.</figcaption>
</figure>

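To make the training step concrete, here is a minimal sketch of what one step can look like, following the description
above. The `vqgan.encode` call, the tensor layout and the exact signature of `model` are placeholder assumptions for
illustration, not the interface of the released code:
```python
import torch
import torch.nn.functional as F

def train_step(model, vqgan, images, text_embeddings, num_labels):
    # Encode the images into discrete visual tokens of shape (batch, h, w).
    tokens = vqgan.encode(images)                        # placeholder encoder call

    # Linear noise schedule: draw a noise level t in [0, 1] per sample ...
    t = torch.rand(tokens.size(0), device=tokens.device)

    # ... and replace roughly t*100% of the tokens with random codebook entries.
    mask = torch.rand(tokens.shape, device=tokens.device) < t[:, None, None]
    random_tokens = torch.randint_like(tokens, num_labels)
    noised_tokens = torch.where(mask, random_tokens, tokens)

    # The model predicts logits over the codebook for every position ...
    logits = model(noised_tokens, t, text_embeddings)    # (batch, num_labels, h, w)

    # ... and is optimized with plain cross-entropy against the original tokens.
    return F.cross_entropy(logits, tokens)
```
Dropping the conditioning for a fraction of the training samples (see the conditioning percentages under Technical
Details) is what later enables the classifier-free guidance term in the sampling code below.
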
Sampling is also extremely simple: we start with an image made entirely of random tokens. Then we feed the latent image,
the timestep and the condition into the model and let it predict the final image. The model outputs a distribution
over the codebook for every token, from which we sample with standard multinomial sampling.
Since there are countless ways the result could look, a single step only produces very basic
shapes without any details. That is why we add noise to the image again and feed it back to the model, and we repeat
this process a number of times, adding less noise each time, to slowly arrive at the final image.
You can see how images emerge [here](https://user-images.githubusercontent.com/61938694/231252449-d9ac4d15-15ef-4aed-a0de-91fa8746a415.png).<br>
The following is the entire sampling code needed to generate images:
```python
def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
    with torch.inference_mode():
        # Start from an image made entirely of random tokens.
        sampled = torch.randint(0, model.num_labels, size=latent_shape)
        initial_noise = sampled.clone()
        timesteps = torch.linspace(1.0, 0.0, steps + 1)
        temperatures = torch.linspace(temperature[0], temperature[1], steps)
        for i, t in enumerate(timesteps[:steps]):
            t = torch.ones(latent_shape[0]) * t

            # Predict the un-noised tokens, with classifier-free guidance.
            logits = model(sampled, t, **model_inputs)
            if cfg:
                logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1 - cfg)
            # Sample every token from the predicted distribution over the codebook.
            sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
            sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])

            # Renoise the prediction and feed it back in for the next step.
            if i < renoise_steps:
                t_next = torch.ones(latent_shape[0]) * timesteps[i + 1]
                sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
    return sampled
```

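For orientation, calling this function might look roughly like the sketch below; the conditioning key names, the
embedding tensors and the `vqgan.decode` call are assumptions for illustration, not the exact interface of the
released checkpoints:
```python
# A 256px image with an f4 VQGAN corresponds to a 64x64 grid of visual tokens.
latent_shape = (4, 64, 64)  # batch of 4 images

tokens = sample(
    model_inputs={"byt5": text_embeddings},                # assumed key name
    latent_shape=latent_shape,
    unconditional_inputs={"byt5": empty_text_embeddings},  # assumed key name
)
images = vqgan.decode(tokens)  # map the sampled tokens back to pixels (placeholder decoder call)
```
Note that renoising passes `initial_noise` as `random_x`, which suggests the replacement tokens come from the same
initial random image at every step rather than from freshly drawn noise.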

### Results
<img src="https://user-images.githubusercontent.com/61938694/231598512-2410c172-5a9d-43f4-947c-6ff7eaee77e7.png">
Since Paella is also conditioned on CLIP image embeddings, the following things are also possible:<br><br>
<img src="https://user-images.githubusercontent.com/61938694/231278319-16551a8d-bfd1-49c9-b604-c6da3955a6d4.png">
<img src="https://user-images.githubusercontent.com/61938694/231287637-acd0b9b2-90c7-4518-9b9e-d7edefc6c3af.png">
<img src="https://user-images.githubusercontent.com/61938694/231287119-42fe496b-e737-4dc5-8e53-613bdba149da.png">

### Technical Details
Model Architecture: U-Net (Mix of....) <br>
Dataset: Laion-A, Laion Aesthetic > 6.0 <br>
Training Steps: 1.3M <br>
Batch Size: 2048 <br>
Resolution: 256 <br>
VQGAN Compression: f4 <br>
Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%) <br>
Optimizer: AdamW <br>
Hardware: 128 A100 @ 80GB <br>
Training Time: ~3 weeks <br>
Learning Rate: 1e-4 <br>
More details on the approach, training and sampling can be found in the paper and on GitHub.

### Paper, Code Release
Paper: https://arxiv.org/abs/2211.07292 <br>
Code: https://github.com/dome272/Paella <br>

### Goal
So you see, there are no heavy math formulas or theorems needed to achieve good sampling quality. Moreover, none of
the constants such as alpha, beta, alpha_cum_prod etc. that appear in diffusion models are necessary. This makes the
method really suitable for people new to the field of generative AI. We hope we can set the foundation for further
research in this direction and contribute to a world where AI is accessible to and can be understood by everyone.

### Limitations & Conclusion
There are still many things to improve for Paella to get on par with standard diffusion models, or even to outperform
them. One primary thing we notice is that even though we only condition the model on CLIP image embeddings 10% of the
time, during inference the model relies heavily on the image embeddings generated by a prior model (which maps CLIP text
embeddings to image embeddings, as proposed in DALL-E 2). We counteract this by decreasing the importance of the image
embeddings through reweighting the attention scores; a generic sketch of what such a reweighting can look like follows
this paragraph. There is probably a way to avoid this already during training. Other limitations such as weak
composition, poor text depiction, unawareness of some concepts etc. could also be reduced by training for longer.
As a reference, Paella has so far only seen about as many images as SD 1.4.

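Below is a generic illustration of what downweighting a subset of conditioning tokens in cross-attention can look like;
this is a sketch of the general idea, not the exact mechanism used in the released code:
```python
import torch

def downweighted_cross_attention(q, k, v, image_token_mask, image_weight=0.5):
    # q: (batch, heads, hw, dim); k, v: (batch, heads, n_cond, dim)
    # image_token_mask: (n_cond,) bool, True where a conditioning token comes
    # from the CLIP image embedding.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, heads, hw, n_cond)
    attn = scores.softmax(dim=-1)

    # Scale down the attention paid to the image-embedding tokens, then
    # renormalize so the attention weights still sum to one.
    attn = torch.where(image_token_mask, attn * image_weight, attn)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v
```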

To conclude, this is still a work in progress, but this model already works a million times better than the first
versions we trained months ago. We hope that more people become interested in this approach, since we believe it has
a lot of potential to become much better than this and to give newcomers an easy-to-understand introduction
to the field of generative AI.