license: mit
Paella
We are releasing a new Paella model that builds on our initial paper, https://arxiv.org/abs/2211.07292. Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and diffusion models. Since the paper's release, we have worked intensively to bring Paella to a level similar to other state-of-the-art models, and with this release we are a step closer to that goal. However, our main intention is not to build the greatest text-to-image model out there (at least for now), but to make text-to-image models more approachable, on a technical level, for people outside the field. For example, many models have codebases with many thousands of lines of code, which makes it hard for people to dive in and understand them. That is the contribution we are most proud of with Paella. The training and sampling code for Paella is minimalistic and can be understood in a few minutes, which makes extensions, quick tests, and trying out ideas extremely fast. For instance, the entire sampling code can be written in just 12 lines of code.
How does Paella work?
Like Stable Diffusion and similar models, Paella works in a latent space, in its case a quantized one, to reduce the compute needed.
Images are encoded into a smaller latent space and converted to visual tokens of shape h x w. During training,
these visual tokens are noised by replacing a random fraction of them with other tokens drawn at random
from the VQGAN codebook. The noised image is given to the model along with a timestep and the conditional
information, which is text in our case. The model is tasked with predicting the un-noised version of the tokens.
And that's it. The model is optimized with the cross-entropy loss between the original tokens and the predicted tokens.
The amount of noise added during training follows a simple linear schedule: we uniformly sample a percentage
between 0 and 100% and noise that fraction of the tokens.
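The noising and loss described above can be sketched in a few lines. This is a dependency-free illustration in plain Python, not the repository's implementation: the names `add_noise` and `cross_entropy`, the choice to replace each token independently with probability `t`, the toy codebook size, and the uniform-logits stand-in for the model are all assumptions made for this example.

```python
import math
import random


def add_noise(tokens, t, codebook_size, rng):
    """Replace each token with a random codebook index with probability t.

    Assumption: tokens are replaced independently, so t=0.0 leaves the
    image untouched and t=1.0 replaces every position.
    """
    noised, mask = [], []
    for tok in tokens:
        replace = rng.random() < t
        noised.append(rng.randrange(codebook_size) if replace else tok)
        mask.append(replace)
    return noised, mask


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-position logits and target token ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(targets)


rng = random.Random(0)
codebook_size = 8  # toy value, chosen only for this example
tokens = [rng.randrange(codebook_size) for _ in range(16)]  # flattened h x w grid

t = rng.random()  # linear schedule: noise level sampled uniformly from [0, 1)
noised, mask = add_noise(tokens, t, codebook_size, rng)

# Stand-in for the model: uniform logits over the codebook at every position.
logits = [[0.0] * codebook_size for _ in noised]
loss = cross_entropy(logits, tokens)  # equals log(codebook_size) for uniform logits
```

In a real training loop the logits would come from the model given `noised`, the timestep `t`, and the text condition, while the loss target stays the original, un-noised `tokens`.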
Sampling is also extremely simple. We start with the entire image being random tokens. Then we feed the latent image,
the timestep and the condition into the model and let it predict the final image. The model outputs a distribution
over every token, which we sample from with standard multinomial sampling.
Since there are infinitely many ways the result could look, a single step yields only very basic
shapes without any details. That is why we add noise to the image again and feed it back to the model. We repeat
this process a number of times, adding less noise each time, and slowly arrive at the final image.
You can see how images emerge here.
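To make the "less noise each time" idea concrete, the sketch below lists the noise level and sampling temperature at each step. It mirrors the two `torch.linspace` calls in the sampling code, using that function's defaults `steps=12` and `temperature=(0.7, 0.3)`; the `linspace` helper is a plain-Python stand-in written for this example.

```python
def linspace(start, end, n):
    """Return n evenly spaced values from start to end, inclusive."""
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


steps = 12
timesteps = linspace(1.0, 0.0, steps + 1)  # noise level fed to the model, ending at 0.0
temperatures = linspace(0.7, 0.3, steps)   # sampling temperature for each step

for i in range(steps):
    next_level = timesteps[i + 1]  # noise level the sample is brought back to afterwards
    print(f"step {i:2d}: noise {timesteps[i]:.3f}, "
          f"temperature {temperatures[i]:.3f}, next noise {next_level:.3f}")
```

With the default `renoise_steps=11`, the last of the 12 steps is not followed by re-noising, which matches the final noise level of 0.0.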
The following is the entire sampling code needed to generate images:
    import torch

    # `model` is assumed to be a loaded Paella model available in the enclosing scope.
    def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
        with torch.inference_mode():
            # Start from an image made entirely of random tokens.
            sampled = torch.randint(0, model.num_labels, size=latent_shape)
            initial_noise = sampled.clone()
            timesteps = torch.linspace(1.0, 0.0, steps+1)
            temperatures = torch.linspace(temperature[0], temperature[1], steps)
            for i, t in enumerate(timesteps[:steps]):
                t = torch.ones(latent_shape[0]) * t
                logits = model(sampled, t, **model_inputs)
                if cfg:
                    # Classifier-free guidance: blend conditional and unconditional logits.
                    logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1-cfg)
                # Sample one token per position from the temperature-scaled distribution.
                sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
                sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])
                if i < renoise_steps:
                    # Re-noise the prediction down to the next, lower noise level.
                    t_next = torch.ones(latent_shape[0]) * timesteps[i+1]
                    sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
        return sampled
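The `cfg` line in the sampling code applies classifier-free guidance: the conditional and unconditional logits are blended so that the result is pushed away from the unconditional prediction. The small sketch below uses made-up logit values to check that `cond * cfg + uncond * (1 - cfg)` is the same as the more familiar form `uncond + cfg * (cond - uncond)`. The function names are chosen for this example only.

```python
def guide_as_written(cond, uncond, cfg):
    """Blend logits the way the sampling code does."""
    return [c * cfg + u * (1 - cfg) for c, u in zip(cond, uncond)]


def guide_extrapolate(cond, uncond, cfg):
    """Equivalent form: start at the unconditional logits and move cfg times the difference."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, 0.5, -1.0]    # made-up logits with the text condition
uncond = [1.0, 1.0, -1.0]  # made-up logits without it

a = guide_as_written(cond, uncond, cfg=8.0)
b = guide_extrapolate(cond, uncond, cfg=8.0)
print(a, b)
```

With `cfg=1.0` the blend returns the conditional logits unchanged; larger values amplify the difference between the two predictions.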
Results
Since Paella is also conditioned on CLIP image embeddings, the following things are also possible:
Technical Details
Model-Architecture: U-Net (Mix of....)
Dataset: Laion-A, Laion Aesthetic > 6.0
Training Steps: 1.3M
Batch Size: 2048
Resolution: 256
VQGAN Compression: f4
Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%)
Optimizer: AdamW
Hardware: 128 A100 @ 80GB
Training Time: ~3 weeks
Learning Rate: 1e-4
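A few quantities follow directly from the numbers above. The short calculation below derives them. It assumes that `f4` denotes a downsampling factor of 4 per spatial dimension and that every training step processed a full batch, so the sample count is approximate and includes repeated images.

```python
resolution = 256            # training image side length in pixels
compression = 4             # VQGAN f4: assumed downsampling factor per side
training_steps = 1_300_000  # "1.3M", rounded in the list above
batch_size = 2048

latent_side = resolution // compression       # side length of the token grid
tokens_per_image = latent_side * latent_side  # visual tokens per image
samples_seen = training_steps * batch_size    # image samples processed in training

print(f"latent grid: {latent_side} x {latent_side} = {tokens_per_image} tokens")
print(f"samples seen: {samples_seen:,}")
```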
More details on the approach, training and sampling can be found in the paper and on GitHub.
Paper, Code Release
Paper: https://arxiv.org/abs/2211.07292
Code: https://github.com/dome272/Paella
Goal
So you see, no heavy math formulas or theorems are needed to achieve good sampling quality. Moreover, no constants such as alpha, beta, alpha_cum_prod etc. are necessary, as they are in diffusion models. This makes the method well suited for people new to the field of generative AI. We hope to lay a foundation for further research in this direction and to contribute to a world where AI is accessible and can be understood by everyone.
Limitations & Conclusion
There are still many things to improve before Paella is on par with standard diffusion models, let alone outperforms them. One primary thing we notice is that even though we condition the model on CLIP image embeddings only 10% of the time, during inference the model relies heavily on the image embeddings generated by a prior model (which maps CLIP text embeddings to image embeddings, as proposed in DALL-E 2). We counteract this by decreasing the importance of the image embeddings through reweighting the attention scores. There is probably a way to prevent this during training already. Other limitations such as lack of composition, text depiction, unawareness of concepts etc. could also be reduced by continuing the training for longer. As a reference, Paella has only seen as many images as SD 1.4 and due to earlier
To conclude, this is still work in progress, but it is our first model that works a million times better than the first versions we trained months ago. We hope that more people become interested in this approach, since we believe it has a lot of potential to become much better than this and to give newcomers an easy-to-understand introduction to the field of generative AI.