WARP is a fully open-source collective featuring:
  • Würstchen (W)
  • Arroz-Con-Cosas (A)
  • Risotto (R)
  • Paella (P)

Welcome to WARP. This is our little organization for multimodal generative models, focusing on the visual domain. We have been working extensively with generative image models and will soon work on video models as well. Our main team consists of:

A special thanks to the Hugging Face team for helping bring our research to Diffusers! (Thanks in particular to Kashif, Patrick and Sayak!)

Feel free to join our Discord channel!


  • A simple & straightforward text-conditional image generation model that works on quantized latents.
  • More details can be found in the paper, the blog post and the YouTube video.
  • Only accessible through GitHub.
  • An efficient text-to-image model, both for training and for inference. It achieves performance competitive with state-of-the-art methods while needing only a fraction of the compute.
  • More details can be found in the paper.
  • Versions:


