Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated notable quality improvements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
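The staged design described in the abstract (text embedding, then a separately trained image prior, then latent diffusion, then an autoencoder decoder) can be illustrated with a toy data-flow sketch. Everything here is a hypothetical placeholder: the function names, the tiny dimensions, the linear "prior", and the simplified denoising loop are assumptions made for illustration and are not the authors' implementation. In particular, conditioning the diffusion stage only on the image embedding is a simplification of this sketch.

```python
import random

EMB_DIM = 8      # hypothetical; real CLIP embeddings are much larger
LATENT_DIM = 4   # hypothetical latent size


def encode_text(prompt):
    """Stand-in for a CLIP text encoder: a deterministic pseudo-embedding."""
    rng = random.Random(prompt)  # string seeds are deterministic across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]


def image_prior(text_emb):
    """Stand-in for the separately trained prior: text emb -> image emb."""
    return [0.5 * x for x in text_emb]  # placeholder mapping, not a real model


def latent_diffusion(image_emb, steps=10, seed=0):
    """Stand-in for the denoising loop in latent space.

    Starts from Gaussian noise and moves the latent halfway toward a target
    derived from the image embedding at every step.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    target = image_emb[:LATENT_DIM]
    for _ in range(steps):
        latent = [l + 0.5 * (t - l) for l, t in zip(latent, target)]
    return latent


def decode(latent):
    """Stand-in for the MoVQ decoder: maps the latent to values in [0, 1]."""
    return [min(1.0, max(0.0, 0.5 + 0.5 * v)) for v in latent]


def generate(prompt):
    """Chain the four stages: text emb -> image emb -> latent -> 'pixels'."""
    text_emb = encode_text(prompt)
    image_emb = image_prior(text_emb)
    latent = latent_diffusion(image_emb)
    return decode(latent)


if __name__ == "__main__":
    print(generate("a red cat on a sofa"))
```

The point of the sketch is only the ordering of stages: the prior is a separate component that sits between the text encoder and the diffusion model, and the decoder is applied once after denoising finishes.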
I have some ethical concerns regarding this article. The authors of this article have affiliations with organizations that have been subject to international sanctions due to their ties to the Russian government and military-industrial complex. These affiliations include:
- Skoltech - Subject to sanctions imposed by the USA, Switzerland, Australia, Ukraine, and New Zealand (source).
- Sber AI (Sberbank) - Facing sanctions from the EU, UK, USA, Canada, Switzerland, Australia, Japan, Ukraine, and New Zealand (source).
- AIRI - Owned by Sberbank.
It's important to note that Sberbank is the largest financial institution in Russia and is majority-owned by the Government of Russia (GoR). It holds the largest market share of savings deposits in the country, is the main creditor of the Russian economy, and is considered by the GoR to be a systemically important financial institution.
Since the Russian Federation plans to allot a third of 2024 spending to defence, it's evident that Sberbank plays a pivotal role in financing the war against the Ukrainian people.
Today, I cannot remain silent: a Russian strike on my hometown of Kharkiv claimed the life of a 10-year-old boy.
I understand that some may argue that science is separate from politics, but in this context, such an assertion could be seen as a form of manipulation. I would like to highlight a relevant statement from a U.S. government press release regarding sanctions against Skoltech:
"Over the course of the last decade, Skoltech has had partnerships with numerous Russian defense enterprises – including Uralvagonzavod, United Engine Corporation, and United Aircraft Corporation – which have focused on developing composite materials for tanks, engines for ships, specialized materials for aircraft wings, and innovations for defense-related helicopters."
Considering this information, it is conceivable that some of the authors of this work could be associated with the development of resources used in the war against Ukraine. My perspective is rooted in the fact that Russia has increased its focus on military development, a trend reported by major international media.
Please take my concerns seriously. I believe that sharing and endorsing products and articles from such companies raises ethical questions. Furthermore, offering paid services like Inference Endpoints and others could potentially violate sanctions.
Thank you for your attention.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (2023)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation (2023)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability (2023)
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (2023)
- AltDiffusion: A Multilingual Text-to-Image Diffusion Model (2023)