Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
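The win rates quoted in the abstract are pairwise preference rates: for each prompt, evaluators compare images from two models and pick one. As a rough illustration of how such a number could be computed, here is a minimal sketch. The function name `win_rate`, the vote labels, and the choice to exclude ties are all assumptions made for this example; the abstract does not describe the exact evaluation protocol, and the votes below are toy data, not results from the paper.

```python
from collections import Counter


def win_rate(votes, model="A"):
    """Fraction of decided comparisons in which `model` was preferred.

    `votes` holds one label per prompt: "A", "B", or "tie".
    Dropping ties from the denominator is an assumption of this sketch.
    """
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[model] / decided


# Toy preference labels (hypothetical, not from the paper).
votes = ["A", "A", "B", "tie", "A", "B", "A", "A"]
rate = win_rate(votes)
print(f"Model A win rate: {rate:.1%}")
```

Whether ties are dropped, split evenly, or counted as losses changes the reported number, so any comparison between win rates should use the same convention.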
Do you have a plan to release the high-quality image dataset for your quality-tuning?
It would be very helpful to the vision community.
Here is an AI-generated summary
The paper proposes quality-tuning, fine-tuning a pre-trained text-to-image model on a small set of exceptionally high-quality images, to align the model to generate highly aesthetic images.
The key insight is that fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve the visual appeal of generated images without compromising generality.
- Fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve visual appeal.
- Image quality is far more important than quantity for the fine-tuning data.
- Following basic principles of photography leads to more aesthetic images across different styles.
- Quality-tuning improves visual appeal without sacrificing generality of concepts or faithfulness.
- Quality-tuning is effective for various architectures like pixel diffusion and masked transformers.
- Quality-tuning is analogous to instruction tuning for language models - both require high-quality data.
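The resulting quality-tuned model, Emu, achieves an 82.9% win rate against its pre-trained-only counterpart and is preferred over the SOTA model SDXLv1.0 on visual appeal 68.4% of the time on PartiPrompts and 71.3% of the time on the Open User Input benchmark.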
Another paper that could be just a blog post. Not sure where the novelty is.
Fine-tuning text-to-image models on high-quality images has been done by the community for more than a year. Furthermore, the paper does not disclose any detailed hyperparameters for the fine-tuning.