Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Published on Sep 27, 2023
· Featured in Daily Papers on Sep 28, 2023


Training text-to-image models on web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images, creating a need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning on a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% against its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal 68.4% and 71.3% of the time on the standard PartiPrompts benchmark and on our Open User Input benchmark based on real-world usage of text-to-image models, respectively. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.


Do you have a plan to release the high-quality image dataset for your quality-tuning?
It would be very helpful to the vision community.

Here is an AI-generated summary


The paper proposes quality-tuning, fine-tuning a pre-trained text-to-image model on a small set of exceptionally high-quality images, to align the model to generate highly aesthetic images.

The key insight is that fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve the visual appeal of generated images without compromising generality.


  • Fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve visual appeal.
  • Image quality is far more important than quantity for the fine-tuning data.
  • Following basic principles of photography leads to more aesthetic images across different styles.
  • Quality-tuning improves visual appeal without sacrificing generality of concepts or faithfulness.
  • Quality-tuning is effective for various architectures like pixel diffusion and masked transformers.
  • Quality-tuning is analogous to instruction tuning for language models: both require high-quality data.
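The recipe above is, at its core, ordinary supervised fine-tuning with a standard diffusion noise-prediction loss, applied with a small learning rate to a tiny, hand-curated dataset. The following is a minimal, self-contained sketch of that idea; the toy MLP denoiser, the linear noising schedule, the latent dimensions, and the hyperparameters are all illustrative assumptions, not the paper's actual architecture or settings (the paper fine-tunes a latent diffusion U-Net and does not disclose its hyperparameters):

```python
# Sketch of quality-tuning: fine-tune a (toy) noise-prediction model on a
# small curated batch. Everything below is a stand-in, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the pre-trained denoiser (the paper uses a latent diffusion
# U-Net; a small MLP keeps this sketch self-contained and fast).
denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

# The paper curates "a few thousand" high-quality images; here, 64 random
# vectors stand in for their latent representations.
curated_latents = torch.randn(64, 16)

# Quality-tuning = ordinary fine-tuning: small LR, few steps, tiny dataset.
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-3)
losses = []
for step in range(300):
    noise = torch.randn_like(curated_latents)
    t = torch.rand(curated_latents.size(0), 1)           # random timestep per sample
    noisy = (1 - t) * curated_latents + t * noise        # simple linear noising schedule
    loss = nn.functional.mse_loss(denoiser(noisy), noise)  # predict the added noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The key point the paper makes is not the training loop itself, which is standard, but the data: a few thousand manually selected, highly aesthetic images are enough to shift the model's output distribution without eroding concept coverage.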

The resulting quality-tuned model Emu significantly outperforms the pre-trained model and SOTA model SDXLv1.0 in visual appeal, preferred over 70% of the time.

Anyone implementing the channel increase for VAEs?

Another paper that could be just a blog post. Not sure where the novelty is.



Fine-tuning on quality images for text-to-image models has been done by the community for more than a year. Furthermore, the paper does not disclose any of the fine-tuning hyperparameters.


@jhou90 is there a chance that the pre-trained model and fine-tuning script could be made public?

