arxiv:2402.03286

Training-Free Consistent Text-to-Image Generation

Published on Feb 5
· Featured in Daily Papers on Feb 6

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
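
As a rough intuition for the "subject-driven shared attention block" mentioned in the abstract, here is a minimal illustrative sketch (not the authors' implementation, which has not been released): each image's self-attention is extended with keys and values taken from subject-masked patches of the other images in the batch, so the subject's appearance can propagate across all generations. The function name, tensor shapes, and masking scheme below are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def shared_subject_attention(q, k, v, subject_mask):
    """
    q, k, v:      (batch, tokens, dim)  per-image self-attention projections
    subject_mask: (batch, tokens) bool  patches believed to contain the subject
    """
    b, t, d = q.shape
    outputs = []
    for i in range(b):
        others = [j for j in range(b) if j != i]
        if others:
            # Keys/values of subject patches from the *other* images in the batch.
            extra_k = torch.cat([k[j][subject_mask[j]] for j in others], dim=0)
            extra_v = torch.cat([v[j][subject_mask[j]] for j in others], dim=0)
            # Each image attends to its own tokens plus the shared subject tokens.
            k_i = torch.cat([k[i], extra_k], dim=0)
            v_i = torch.cat([v[i], extra_v], dim=0)
        else:
            # Degenerate single-image batch: plain self-attention.
            k_i, v_i = k[i], v[i]

        attn = F.softmax(q[i] @ k_i.T / d ** 0.5, dim=-1)
        outputs.append(attn @ v_i)

    return torch.stack(outputs)  # (batch, tokens, dim)

# Toy usage with random activations for a batch of 3 prompts, 64 tokens, dim 320.
q = k = v = torch.randn(3, 64, 320)
mask = torch.zeros(3, 64, dtype=torch.bool)
mask[:, :16] = True  # pretend the first 16 patches contain the subject
out = shared_subject_attention(q, k, v, mask)  # (3, 64, 320)
```

In the paper this kind of sharing happens inside the self-attention layers of the pretrained diffusion model during denoising, combined with correspondence-based feature injection and layout-diversity strategies; the sketch above only conveys the cross-image attention idea.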

Community

We already do this with DreamBooth; what is the difference/advantage?

And still no code: https://consistory-paper.github.io/

@MonsterMMORPG I think it is because there's no need for training! This happens at inference time: you can either keep the same characters generated by the model (say, if you prompt "a photo of a dog", the same dog is used across generations), or bring in subjects using a technique called inversion, which requires no training, fine-tuning, or a LoRA like with DreamBooth!

Promising, excited for the code release 🔥

Will the code be released?

Thanks. We will see if that happens. InstantID and face IP-Adapter are really nothing like DreamBooth yet.

Cooooooode 😁

Need a ComfyUI workflow

Will this help with creating consistent characters in animated videos, e.g. dance moves? What about inpainting?

Any clue about the code release date? It would be really helpful.
