arxiv:2306.05427

Grounded Text-to-Image Synthesis with Attention Refocusing

Published on Jun 8, 2023
· Featured in Daily Papers on Jun 9, 2023

Abstract

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved in the prompt. In this paper, we identify the potential reasons in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses to refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve their alignment between the generated images and the text prompts.
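Below is a minimal sketch, not the authors' implementation, of the core idea the abstract describes: during sampling, a cross-attention loss pulls each object token's attention mass into its layout box, a self-attention loss discourages pixels inside one box from attending to other boxes, and the latent is nudged by the gradient of the combined loss. All tensors and function names here (boxes, attn_maps, self_attn, the loss functions) are hypothetical placeholders using synthetic data rather than a real UNet's attention maps.

import torch

def cross_attention_refocus_loss(attn_maps: torch.Tensor,
                                 boxes: torch.Tensor) -> torch.Tensor:
    """attn_maps: (num_objects, H, W) cross-attention maps, one per object token.
    boxes: (num_objects, H, W) binary masks of the target layout regions.
    High when little of an object's attention falls inside its box."""
    inside = (attn_maps * boxes).flatten(1).sum(-1)
    total = attn_maps.flatten(1).sum(-1) + 1e-8
    return ((1.0 - inside / total) ** 2).mean()

def self_attention_refocus_loss(self_attn: torch.Tensor,
                                boxes: torch.Tensor) -> torch.Tensor:
    """self_attn: (H*W, H, W) self-attention of each query pixel over the image.
    Penalizes attention leaking from one object's box into the other boxes."""
    num_obj = boxes.shape[0]
    loss = 0.0
    for i in range(num_obj):
        queries_in_box = boxes[i].flatten().bool()            # (H*W,)
        other_boxes = (boxes.sum(0) - boxes[i]).clamp(0, 1)   # (H, W)
        attn_from_box = self_attn[queries_in_box]              # (n_i, H, W)
        leaked = (attn_from_box * other_boxes).flatten(1).sum(-1)
        total = attn_from_box.flatten(1).sum(-1) + 1e-8
        loss = loss + (leaked / total).mean()
    return loss / num_obj

# Toy usage: one loss-guided latent update between denoising steps.
H = W = 16
boxes = torch.zeros(2, H, W)
boxes[0, :, :8] = 1.0   # object 1 on the left half
boxes[1, :, 8:] = 1.0   # object 2 on the right half

latent = torch.randn(1, 4, H, W, requires_grad=True)
# Stand-ins for the attention maps a UNet would produce from `latent`.
attn_maps = torch.softmax(latent.mean(1).flatten(1), dim=-1).view(1, H, W).repeat(2, 1, 1)
self_attn = torch.rand(H * W, H, W).softmax(dim=-1)  # placeholder, latent-independent

loss = cross_attention_refocus_loss(attn_maps, boxes) \
     + self_attention_refocus_loss(self_attn, boxes)
grad, = torch.autograd.grad(loss, latent)
latent = (latent - 0.1 * grad).detach()  # nudge the latent before the next step
print(f"refocusing loss: {loss.item():.4f}")

In the paper's setting the layout boxes would come from a Large Language Model given the prompt, and the update would be applied at each sampling step inside an existing text-to-image pipeline; the step size and loss weighting here are arbitrary.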

Community

Amazing paper. I think grounded generation has not been explored enough in the text-to-image setting! Check out these two papers that use grounding in the video domain.

Grounded Video Editing: Ground-A-Video (https://ground-a-video.github.io/)
Grounded Video Generation: LLM-grounded VDM (https://llm-grounded-video-diffusion.github.io/)

