Abstract
Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
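To make the ROI-Align / ROI-Unpool pairing concrete, here is a minimal, hypothetical PyTorch sketch: `torchvision.ops.roi_align` crops and resamples per-box features, and an illustrative helper named `roi_unpool_naive` pastes per-ROI feature tiles back onto the full-resolution feature map. The helper's name, tensor shapes, and the rounded-coordinate paste are assumptions for exposition only; the paper describes ROI-Unpool as explicit and accurate, so it presumably avoids the coordinate rounding used in this rough sketch.

```python
# Conceptual sketch only (not the paper's implementation): pair ROI-Align with
# a simple "unpool" that pastes per-ROI features back onto the feature map.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def roi_unpool_naive(roi_feats, boxes, out_shape, spatial_scale=1.0):
    """Scatter per-ROI feature tiles back onto a full-resolution feature map.

    roi_feats: (K, C, h, w) features produced per ROI (e.g., after attending
               to each instance caption).
    boxes:     (K, 5) rows of [batch_idx, x1, y1, x2, y2] in image coordinates.
    out_shape: (B, C, H, W) shape of the target feature map.
    """
    out = torch.zeros(out_shape, dtype=roi_feats.dtype, device=roi_feats.device)
    for feat, box in zip(roi_feats, boxes):
        b = int(box[0])
        # Map the box to feature-map coordinates and round -- the source of
        # quantization error that an exact ROI-Unpool would be designed to avoid.
        x1, y1, x2, y2 = (box[1:] * spatial_scale).round().long().tolist()
        x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
        tile = F.interpolate(feat[None], size=(y2 - y1, x2 - x1),
                             mode="bilinear", align_corners=False)[0]
        out[b, :, y1:y2, x1:x2] = tile  # later ROIs overwrite earlier ones
    return out


# Toy round trip on a single 64x64 feature map with two instance boxes.
feats = torch.randn(1, 8, 64, 64)
boxes = torch.tensor([[0, 4.0, 4.0, 36.0, 36.0],
                      [0, 20.0, 28.0, 60.0, 60.0]])
roi_feats = roi_align(feats, boxes, output_size=(16, 16),
                      spatial_scale=1.0, aligned=True)   # (2, 8, 16, 16)
restored = roi_unpool_naive(roi_feats, boxes, feats.shape)  # (1, 8, 64, 64)
```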
Community
TL;DR: ROICtrl, built on ROI-Align and the newly proposed ROI-Unpool, can extend existing diffusion models and their add-ons (e.g., ControlNet, T2I-Adapter, IP-Adapter, ED-LoRA) to support controllable multi-instance generation.
Project page: https://roictrl.github.io/
Code will be released at: https://github.com/showlab/ROICtrl
The following papers, similar to this one, were recommended by the Semantic Scholar API:
- LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation (2024)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction (2024)
- 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation (2024)
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation (2024)
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement (2024)
- Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis (2024)
- Minority-Focused Text-to-Image Generation via Prompt Optimization (2024)