InstanceDiffusion: Instance-level Control for Image Generation

We introduce InstanceDiffusion, which adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows instance locations to be specified flexibly: as single points, scribbles, bounding boxes, intricate instance segmentation masks, or combinations thereof. Compared to the previous state of the art, InstanceDiffusion achieves 2.0 times higher AP50 for box inputs and 1.7 times higher IoU for mask inputs.

Xudong Wang, Trevor Darrell, Saketh Rambhatla, Rohit Girdhar, Ishan Misra
GenAI, Meta; BAIR, UC Berkeley
Preprint

Model Sources

Repository: [https://github.com/frank-xwang/InstanceDiffusion]
Paper: [https://arxiv.org/pdf/2402.03290.pdf]
Project Page: [https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/]

Model Description

InstanceDiffusion enhances text-to-image models by providing additional instance-level control. In addition to a global text prompt, InstanceDiffusion allows paired instance-level prompts and their locations (e.g. points, boxes, scribbles or instance masks) to be specified when generating images. We add our proposed learnable UniFusion blocks to handle the additional per-instance conditioning. UniFusion fuses the instance conditioning with the backbone and modulates its features to enable instance-conditioned image generation. Additionally, we propose ScaleU blocks that improve the UNet's ability to respect instance conditioning by rescaling the skip-connection and backbone feature maps produced in the UNet. At inference, we propose the Multi-instance Sampler, which reduces information leakage across multiple instances.
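To make the conditioning inputs concrete, the sketch below shows one way a global prompt and its paired per-instance prompts and locations could be represented. It is an illustration only: the class names (InstanceCondition, GenerationRequest), the field names, and the choice of coordinates normalized to [0, 1] are assumptions made for this example, not the input format used by the repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration; coordinates are assumed normalized to [0, 1].
Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class InstanceCondition:
    """One instance: a free-form prompt paired with one or more locations."""
    prompt: str
    point: Optional[Point] = None
    box: Optional[Box] = None
    scribble: Optional[List[Point]] = None
    mask: Optional[List[List[int]]] = None  # binary mask as rows of 0/1

    def location_kinds(self) -> List[str]:
        """Return the names of the location types that were provided."""
        names = ("point", "box", "scribble", "mask")
        return [n for n in names if getattr(self, n) is not None]

    def validate(self) -> None:
        if not self.prompt.strip():
            raise ValueError("instance prompt must be non-empty")
        if not self.location_kinds():
            raise ValueError("each instance needs at least one location")
        if self.box is not None:
            x0, y0, x1, y1 = self.box
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                raise ValueError(f"box out of range or degenerate: {self.box}")


@dataclass
class GenerationRequest:
    """A global text prompt plus any number of instance-level conditions."""
    global_prompt: str
    instances: List[InstanceCondition] = field(default_factory=list)

    def validate(self) -> None:
        for inst in self.instances:
            inst.validate()


request = GenerationRequest(
    global_prompt="a dog and a cat sitting on a sofa",
    instances=[
        # Instance located by a bounding box.
        InstanceCondition(prompt="a golden retriever",
                          box=(0.05, 0.30, 0.50, 0.95)),
        # Instance located by a combination of a point and a scribble.
        InstanceCondition(prompt="a black cat",
                          point=(0.75, 0.60),
                          scribble=[(0.70, 0.55), (0.80, 0.65)]),
    ],
)
request.validate()
```

Keeping every location type optional on a single record mirrors the description above, where an instance may be given by a point, a box, a scribble, a mask, or several of these at once.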

Please check our paper and project page for more details.

Citation

If you find our work inspiring or use our codebase in your research, please consider giving the repository a star ⭐ and citing the paper.

@misc{wang2024instancediffusion,
      title={InstanceDiffusion: Instance-level Control for Image Generation}, 
      author={Xudong Wang and Trevor Darrell and Sai Saketh Rambhatla and Rohit Girdhar and Ishan Misra},
      year={2024},
      eprint={2402.03290},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Disclaimer

This repository is a re-implementation of InstanceDiffusion written by the first author during his time at UC Berkeley. Minor performance discrepancies (differences of about 1% in AP) may exist compared to the results reported in the original paper. The goal of this repository is to replicate the original paper's findings and insights, primarily for academic and research purposes.