---
library_name: diffusers
---
# SPRIGHT-T2I Model Card
The SPRIGHT-T2I model is a text-to-image diffusion model with high spatial coherency. It was first introduced in [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://),
authored by Agneet Chatterjee<sup>\*</sup>, Gabriela Ben Melech Stan<sup>\*</sup>, Estelle Aflalo, Sayak Paul, Dhruba Ghosh,
Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang. _(<sup>\*</sup>denotes equal contributions)_
The SPRIGHT-T2I model was fine-tuned from [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on a subset
of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images and spatially focused
captions. Leveraging SPRIGHT, along with efficient training techniques, we achieve state-of-the-art
performance in generating spatially accurate images from text.
## Table of contents
* [Model details](#model-details)
* [Usage](#usage)
* [Bias and Limitations](#bias-and-limitations)
* [Training](#training)
* [Evaluation](#evaluation)
* [Model Resources](#model-resources)
* [Citation](#citation)
The training code and more details are available in the [SPRIGHT-T2I GitHub Repository](https://github.com/orgs/SPRIGHT-T2I).
A demo is available on [Spaces](https://huggingface.co/spaces/SPRIGHT-T2I/SPRIGHT-T2I).
Use SPRIGHT-T2I with 🧨 [`diffusers`](https://huggingface.co/SPRIGHT-T2I/spright-t2i-sd2#usage).
## Model Details
- **Developed by:** Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang
- **Model type:** Diffusion-based text-to-image generation model with spatial coherency
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
- **Finetuned from model:** [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1)
## Usage
Use the code below to run SPRIGHT-T2I with the [🤗 Diffusers library](https://github.com/huggingface/diffusers).
```bash
pip install diffusers transformers accelerate -U
```
Running the pipeline:
```python
import torch
from diffusers import DiffusionPipeline

pipe_id = "SPRIGHT-T2I/spright-t2i-sd2"

# Load the pipeline in half precision and move it to the GPU
pipe = DiffusionPipeline.from_pretrained(
    pipe_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# Generate a single image from a spatially descriptive prompt and save it
prompt = "a cute kitten is sitting in a dish on a table"
image = pipe(prompt).images[0]
image.save("kitten_sitting_in_a_dish.png")
```
<div align="center">
<img src="kitten_sitting_in_a_dish.png" width="300" alt="img">
</div><br>
Additional examples that emphasize spatial coherence:
<div align="center">
<img src="result_images/visor.png" width="1000" alt="img">
</div><br>
## Bias and Limitations
The biases and limitations specified for [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) apply here as well.
## Training
#### Training Data
Our training and validation sets are subsets of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright) and consist of 444 and
50 images, respectively, randomly sampled in a 50:50 split between LAION-Aesthetics and Segment Anything. Each image is paired with both a general and a spatial caption
(from SPRIGHT). During fine-tuning, we randomly choose one of the two caption types for each image with equal probability.
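The 50:50 caption selection described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `pick_caption` and the example captions are hypothetical, and the sketch assumes each training sample exposes its general and spatial captions as plain strings.

```python
import random


def pick_caption(general: str, spatial: str, rng: random.Random) -> str:
    """Return the spatial or the general caption with equal probability.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return spatial if rng.random() < 0.5 else general


# Seeded generator so that the choice is reproducible across runs
rng = random.Random(0)
caption = pick_caption(
    general="a kitten sitting in a dish",
    spatial="a kitten sits inside a dish that is on top of a table",
    rng=rng,
)
print(caption)
```

Passing the random generator explicitly keeps the sampling reproducible and easy to test.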
We find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships.
Additionally, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency.
To construct our dataset, we focused on images containing more than 18 objects, using the open-world image tagging model
[Recognize Anything](https://huggingface.co/xinyu1205/recognize-anything-plus-model) to enforce this constraint.
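As a rough sketch of the object-count filter, the snippet below keeps only images whose tag list exceeds the threshold. It assumes the tags for each image have already been produced by a tagging model and stored in a dictionary, and that the "object count" is the number of distinct tags; both are assumptions made for illustration. The names `filter_by_object_count` and `tags_per_image` are hypothetical and not part of the Recognize Anything API.

```python
from typing import Dict, List

MIN_OBJECTS = 18  # threshold from the text: keep images with more than 18 objects


def filter_by_object_count(
    tags_per_image: Dict[str, List[str]],
    min_objects: int = MIN_OBJECTS,
) -> List[str]:
    """Return IDs of images whose number of distinct tags exceeds `min_objects`.

    Assumption: one distinct tag is counted as one object.
    """
    return [
        image_id
        for image_id, tags in tags_per_image.items()
        if len(set(tags)) > min_objects
    ]


# Toy input: one crowded image with 20 distinct tags and one sparse image
example = {
    "crowded.jpg": [f"object_{i}" for i in range(20)],
    "sparse.jpg": ["cat", "table"],
}
print(filter_by_object_count(example))
```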
#### Training Procedure
Our base model is Stable Diffusion v2.1. We fine-tune both the U-Net and the OpenCLIP-ViT/H text encoder for 10,000 steps, using a different learning rate for each.
- **Training regime:** fp16 mixed precision
- **Optimizer:** AdamW
- **Gradient accumulations:** 1
- **Batch:** 4 x 8 = 32
- **UNet learning rate:** 0.00005
- **CLIP text-encoder learning rate:** 0.000001
- **Hardware:** Training was performed using NVIDIA RTX A6000 GPUs and Intel® Gaudi® 2 AI accelerators.
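For reference, the hyperparameters listed above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and reading "4 x 8" as a per-device batch size of 4 on 8 devices is an assumption, since the text gives only the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the hyperparameters listed above."""

    max_train_steps: int = 10_000
    mixed_precision: str = "fp16"
    optimizer: str = "AdamW"
    gradient_accumulation_steps: int = 1
    per_device_batch_size: int = 4  # assumption: the "4" in "4 x 8"
    num_devices: int = 8            # assumption: the "8" in "4 x 8"
    unet_lr: float = 5e-5           # 0.00005
    text_encoder_lr: float = 1e-6   # 0.000001

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step
        return (
            self.per_device_batch_size
            * self.num_devices
            * self.gradient_accumulation_steps
        )


config = FinetuneConfig()
print(config.effective_batch_size)  # 4 * 8 * 1 = 32
```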
## Evaluation
Compared to the baseline SD 2.1 model, SPRIGHT-T2I substantially improves spatial accuracy while also improving the non-spatial aspects of text-to-image generation.
The following table compares our SPRIGHT-T2I model with SD 2.1 across multiple spatial reasoning and image quality metrics:
|Method            |OA(%) ↑|VISOR-4(%) ↑|T2I-CompBench ↑|FID ↓ |CMMD ↓|
|------------------|-------|------------|---------------|------|------|
|SD v2.1           |47.83  |4.70        |0.1507         |21.646|1.060 |
|SPRIGHT-T2I (ours)|60.68  |16.15       |0.2133         |16.149|0.512 |
Our key findings are:
- Increases the Object Accuracy (OA) score by a relative 26.86% (from 47.83% to 60.68%), indicating that the model is much better at generating the objects mentioned in the input prompt.
- Achieves a VISOR-4 score of 16.15%, indicating that for a given input prompt, the model more consistently generates a spatially accurate image.
- Improves on all aspects of the VISOR score while improving the ZS-FID and CMMD scores on COCO-30K images by 23.74% and 51.69%, respectively.
- Enhances the ability to generate 1 and 2 objects, along with generating the correct number of objects, as indicated by evaluation on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
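The relative changes quoted above can be recomputed from the table with a few lines of Python. This is an illustrative sketch only: the helper name `relative_change` is hypothetical, and values recomputed from the rounded table entries need not match the quoted percentages exactly, since those may have been derived from unrounded scores.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline` (positive = increase)."""
    return (new - baseline) / baseline * 100.0


# (SD v2.1, SPRIGHT-T2I) pairs copied from the table above
table = {
    "OA": (47.83, 60.68),     # higher is better
    "FID": (21.646, 16.149),  # lower is better
    "CMMD": (1.060, 0.512),   # lower is better
}

for name, (sd21, spright) in table.items():
    print(f"{name}: {relative_change(sd21, spright):+.2f}%")
```

A positive value is an increase over the baseline and a negative value is a reduction, so improvements on FID and CMMD show up as negative percentages.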
## Model Resources
- **Dataset:** [SPRIGHT Dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright)
- **Repository:** [SPRIGHT-T2I GitHub Repository](https://github.com/orgs/SPRIGHT-T2I)
- **Paper:** [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://)
- **Demo:** [SPRIGHT-T2I on Spaces](https://huggingface.co/spaces/SPRIGHT-T2I/SPRIGHT-T2I)
- **Project Website:** [SPRIGHT Website](https://spright.github.io/)
## Citation
Coming soon