	---
	library_name: diffusers
	---

	# SPRIGHT-T2I Model Card

	The SPRIGHT-T2I model is a text-to-image diffusion model with high spatial coherency. It was first introduced in [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://), authored by Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo,
	Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang.

	The SPRIGHT-T2I model was fine-tuned from Stable Diffusion v2.1 on a customized subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images paired with spatially focused captions. By leveraging SPRIGHT together with efficient training techniques, we achieve state-of-the-art performance in generating spatially accurate images from text.

	The training code and more details are available in the [SPRIGHT-T2I GitHub Repository](https://github.com/orgs/SPRIGHT-T2I).

	A demo is available on [Spaces](https://huggingface.co/spaces/SPRIGHT-T2I/SPRIGHT-T2I).

	Use SPRIGHT-T2I with 🧨 [`diffusers`](https://huggingface.co/SPRIGHT-T2I/spright-t2i-sd2#usage).


	## Model Details

	- Developed by: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang
	- Model type: Diffusion-based text-to-image generation model with spatial coherency
	- Language(s) (NLP): English
	- License: [More Information Needed]
	- Finetuned from model: [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1)

	## Usage

	Use the code below to run SPRIGHT-T2I with the [🤗 Diffusers library](https://github.com/huggingface/diffusers).

	```bash
	pip install diffusers transformers accelerate scipy safetensors
	```

	Running the pipeline:

	```python
	import torch
	from diffusers import DiffusionPipeline

	pipe_id = "SPRIGHT-T2I/spright-t2i-sd2"

	# Load the pipeline in half precision and move it to the GPU.
	pipe = DiffusionPipeline.from_pretrained(
	    pipe_id,
	    torch_dtype=torch.float16,
	    use_safetensors=True,
	).to("cuda")

	prompt = "a cute kitten is sitting in a dish on a table"
	image = pipe(prompt).images[0]

	image.save("kitten_sitting_in_a_dish.png")
	```

	<img src="kitten_sitting_in_a_dish.png" width="300" alt="img">

	Additional examples that emphasize spatial coherence:
	<img src="result_images/visor.png" width="1000" alt="img">

	## Uses, Bias and Limitations

	The uses, limitations, and biases described for [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) also apply to this model.

	## Training Details

	### Training Data

	Our training and validation sets are a customized subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright) and consist of 444 and
	50 images, respectively, randomly sampled in a 50:50 split between LAION-Aesthetics and Segment Anything. Each image is paired with both a general and a spatial caption
	(from SPRIGHT). During fine-tuning, for each image, we randomly choose one of the two caption types in a 50:50 ratio.
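
The per-image caption choice described above can be sketched as follows. This is a minimal illustration, not the training code: the `CaptionedImage` record, its field names, and the `pick_caption` helper are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class CaptionedImage:
    """Hypothetical record: one image together with its two caption types."""
    image_path: str
    general_caption: str
    spatial_caption: str


def pick_caption(sample: CaptionedImage, rng: random.Random) -> str:
    """Return the spatial or the general caption with equal probability."""
    if rng.random() < 0.5:
        return sample.spatial_caption
    return sample.general_caption


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is repeatable
    sample = CaptionedImage(
        image_path="example.jpg",
        general_caption="a kitchen with a table",
        spatial_caption="a bowl sits on the table, to the left of a vase",
    )
    draws = [pick_caption(sample, rng) for _ in range(10_000)]
    share = draws.count(sample.spatial_caption) / len(draws)
    print(f"share of spatial captions: {share:.3f}")
```

Drawing the caption anew each time an image is visited means the same image can contribute both caption types over the course of training.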

	We find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships.
	Additionally, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency.
	To construct our dataset, we focused on images with object counts larger than 18, using the open-world image tagging model
	[Recognize Anything](https://huggingface.co/xinyu1205/recognize-anything-plus-model) to enforce this constraint.
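
A minimal sketch of the object-count filter follows. It assumes the tags for each image have already been produced by the tagging model and stored as a list of strings. The function name, the `tags_by_image` mapping, and the use of distinct tags as the object count are assumptions made for illustration; the call to Recognize Anything itself is not shown.

```python
from typing import Dict, List

MIN_OBJECTS = 18  # threshold from the text: keep images with more than 18 objects


def filter_by_object_count(
    tags_by_image: Dict[str, List[str]],
    min_objects: int = MIN_OBJECTS,
) -> List[str]:
    """Return the image IDs whose number of distinct tags exceeds min_objects."""
    return [
        image_id
        for image_id, tags in tags_by_image.items()
        if len(set(tags)) > min_objects
    ]


if __name__ == "__main__":
    # Toy input: hypothetical image IDs mapped to precomputed tags.
    toy = {
        "busy_street.jpg": [f"object_{i}" for i in range(25)],
        "single_cat.jpg": ["cat"],
    }
    print(filter_by_object_count(toy))
```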

	### Training Procedure

	Our base model is Stable Diffusion v2.1. We fine-tune both the U-Net and the OpenCLIP-ViT/H text encoder for 10,000 steps, using a separate learning rate for each.

	- Training regime: fp16 mixed precision
	- Optimizer: AdamW
	- Gradient accumulation steps: 1
	- Batch size: 4 x 8 = 32
	- UNet learning rate: 0.00005
	- CLIP text-encoder learning rate: 0.000001
	- Hardware: Training was performed using Intel Gaudi 2 and NVIDIA RTX A6000 GPUs
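
The settings above imply a few derived quantities, worked out below. This is back-of-the-envelope arithmetic rather than anything taken from the training code: it assumes every optimizer step consumes one full batch of 32 images, and the `TrainConfig` name and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    max_steps: int = 10_000
    batch_factors: tuple = (4, 8)   # the "4 x 8" from the list
    grad_accum_steps: int = 1
    unet_lr: float = 5e-5
    text_encoder_lr: float = 1e-6
    num_train_images: int = 444     # training-set size from "Training Data"

    @property
    def effective_batch_size(self) -> int:
        # Product of the two batch factors, scaled by gradient accumulation.
        return self.batch_factors[0] * self.batch_factors[1] * self.grad_accum_steps

    @property
    def images_seen(self) -> int:
        # Total images processed, assuming a full batch at every step.
        return self.max_steps * self.effective_batch_size

    @property
    def approx_passes(self) -> float:
        # Approximate number of passes over the training set.
        return self.images_seen / self.num_train_images

    @property
    def lr_ratio(self) -> float:
        # How much larger the U-Net learning rate is than the text encoder's.
        return self.unet_lr / self.text_encoder_lr


if __name__ == "__main__":
    cfg = TrainConfig()
    print("effective batch size:", cfg.effective_batch_size)
    print("images seen:", cfg.images_seen)
    print("approx. passes over the training set:", round(cfg.approx_passes))
    print("U-Net LR / text-encoder LR:", round(cfg.lr_ratio))
```

Under these assumptions the small training set is revisited many hundreds of times, and the text encoder is updated with a learning rate roughly fifty times smaller than the U-Net's.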


	## Evaluation

	We find that, compared to the baseline model SD 2.1, we substantially improve spatial accuracy while also enhancing the non-spatial aspects associated with a text-to-image model.

	The following table compares our SPRIGHT-T2I model with SD 2.1 across multiple spatial reasoning and image quality metrics:

	| Method | OA (%) ↑ | VISOR-4 (%) ↑ | T2I-CompBench ↑ | FID ↓ | CMMD ↓ |
	|--------------------|-------|-------|--------|-------|-------|
	| SD v2.1 | 47.83 | 4.70 | 0.1507 | 27.39 | 1.060 |
	| SPRIGHT-T2I (ours) | 60.68 | 16.15 | 0.2133 | 27.82 | 0.512 |

	Our key findings are:
	- We increase the Object Accuracy (OA) score by 26.86% relative to SD 2.1, indicating that we are much better at generating the objects mentioned in the input prompt.
	- A VISOR-4 score of 16.15% denotes that, for a given input prompt, we consistently generate a spatially accurate image.
	- We improve on all aspects of the VISOR score while improving the ZS-FID and CMMD scores on COCO-30K images by 23.74% and 51.69%, respectively.
	- We enhance the ability to generate one and two objects, along with generating the correct number of objects, as indicated by evaluation on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
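
The relative OA and CMMD improvements quoted above can be checked against the table with a short calculation. The helper name `relative_improvement` is introduced here for illustration; the input values are copied from the table.

```python
def relative_improvement(baseline: float, ours: float, higher_is_better: bool = True) -> float:
    """Percentage improvement of `ours` over `baseline`."""
    delta = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * delta / baseline


# OA (%): higher is better.
oa_gain = relative_improvement(47.83, 60.68)
# CMMD: lower is better.
cmmd_gain = relative_improvement(1.060, 0.512, higher_is_better=False)

print(f"OA: {oa_gain:.2f}%  CMMD: {cmmd_gain:.2f}%")
```

Both results agree with the quoted 26.86% and 51.69% to within rounding of the table values.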

	## Model Sources

	- Repository: [SPRIGHT-T2I GitHub Repository](https://github.com/orgs/SPRIGHT-T2I)
	- Paper: [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://)
	- Demo: [SPRIGHT-T2I on Spaces](https://huggingface.co/spaces/SPRIGHT-T2I/SPRIGHT-T2I)

	## Citation

	Coming soon