The SPRIGHT-T2I model is a text-to-image diffusion model with high spatial coherency. It was first introduced in [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://), authored by Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang.

The SPRIGHT-T2I model was fine-tuned from Stable Diffusion v2.1 on a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images and spatially focused captions. Leveraging SPRIGHT, along with efficient training techniques, we achieve state-of-the-art performance in generating spatially accurate images from text.
The training code and more details are available in the [SPRIGHT-T2I GitHub repository](https://github.com/orgs/SPRIGHT-T2I).
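For quick reference, the snippet below is a minimal text-to-image inference sketch using 🤗 diffusers. The model id is a placeholder for this repository's Hub id, and the fp16/CUDA settings are assumptions; adapt them to your setup.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder id: replace with this repository's actual Hub id.
model_id = "SPRIGHT-T2I/spright-t2i-sd2"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# A spatially grounded prompt; the model is tuned to respect relations such as "in", "on", or "to the left of".
prompt = "a kitten sitting in a dish"
image = pipe(prompt).images[0]
image.save("kitten_sitting_in_a_dish.png")
```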
Additional examples that emphasize spatial coherence:

<img src="result_images/visor.png" width="1000" alt="img">
## Bias and Limitations

The biases and limitations specified for [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) apply here as well.

## Training
#### Training Data

Our training and validation sets are a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright) and consist of 444 and 50 images, respectively, randomly sampled in a 50:50 split between LAION-Aesthetics and Segment Anything. Each image is paired with both a general and a spatial caption (from SPRIGHT). During fine-tuning, for each image we randomly choose one of the two caption types in a 50:50 ratio.
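As an illustration of this caption sampling, the sketch below picks one of the two caption types per image with equal probability; the field names are assumed for illustration and are not necessarily the dataset's actual column names.

```python
import random

def pick_caption(example: dict) -> str:
    """Return the general or the spatial (SPRIGHT) caption with 50:50 probability.

    Field names ("general_caption", "spright_caption") are illustrative only.
    """
    if random.random() < 0.5:
        return example["general_caption"]
    return example["spright_caption"]

# Example: during fine-tuning, each image's training caption is drawn this way.
caption = pick_caption({
    "general_caption": "a kitten in a dish",
    "spright_caption": "a kitten sitting inside a dish placed on a table",
})
```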
Additionally, we find that training on images containing a large number of objects improves spatial consistency.

To construct our dataset, we focused on images with object counts larger than 18, using the open-world image tagging model [Recognize Anything](https://huggingface.co/xinyu1205/recognize-anything-plus-model) to enforce this constraint.
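A minimal sketch of this filtering step is shown below. The `count_objects` helper is hypothetical and stands in for running the Recognize Anything Plus tagger and counting the returned tags; it is not the project's actual data pipeline.

```python
from pathlib import Path

OBJECT_COUNT_THRESHOLD = 18  # keep images with more than 18 detected objects

def count_objects(image_path: Path) -> int:
    """Hypothetical wrapper: run the Recognize Anything Plus tagger on the image
    and return the number of distinct tags it predicts."""
    raise NotImplementedError("plug in the RAM++ inference call here")

def select_images(image_dir: str) -> list[Path]:
    """Keep only images whose predicted object count exceeds the threshold."""
    return [
        path
        for path in Path(image_dir).glob("*.jpg")
        if count_objects(path) > OBJECT_COUNT_THRESHOLD
    ]
```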
#### Training Procedure

Our base model is Stable Diffusion v2.1. We fine-tune both the U-Net and the OpenCLIP-ViT/H text encoder for 10,000 steps, using a different learning rate for each, as listed and sketched below.
- **Batch size:** 4 x 8 = 32
- **UNet learning rate:** 0.00005
- **CLIP text encoder learning rate:** 0.000001
- **Hardware:** Training was performed using NVIDIA RTX A6000 GPUs and Intel® Gaudi® 2 AI accelerators.
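To make the two-learning-rate setup concrete, here is a minimal optimizer sketch; it is an illustrative reconstruction based on the hyperparameters above, not the repository's actual training code.

```python
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

base = "stabilityai/stable-diffusion-2-1"

# Load the two modules that are fine-tuned; the VAE and scheduler are left untouched here.
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")

# One optimizer, two parameter groups with the learning rates listed above.
optimizer = torch.optim.AdamW(
    [
        {"params": unet.parameters(), "lr": 5e-5},           # U-Net
        {"params": text_encoder.parameters(), "lr": 1e-6},   # CLIP text encoder
    ]
)

# The fine-tuning loop would then run for 10,000 steps at an effective batch size of 32.
```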
## Evaluation