The SPRIGHT-T2I model is a text-to-image diffusion model with high spatial coherency. It was first introduced in [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://), authored by Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang.

The SPRIGHT-T2I model was fine-tuned from Stable Diffusion v2.1 on a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images and spatially focused captions. Leveraging SPRIGHT, along with efficient training techniques, we achieve state-of-the-art performance in generating spatially accurate images from text.
The training code and more details are available in the [SPRIGHT-T2I GitHub repository](https://github.com/orgs/SPRIGHT-T2I).
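For quick reference, the snippet below is a minimal text-to-image inference sketch using 🤗 diffusers. The model id is a placeholder for this repository's Hub id, and the fp16/CUDA settings are assumptions; adapt them to your setup.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder id: replace with this repository's actual Hub id.
model_id = "SPRIGHT-T2I/spright-t2i-sd2"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# A spatially grounded prompt; the model is tuned to respect relations such as "in", "on", or "to the left of".
prompt = "a kitten sitting in a dish"
image = pipe(prompt).images[0]
image.save("kitten_sitting_in_a_dish.png")
```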
Additional examples that emphasize spatial coherence:

<img src="result_images/visor.png" width="1000" alt="img">
## Bias and Limitations

The biases and limitations specified for [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) apply here as well.

## Training
#### Training Data

Our training and validation sets are a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright) and consist of 444 and 50 images, respectively, randomly sampled in a 50:50 split between LAION-Aesthetics and Segment Anything. Each image is paired with both a general and a spatial caption (from SPRIGHT). During fine-tuning, for each image we randomly choose one of the two caption types in a 50:50 ratio.
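As an illustration of this caption sampling, the sketch below picks one of the two caption types per image with equal probability; the field names are assumed for illustration and are not necessarily the dataset's actual column names.

```python
import random

def pick_caption(example: dict) -> str:
    """Return the general or the spatial (SPRIGHT) caption with 50:50 probability.

    Field names ("general_caption", "spright_caption") are illustrative only.
    """
    if random.random() < 0.5:
        return example["general_caption"]
    return example["spright_caption"]

# Example: during fine-tuning, each image's training caption is drawn this way.
caption = pick_caption({
    "general_caption": "a kitten in a dish",
    "spright_caption": "a kitten sitting inside a dish placed on a table",
})
```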
Additionally, we find that training on images containing a large number of objects improves spatial consistency.

To construct our dataset, we focused on images with object counts larger than 18, using the open-world image tagging model [Recognize Anything](https://huggingface.co/xinyu1205/recognize-anything-plus-model) to enforce this constraint.
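A minimal sketch of this filtering step is shown below. The `count_objects` helper is hypothetical and stands in for running the Recognize Anything Plus tagger and counting the returned tags; it is not the project's actual data pipeline.

```python
from pathlib import Path

OBJECT_COUNT_THRESHOLD = 18  # keep images with more than 18 detected objects

def count_objects(image_path: Path) -> int:
    """Hypothetical wrapper: run the Recognize Anything Plus tagger on the image
    and return the number of distinct tags it predicts."""
    raise NotImplementedError("plug in the RAM++ inference call here")

def select_images(image_dir: str) -> list[Path]:
    """Keep only images whose predicted object count exceeds the threshold."""
    return [
        path
        for path in Path(image_dir).glob("*.jpg")
        if count_objects(path) > OBJECT_COUNT_THRESHOLD
    ]
```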
#### Training Procedure

Our base model is Stable Diffusion v2.1. We fine-tune both the U-Net and the OpenCLIP-ViT/H text encoder for 10,000 steps, using a different learning rate for each, as listed and sketched below.
- **Batch size:** 4 x 8 = 32
- **UNet learning rate:** 0.00005
- **CLIP text encoder learning rate:** 0.000001
- **Hardware:** Training was performed using NVIDIA RTX A6000 GPUs and Intel® Gaudi® 2 AI accelerators.
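To make the two-learning-rate setup concrete, here is a minimal optimizer sketch; it is an illustrative reconstruction based on the hyperparameters above, not the repository's actual training code.

```python
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

base = "stabilityai/stable-diffusion-2-1"

# Load the two modules that are fine-tuned; the VAE and scheduler are left untouched here.
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")

# One optimizer, two parameter groups with the learning rates listed above.
optimizer = torch.optim.AdamW(
    [
        {"params": unet.parameters(), "lr": 5e-5},           # U-Net
        {"params": text_encoder.parameters(), "lr": 1e-6},   # CLIP text encoder
    ]
)

# The fine-tuning loop would then run for 10,000 steps at an effective batch size of 32.
```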
## Evaluation