# ImageReward

ImageReward is the first general-purpose text-to-image human preference reward model (RM). It is trained on a total of 137k pairs of expert comparisons, based on text prompts and corresponding model outputs from DiffusionDB. Through extensive analysis and experiments, we demonstrate that ImageReward outperforms existing text-image scoring methods, such as CLIP, Aesthetic, and BLIP, in understanding human preference in text-to-image synthesis.
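The quality of such a reward model is measured by preference accuracy: the fraction of expert pairwise comparisons on which the model's scores agree with the annotator (this is the metric reported in the Test section below). A minimal pure-Python sketch of the metric, using hypothetical scores and comparison pairs rather than real data:

```python
def preference_accuracy(scores, comparisons):
    """Fraction of pairwise comparisons where the higher-scored image
    is the one the human annotator preferred.

    scores: dict mapping image id -> scalar reward
    comparisons: list of (preferred_id, other_id) annotator pairs
    """
    agree = sum(1 for win, lose in comparisons if scores[win] > scores[lose])
    return agree / len(comparisons)

# Toy example with hypothetical scores and comparisons (not real data)
scores = {"1.webp": 0.58, "2.webp": 0.27, "3.webp": -1.41, "4.webp": -2.03}
comparisons = [("1.webp", "2.webp"), ("2.webp", "3.webp"), ("4.webp", "3.webp")]
print(preference_accuracy(scores, comparisons))  # 2 of 3 pairs agree -> 0.666...
```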

## Approach

![ImageReward](ImageReward.png)

## Setup

* Environment: install dependencies via `pip install -r requirements.txt`.

## Usage

```python
import os
import torch
import ImageReward as reward

if __name__ == "__main__":
    prompt = "a painting of an ocean with clouds and birds, day time, low depth field effect"
    img_prefix = "assets/images"
    generations = [f"{pic_id}.webp" for pic_id in range(1, 5)]
    img_list = [os.path.join(img_prefix, img) for img in generations]

    # Load the pre-trained ImageReward model
    model = reward.load()
    with torch.no_grad():
        # Rank all candidate images for the prompt
        ranking, rewards = model.inference_rank(prompt, img_list)

        # Print the result
        print("\nPreference predictions:\n")
        print(f"ranking = {ranking}")
        print(f"rewards = {rewards}")

        # Score each image individually against the prompt
        for index in range(len(img_list)):
            score = model.score(prompt, img_list[index])
            print(f"{generations[index]:>16s}: {score:.2f}")
```

The output will look like the following (the exact numbers may differ slightly depending on the compute device):

```
Preference predictions:

ranking = [1, 2, 3, 4]
rewards = [[0.5811622738838196], [0.2745276093482971], [-1.4131819009780884], [-2.029569625854492]]
1.webp: 0.58
2.webp: 0.27
3.webp: -1.41
4.webp: -2.03
```
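As a sanity check, the `ranking` above is simply the 1-based rank of each image by its reward (higher reward, better rank). A short sketch recovering it from the printed `rewards` list:

```python
# Rewards as printed in the sample output above
rewards = [[0.5811622738838196], [0.2745276093482971],
           [-1.4131819009780884], [-2.029569625854492]]
flat = [r[0] for r in rewards]  # unwrap the per-image singleton lists

# Rank 1 goes to the highest reward
order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
ranking = [order.index(i) + 1 for i in range(len(flat))]
print(ranking)  # [1, 2, 3, 4]
```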

## Test

### Setup for baselines

#### Environment

```bash
$ pip install git+https://github.com/openai/CLIP.git
```

#### Checkpoint

Models | File Paths | Download Links
--- | :---: | :---:
ImageReward | checkpoint/ | <a href="https://huggingface.co/THUDM/ImageReward/blob/main/ImageReward.pt">Download</a>
CLIP Score | checkpoint/clip/ | <a href="https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt">Download</a>
BLIP Score | checkpoint/blip/ | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth">Download</a>
Aesthetic | checkpoint/aesthetic/ | <a href="https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac%2Blogos%2Bava1-l14-linearMSE.pth">Download</a>
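The "File Paths" column gives the directory each checkpoint is expected in. A minimal sketch of creating that layout before placing the downloaded files (paths taken from the table above; adjust if your clone uses different locations):

```shell
# Create the checkpoint directories listed in the table
mkdir -p checkpoint/clip checkpoint/blip checkpoint/aesthetic
# Then save each downloaded file into its matching directory, e.g.
# ImageReward.pt -> checkpoint/, ViT-L-14.pt -> checkpoint/clip/, ...
```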

#### Data

Data | File Paths | Download Links
--- | :---: | :---:
test_images | data/ | <a href="https://huggingface.co/THUDM/ImageReward/blob/main/test_images.zip">Download</a>

Download `test_images.zip` and unzip it to `data/test_images/`.

### Run the test

```bash
$ python test.py
```

The test results are:

Models | Preference Acc. (%)
--- | :---:
CLIP Score | 54.82
Aesthetic Score | 57.35
BLIP Score | 57.76
ImageReward (Ours) | **65.14**