---
license: apache-2.0
datasets:
- kaist-ai/Perception-Collection
- kaist-ai/Perception-Bench
language:
- en
metrics:
- pearsonr
- spearmanr
library_name: transformers
pipeline_tag: image-to-text
tags:
- Image-to-Text
- Visual Question Answering
- Text2Text Generation
---
## Links for Reference
- **Homepage:**
- **Repository:** https://github.com/kaistAI/prometheus-vision
- **Paper:** https://arxiv.org/abs/2401.06591
- **Point of Contact:** seongyun@kaist.ac.kr
# TL;DR
Prometheus-Vision is the first open-source VLM specialized for evaluation. It shows a high correlation with both GPT-4V and human evaluators, indicating its potential as an inexpensive alternative to GPT-4V-based evaluation.
![image/png](./prometheus_vision.png)
# Model Details

## Model Description
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Related Models:** [All Prometheus Checkpoints](https://huggingface.co/models?search=kaist-ai/Prometheus-Vision)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/abs/2401.06591)
  - [GitHub Repo](https://github.com/kaistAI/prometheus-vision)

Prometheus-Vision is trained in two sizes (7B and 13B).
You can find the 13B VLM on [this page](https://huggingface.co/kaist-ai/prometheus-vision-13b-v1.0).
Also, check out our dataset on [this page](https://huggingface.co/datasets/kaist-ai/Perception-Collection).
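
As a quick, non-official sketch, the Perception Collection can be inspected with the `datasets` library (assuming it loads with its default configuration):

```python
from datasets import load_dataset

# Assumption: Perception-Collection loads with its default configuration;
# inspect the split names, sizes, and columns after loading.
dataset = load_dataset("kaist-ai/Perception-Collection")
for split_name, split in dataset.items():
    print(split_name, split.num_rows, split.column_names)
```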
## Prompt Format
Prometheus-Vision requires five components in its input: an image, an instruction, a response to evaluate, a score rubric, and a reference answer. Refer to the prompt format below.
Fill in the instruction, response, reference answer, criteria description, and a score description for each score from 1 to 5.
```
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, an image and a score rubric representing an evaluation criterion is given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[{criteria_description}]
Score 1: {score1_description}
Score 2: {score2_description}
Score 3: {score3_description}
Score 4: {score4_description}
Score 5: {score5_description}

###Feedback:
```
The model produces output in the following format. During inference, you can parse the score by splitting on the [RESULT] phrase and taking the integer generated right after it.
```
{orig_feedback}
[RESULT] {orig_score}
```
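As a minimal, non-official sketch of this workflow: `PROMPT_TEMPLATE` is assumed to hold the template shown above (e.g. copied verbatim into a local text file with a hypothetical name), the rubric values are made-up examples, and `parse_score` is our own helper rather than part of the released code.

```python
import re

# Assumption: you saved the prompt template shown above verbatim to this (hypothetical) file.
PROMPT_TEMPLATE = open("prometheus_vision_template.txt").read()

# Hypothetical evaluation fields; replace them with your own instruction, response,
# reference answer, and score rubric.
prompt = PROMPT_TEMPLATE.format(
    instruction="What landmark is shown in the image?",
    response="The image shows the Eiffel Tower in Paris.",
    reference_answer="The landmark shown in the image is the Eiffel Tower in Paris, France.",
    criteria_description="Does the response correctly identify the landmark in the image?",
    score1_description="The response names an unrelated object or no landmark at all.",
    score2_description="The response names a landmark, but the wrong one.",
    score3_description="The response identifies only the city or gives a partial answer.",
    score4_description="The response identifies the landmark with minor inaccuracies.",
    score5_description="The response correctly and unambiguously identifies the landmark.",
)

def parse_score(output: str):
    """Split on [RESULT] and read the integer generated right after it (None if missing)."""
    if "[RESULT]" not in output:
        return None
    tail = output.split("[RESULT]")[-1]
    match = re.search(r"[1-5]", tail)
    return int(match.group()) if match else None

print(parse_score("The response correctly identifies the landmark ... [RESULT] 5"))  # -> 5
```
The filled-in prompt is what goes into the `text` field of the question file consumed by the inference script in the Usage section below.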
## License
Perception Collection and Prometheus-Vision are subject to OpenAI's Terms of Use for the generated data. If you suspect any violations, please reach out to us.
# Usage
Below are example scripts showing how to use the model with `transformers`:
## Using the PyTorch model
### Running the model on a GPU
<details>
<summary> Click to expand </summary>

```python
import argparse
import torch
import os
import json
from tqdm import tqdm
import shortuuid

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

from PIL import Image
import math


def split_list(lst, n):
    """Split a list into n (roughly) equal-sized chunks"""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division so no item is dropped
    return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst, n, k):
    chunks = split_list(lst, n)
    return chunks[k]


def eval_model(args):
    # Load the Prometheus-Vision checkpoint with the LLaVA-v1.5 architecture
    disable_torch_init()
    model_path = 'kaist-ai/prometheus-vision-7b-v1.0'  # hardcoded; the --model-path argument is not used here
    model_name = 'llava-v1.5'
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)

    # Read the evaluation prompts (one JSON object per line) and keep only this worker's chunk
    questions = [json.loads(q) for q in open(os.path.expanduser(args.question_file), "r")]
    questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
    answers_file = os.path.expanduser(args.answers_file)
    os.makedirs(os.path.dirname(answers_file), exist_ok=True)
    ans_file = open(answers_file, "w")
    for line in tqdm(questions):
        idx = line["question_id"]
        image_file = line["image"]
        qs = line["text"]
        cur_prompt = qs
        # Prepend the image token(s) expected by the vision tower
        if model.config.mm_use_im_start_end:
            qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + '\n' + qs

        conv = conv_templates[args.conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

        image = Image.open(os.path.join(args.image_folder, image_file))
        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=image_tensor.unsqueeze(0).half().cuda(),
                do_sample=True if args.temperature > 0 else False,
                temperature=args.temperature,
                top_p=args.top_p,
                num_beams=args.num_beams,
                # no_repeat_ngram_size=3,
                max_new_tokens=1024,
                use_cache=True)

        # Decode only the newly generated tokens and strip the stop string
        input_token_len = input_ids.shape[1]
        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
        if n_diff_input_output > 0:
            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
        outputs = outputs.strip()
        if outputs.endswith(stop_str):
            outputs = outputs[:-len(stop_str)]
        outputs = outputs.strip()

        # Write one JSON line per judgment
        ans_id = shortuuid.uuid()
        ans_file.write(json.dumps({"question_id": idx,
                                   "prompt": cur_prompt,
                                   "text": outputs,
                                   "answer_id": ans_id,
                                   "model_id": model_name,
                                   "metadata": {}}) + "\n")
        ans_file.flush()
    ans_file.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    args = parser.parse_args()

    eval_model(args)
```
</details>

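The script above reads a JSONL question file in which each line provides a `question_id`, an `image` filename (resolved against `--image-folder`), and the filled-in evaluation prompt as `text`. A minimal sketch of preparing such a file (the entry values are hypothetical):

```python
import json

# Hypothetical entry: "text" should be a fully filled-in evaluation prompt
# (see the sketch in the Prompt Format section above).
entry = {
    "question_id": "example-0001",
    "image": "eiffel_tower.jpg",          # resolved against --image-folder by the script
    "text": "###Task Description: ...",   # placeholder for the filled-in prompt
}

with open("question.jsonl", "w") as f:
    f.write(json.dumps(entry) + "\n")
```
Pass the file via `--question-file question.jsonl` when running the script; each line of the resulting answers file then carries the model's feedback in its `text` field, from which the score can be parsed as shown earlier.
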
# Citation

If you find our model helpful, please consider citing our paper!

**BibTeX:**

```bibtex
@misc{lee2024prometheusvision,
      title={Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation},
      author={Seongyun Lee and Seungone Kim and Sue Hyun Park and Geewook Kim and Minjoon Seo},
      year={2024},
      eprint={2401.06591},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```