Zongxia committed on
Commit
ca5be03
1 Parent(s): aa1723d

answer equivalence tiny Bert

README.md CHANGED
@@ -1,3 +1,263 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ inference: false
3
+ license: mit
4
+ language:
5
+ - en
6
+ metrics:
7
+ - exact_match
8
+ - f1
9
+ - bertscore
10
+ pipeline_tag: text-classification
11
+ ---
12
+ # QA-Evaluation-Metrics
13
+
14
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
+
17
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides several simple, efficient metrics for assessing the performance of QA models.
18
+
19
+ ### Updates
20
+ - Updated to version 0.2.8
21
+ - Now supports prompting OpenAI GPT-series models and Claude-series models. (Assuming OpenAI version > 1.0)
22
+ - Supports prompting various open-source models such as LLaMA-2-70B-chat, LLaVA-1.5, etc., by calling the API from [deepinfra](https://deepinfra.com/models).
23
+
24
+
25
+ ## Installation
26
+ * Python version >= 3.6
27
+ * openai version >= 1.0
28
+
29
+
30
+ To install the package, run the following command:
31
+
32
+ ```bash
33
+ pip install qa-metrics
34
+ ```
35
+
36
+ ## Usage/Logistics
37
+
38
+ The Python package currently provides six QA evaluation methods.
39
+ - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise.
40
+ - Evaluation methods differ in how strictly they judge the correctness of a candidate answer. Some correlate better with human judgments than others.
41
+ - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, Hotpot QA, TriviaQA, SQuAD, etc.
42
+ - Question/Answer Type Evaluation and Transformer Neural Evaluation are free to run and suit both short-form and longer-form QA datasets. They correlate better with human judgments than exact match and F1 score when the gold and candidate answers are long.
43
+ - Black-box LLM evaluations are closest to human evaluations, but they are not cost-free.
44
+
45
+ ## Normalized Exact Match
46
+ #### `em_match`
47
+
48
+ Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
49
+
50
+ **Parameters**
51
+
52
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
53
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
54
+
55
+ **Returns**
56
+
57
+ - `boolean`: True if the candidate answer exactly matches any reference answer after normalization, False otherwise.
58
+
59
+ ```python
60
+ from qa_metrics.em import em_match
61
+
62
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
63
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
64
+ match_result = em_match(reference_answer, candidate_answer)
65
+ print("Exact Match: ", match_result)
66
+ '''
67
+ Exact Match: False
68
+ '''
69
+ ```
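The matching rule described above can be sketched in a few lines. This is only an illustration, assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the package's own `em_match` may normalize differently. The names `normalize_answer` and `exact_match_any` are hypothetical and not part of `qa_metrics`.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily what qa_metrics does.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_any(references: List[str], candidate: str) -> bool:
    """Return True if the normalized candidate equals any normalized reference."""
    normalized_candidate = normalize_answer(candidate)
    return any(normalize_answer(ref) == normalized_candidate for ref in references)


references = ["The Frog Prince", "The Princess and the Frog"]
print(exact_match_any(references, "the princess and the frog!"))
print(exact_match_any(references, 'The movie "The Princess and the Frog" is based on "Iron Henry"'))
```

Under this scheme a long, sentence-style candidate fails even when it contains a gold answer verbatim, which is why exact match is best suited to short-form answers.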
70
+
71
+ ## F1 Score
72
+ #### `f1_score_with_precision_recall`
73
+
74
+ Calculates F1 score, precision, and recall between a reference and a candidate answer.
75
+
76
+ **Parameters**
77
+
78
+ - `reference_answer` (str): A gold (correct) answer to the question.
79
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
80
+
81
+ **Returns**
82
+
83
+ - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
84
+
85
+ ```python
86
+ from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
87
+
88
+ f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
89
+ print("F1 stats: ", f1_stats)
90
+ '''
91
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
92
+ '''
93
+
94
+ match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
95
+ print("F1 Match: ", match_result)
96
+ '''
97
+ F1 Match: False
98
+ '''
99
+ ```
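For readers unfamiliar with token-level F1, the sketch below shows the conventional bag-of-tokens formulation, where precision is measured against the candidate's tokens and recall against the reference's tokens. It is an assumption-laden illustration: the tokenizer is a simplified one (lowercase, strip punctuation, split on whitespace), the `>=` threshold comparison is a guess, and the package may tokenize or define precision and recall differently, so its numbers need not match this sketch. The names `tokenize`, `token_f1`, and `f1_match_any` are hypothetical.

```python
import string
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Simplified tokenizer (an assumption): lowercase, strip punctuation, split."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def token_f1(reference: str, candidate: str) -> Dict[str, float]:
    """Bag-of-tokens precision, recall, and F1 between one reference and one candidate."""
    ref_tokens = tokenize(reference)
    cand_tokens = tokenize(candidate)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def f1_match_any(references: List[str], candidate: str, threshold: float = 0.5) -> bool:
    """True if the candidate reaches the F1 threshold against any reference."""
    return any(token_f1(ref, candidate)["f1"] >= threshold for ref in references)


print(token_f1("Paris", "The capital is Paris"))
print(f1_match_any(["Paris"], "The capital is Paris", threshold=0.5))
```

Here one of the four candidate tokens overlaps the single reference token, so precision is 1/4 and recall is 1, and the candidate does not clear a 0.5 threshold.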
100
+
101
+ ## Efficient and Robust Question/Answer Type Evaluation
102
+ #### 1. `get_highest_score`
103
+
104
+ Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
105
+
106
+ **Parameters**
107
+
108
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
109
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
110
+ - `question` (str): The question for which the answers are being evaluated.
111
+
112
+ **Returns**
113
+
114
+ - `tuple`: The (gold answer, candidate answer) pair with the highest matching score, together with that score, as unpacked in the example below.
115
+
116
+ #### 2. `get_scores`
117
+
118
+ Returns the matching scores for all gold answer and candidate answer pairs.
119
+
120
+ **Parameters**
121
+
122
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
123
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
124
+ - `question` (str): The question for which the answers are being evaluated.
125
+
126
+ **Returns**
127
+
128
+ - `dictionary`: A dictionary mapping each gold answer to the candidate answer's matching score against it.
129
+
130
+ #### 3. `evaluate`
131
+
132
+ Returns True if the candidate answer matches any of the gold answers.
133
+
134
+ **Parameters**
135
+
136
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
137
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
138
+ - `question` (str): The question for which the answers are being evaluated.
139
+
140
+ **Returns**
141
+
142
+ - `boolean`: True if the candidate answer matches any reference answer, False otherwise.
143
+
144
+
145
+ ```python
146
+ from qa_metrics.pedant import PEDANT
147
+
148
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
149
+ pedant = PEDANT()
150
+ scores = pedant.get_scores(reference_answer, candidate_answer, question)
151
+ max_pair, highest_scores = pedant.get_highest_score(reference_answer, candidate_answer, question)
152
+ match_result = pedant.evaluate(reference_answer, candidate_answer, question)
153
+ print("Max Pair: %s; Highest Score: %s" % (max_pair, highest_scores))
154
+ print("Score: %s; PANDA Match: %s" % (scores, match_result))
155
+ '''
156
+ Max Pair: ('the princess and the frog', 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'); Highest Score: 0.854451712151719
157
+ Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PANDA Match: True
158
+ '''
159
+ ```
160
+
161
+ ```python
162
+ print(pedant.get_score(reference_answer[1], candidate_answer, question))
163
+ '''
164
+ 0.7122460127464126
165
+ '''
166
+ ```
167
+
168
+ ## Transformer Neural Evaluation
169
+ Our fine-tuned BERT model is on 🤗 [Hugging Face](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our package also supports downloading the model and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
170
+
171
+ #### `transformer_match`
172
+
173
+ Returns True if the candidate answer matches any of the gold answers.
174
+
175
+ **Parameters**
176
+
177
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
178
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
179
+ - `question` (str): The question for which the answers are being evaluated.
180
+
181
+ **Returns**
182
+
183
+ - `boolean`: True if the candidate answer matches any reference answer, False otherwise.
184
+
185
+ ```python
186
+ from qa_metrics.transformerMatcher import TransformerMatcher
187
+
188
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
189
+ # Supported models: roberta-large, roberta, bert, distilbert, distilroberta
190
+ tm = TransformerMatcher("roberta-large")
191
+ scores = tm.get_scores(reference_answer, candidate_answer, question)
192
+ match_result = tm.transformer_match(reference_answer, candidate_answer, question)
193
+ print("Score: %s; TM Match: %s" % (scores, match_result))
194
+ '''
195
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
196
+ '''
197
+ ```
198
+
199
+ ## Prompting LLM For Evaluation
200
+
201
+ Note: The prompting functions can be used for any prompting purpose, not only evaluation.
202
+
203
+ ###### OpenAI
204
+ ```python
205
+ from qa_metrics.prompt_llm import CloseLLM
206
+ model = CloseLLM()
207
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
208
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
209
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
210
+
211
+ '''
212
+ 'correct'
213
+ '''
214
+ ```
215
+
216
+ ###### Anthropic
217
+ ```python
218
+ model = CloseLLM()
219
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
220
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
221
+
222
+ '''
223
+ 'correct'
224
+ '''
225
+ ```
226
+
227
+ ###### deepinfra (See below for descriptions of more models)
228
+ ```python
229
+ from qa_metrics.prompt_open_llm import OpenLLM
230
+ model = OpenLLM()
231
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
232
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
233
+
234
+ '''
235
+ 'correct'
236
+ '''
237
+ ```
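The prompts above are plain strings, and the models reply with free text, so a little glue code helps when evaluating many examples. The sketch below builds the same prompt used in the OpenAI example and maps a reply to a boolean. It makes no API calls; `build_judge_prompt` and `parse_verdict` are hypothetical helpers, not part of `qa_metrics`, and the parsing rule is an assumption about how a model will phrase its answer.

```python
import re
from typing import Optional


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble the judging prompt in the format shown in the OpenAI example."""
    return (
        f"question: {question}\n"
        f"reference: {reference}\n"
        f"candidate: {candidate}\n"
        "Is the candidate answer correct based on the question and reference answer? "
        "Please only output correct or incorrect."
    )


def parse_verdict(response: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it contains neither verdict.

    Whole words are compared so that "incorrect" is not mistaken for "correct".
    """
    words = re.findall(r"[a-z]+", response.lower())
    if "incorrect" in words:
        return False
    if "correct" in words:
        return True
    return None


prompt = build_judge_prompt("What is the Capital of France?", "Paris", "The capital is Paris")
print(prompt)
print(parse_verdict("correct"))
```

Returning `None` for unparseable replies keeps malformed outputs visible instead of silently counting them as incorrect.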
238
+
239
+ If you find this repo useful, please cite our paper:
240
+ ```bibtex
241
+ @misc{li2024panda,
242
+ title={PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation},
243
+ author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
244
+ year={2024},
245
+ eprint={2402.11161},
246
+ archivePrefix={arXiv},
247
+ primaryClass={cs.CL}
248
+ }
249
+ ```
250
+
251
+
252
+ ## Updates
253
+ - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset is expanded and leaderboard is updated.
254
+ - Our training dataset is adapted and augmented from [Bulian et al.](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and the QA evaluation test sets discussed in our paper.
255
+ - Our package now supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta) and [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), smaller and more robust matching models than BERT!
256
+
257
+ ## License
258
+
259
+ This project is licensed under the [MIT License](LICENSE.md) - see the LICENSE file for details.
260
+
261
+ ## Contact
262
+
263
+ For any additional questions or comments, please contact [zli12321@umd.edu].
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "/srv/www/active-topic-modeling/ae_tune/models--google--bert_uncased_L-2_H-128_A-2/snapshots/30b0a37ccaaa32f332884b96992754e246e48c5f",
3
+ "architectures": [
4
+ "BertForSequenceClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "hidden_act": "gelu",
9
+ "hidden_dropout_prob": 0.1,
10
+ "hidden_size": 128,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 512,
13
+ "layer_norm_eps": 1e-12,
14
+ "max_position_embeddings": 512,
15
+ "model_type": "bert",
16
+ "num_attention_heads": 2,
17
+ "num_hidden_layers": 2,
18
+ "pad_token_id": 0,
19
+ "position_embedding_type": "absolute",
20
+ "problem_type": "single_label_classification",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.37.2",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
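To get a feel for how small this configuration is, the sketch below estimates the number of trainable parameters from the values in `config.json`. It assumes the standard BERT layout (embeddings with LayerNorm, four attention projections and a two-layer feed-forward block per encoder layer, a pooler, and a linear classification head). The number of labels does not appear in the config, so `NUM_LABELS = 2` is an assumption.

```python
# Values copied from config.json above.
config = {
    "vocab_size": 30522,
    "hidden_size": 128,
    "intermediate_size": 512,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "num_hidden_layers": 2,
}
NUM_LABELS = 2  # assumption: not stated in the config


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def bert_classifier_params(cfg: dict, num_labels: int) -> int:
    h, inter = cfg["hidden_size"], cfg["intermediate_size"]
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm(h)
    per_layer = (
        4 * linear(h, h)      # query, key, value, attention output
        + layer_norm(h)
        + linear(h, inter)    # feed-forward expansion
        + linear(inter, h)    # feed-forward projection
        + layer_norm(h)
    )
    pooler = linear(h, h)
    classifier = linear(h, num_labels)
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler + classifier


total = bert_classifier_params(config, NUM_LABELS)
print(f"{total:,} parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

Most of the count comes from the token embedding table (30522 × 128), which is typical for very small BERT variants.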
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e3fcca05ef67be24f3ab588c63ac2f46cb6da5d84751fd0e8c6a25a246c4016
3
+ size 133
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": true,
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 1000000000000000019884624838656,
50
+ "never_split": null,
51
+ "pad_token": "[PAD]",
52
+ "sep_token": "[SEP]",
53
+ "strip_accents": null,
54
+ "tokenize_chinese_chars": true,
55
+ "tokenizer_class": "BertTokenizer",
56
+ "unk_token": "[UNK]"
57
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff