Zongxia committed on
Commit ae256ae • 1 Parent(s): a009b5e

Update README.md

Files changed (1)
  1. README.md +144 -54
README.md CHANGED
@@ -14,7 +14,7 @@ pipeline_tag: text-classification
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
 
17
- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models. It provides various efficient and basic metrics to assess the performance of QA models.
18
 
19
  ### Updates
20
  - Updated to version 0.2.8
@@ -33,51 +33,29 @@ To install the package, run the following command:
33
  pip install qa-metrics
34
  ```
35
 
36
- ## Usage
37
 
38
- The python package currently provides six QA evaluation methods.
39
 
40
- #### Prompting LLM For Evaluation
 
41
 
42
- Note: The prompting function can be used for any prompting purposes.
43
 
44
- ###### OpenAI
45
- ```python
46
- from qa_metrics.prompt_llm import CloseLLM
47
- model = CloseLLM()
48
- model.set_openai_api_key(YOUR_OPENAI_KEY)
49
- prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
50
- model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
51
 
52
- '''
53
- 'correct'
54
- '''
55
- ```
56
 
57
- ###### Anthropic
58
- ```python
59
- model = CloseLLM()
60
- model.set_anthropic_api_key(YOUR_Anthropic_KEY)
61
- model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
62
 
63
- '''
64
- 'correct'
65
- '''
66
- ```
67
 
68
- ###### deepinfra (See below for descriptions of more models)
69
- ```python
70
- from qa_metrics.prompt_open_llm import OpenLLM
71
- model = OpenLLM()
72
- model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
73
- model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
74
-
75
- '''
76
- 'correct'
77
- '''
78
- ```
79
-
80
- #### Exact Match
81
  ```python
82
  from qa_metrics.em import em_match
83
 
@@ -90,39 +68,80 @@ Exact Match: False
90
  '''
91
  ```
92
 
93
- #### Transformer Match
94
- Our fine-tuned BERT model is in this repository. Our package also supports downloading and matching directly. distilroberta, distilbert, and roberta are also supported now! 🔥🔥🔥
95
 
96
- ```python
97
- from qa_metrics.transformerMatcher import TransformerMatcher
98
 
99
- question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
100
- tm = TransformerMatcher("roberta")
101
- scores = tm.get_scores(reference_answer, candidate_answer, question)
102
- match_result = tm.transformer_match(reference_answer, candidate_answer, question)
103
- print("Score: %s; TM Match: %s" % (scores, match_result))
104
- '''
105
- Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.88954514}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.9381995}}; TM Match: True
106
- '''
107
- ```
108
 
 
109
 
110
- #### F1 Score
111
  ```python
112
  from qa_metrics.f1 import f1_match,f1_score_with_precision_recall
113
 
114
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
115
  print("F1 stats: ", f1_stats)
 
 
 
116
 
117
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
118
  print("F1 Match: ", match_result)
119
  '''
120
- F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
121
  F1 Match: False
122
  '''
123
  ```
124
 
125
- #### PANDA Match
126
  ```python
127
  from qa_metrics.pedant import PEDANT
128
 
@@ -146,6 +165,77 @@ print(pedant.get_score(reference_answer[1], candidate_answer, question))
146
  '''
147
  ```
148
149
  If you find this repo helpful, please cite our paper:
150
  ```bibtex
151
  @misc{li2024panda,
 
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
 
17
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides a variety of basic and efficient metrics to assess the performance of QA models.
18
 
19
  ### Updates
20
  - Updated to version 0.2.8
 
33
  pip install qa-metrics
34
  ```
35
 
36
+ ## Usage/Logistics
37
 
38
+ The Python package currently provides six QA evaluation methods.
39
+ - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise (see the minimal sketch after this list).
40
+ - Different evaluation methods vary in how strictly they judge the correctness of a candidate answer, and some correlate with human judgments better than others.
41
+ - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, HotpotQA, TriviaQA, SQuAD, etc.
42
+ - Question/Answer Type Evaluation and Transformer Neural Evaluation are cost-free and suitable for both short-form and longer-form QA datasets. They correlate with human judgments better than exact match and F1 score as the gold and candidate answers become longer.
43
+ - Black-box LLM evaluations are closest to human evaluations, but they are not cost-free.
44
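To make the shared contract concrete, here is a minimal sketch of the common call pattern, using `em_match` and `f1_match` as two instances of different strictness; the gold answers and candidate string below are placeholder data, not the package's own examples.

```python
from qa_metrics.em import em_match
from qa_metrics.f1 import f1_match

# Placeholder gold answers and candidate answer (illustrative only).
gold_answers = ["Paris", "City of Paris"]
candidate = "The capital of France is Paris."

# Same inputs, boolean output; the two metrics differ in strictness.
print("EM match:", em_match(gold_answers, candidate))
print("F1 match:", f1_match(gold_answers, candidate, threshold=0.5))
```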
 
45
+ ## Normalized Exact Match
46
+ #### `em_match`
47
 
48
+ Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
49
 
50
+ **Parameters**
51
 
52
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
53
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
 
 
54
 
55
+ **Returns**
56
 
57
+ - `boolean`: True if the candidate answer matches any of the reference answers, and False otherwise.
58
59
  ```python
60
  from qa_metrics.em import em_match
61
 
 
68
  '''
69
  ```
70
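Because the diff truncates the snippet above, here is a self-contained sketch of the same `em_match` call; the answer strings are placeholders rather than the original example data.

```python
from qa_metrics.em import em_match

# Placeholder gold answers and candidate answer (illustrative only).
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie is loosely based on The Princess and the Frog."

# True only if the normalized candidate exactly matches one of the normalized gold answers.
match_result = em_match(reference_answer, candidate_answer)
print("Exact Match:", match_result)
```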
 
71
+ ## F1 Score
72
+ #### `f1_score_with_precision_recall`
73
 
74
+ Calculates F1 score, precision, and recall between a reference and a candidate answer.
 
75
 
76
+ **Parameters**
77
+
78
+ - `reference_answer` (str): A gold (correct) answer to the question.
79
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
80
+
81
+ **Returns**
82
 
83
+ - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
84
 
 
85
  ```python
86
  from qa_metrics.f1 import f1_match,f1_score_with_precision_recall
87
 
88
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
89
  print("F1 stats: ", f1_stats)
90
+ '''
91
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
92
+ '''
93
 
94
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
95
  print("F1 Match: ", match_result)
96
  '''
 
97
  F1 Match: False
98
  '''
99
  ```
100
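For intuition, the sketch below shows how a token-overlap F1 is commonly computed: precision and recall each divide the shared-token count by one answer's token count, and F1 is their harmonic mean. It is illustrative only; qa-metrics applies its own tokenization and normalization, so its numbers may differ.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Illustrative token-overlap F1 between two answer strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Number of tokens shared between the two answers (with multiplicity).
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    # 2PR / (P + R) simplifies to 2 * overlap / (total token count).
    return 2 * overlap / (len(ref_tokens) + len(cand_tokens))

print(token_f1("The Frog Prince", "The movie is loosely based on The Frog Prince"))
```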
 
101
+ ## Efficient and Robust Question/Answer Type Evaluation
102
+ #### 1. `get_highest_score`
103
+
104
+ Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
105
+
106
+ **Parameters**
107
+
108
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
109
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
110
+ - `question` (str): The question for which the answers are being evaluated.
111
+
112
+ **Returns**
113
+
114
+ - `dictionary`: A dictionary containing the gold answer and candidate answer that have the highest matching score.
115
+
116
+ #### 2. `get_scores`
117
+
118
+ Returns the matching scores for all gold answer and candidate answer pairs.
119
+
120
+ **Parameters**
121
+
122
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
123
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
124
+ - `question` (str): The question for which the answers are being evaluated.
125
+
126
+ **Returns**
127
+
128
+ - `dictionary`: A dictionary containing the matching score between each gold answer and the candidate answer.
129
+
130
+ #### 3. `evaluate`
131
+
132
+ Returns True if the candidate answer matches any of the gold answers.
133
+
134
+ **Parameters**
135
+
136
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
137
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
138
+ - `question` (str): The question for which the answers are being evaluated.
139
+
140
+ **Returns**
141
+
142
+ - `boolean`: True if the candidate answer matches any of the reference answers, and False otherwise.
143
+
144
+
145
  ```python
146
  from qa_metrics.pedant import PEDANT
147
 
 
165
  '''
166
  ```
167
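Since the snippet above is truncated by the diff, a fuller sketch of the three calls documented in this section might look like the following; the question and answer strings are placeholders, and the method names follow the descriptions above.

```python
from qa_metrics.pedant import PEDANT

pedant = PEDANT()

# Placeholder question, gold answers, and candidate answer (illustrative only).
question = "What is the capital of France?"
reference_answer = ["Paris", "City of Paris"]
candidate_answer = "The capital is Paris."

# Gold/candidate pair with the highest matching score.
print(pedant.get_highest_score(reference_answer, candidate_answer, question))

# Matching scores for every gold/candidate pair.
print(pedant.get_scores(reference_answer, candidate_answer, question))

# Boolean: does the candidate match any of the gold answers?
print(pedant.evaluate(reference_answer, candidate_answer, question))
```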
 
168
+ ## Transformer Neural Evaluation
169
+ Our fine-tuned BERT model is on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our package also supports downloading and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
170
+
171
+ #### `transformer_match`
172
+
173
+ Returns True if the candidate answer matches any of the gold answers.
174
+
175
+ **Parameters**
176
+
177
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
178
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
179
+ - `question` (str): The question for which the answers are being evaluated.
180
+
181
+ **Returns**
182
+
183
+ - `boolean`: True if the candidate answer matches any of the reference answers, and False otherwise.
184
+
185
+ ```python
186
+ from qa_metrics.transformerMatcher import TransformerMatcher
187
+
188
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
189
+ # Supported models: roberta-large, roberta, bert, distilbert, distilroberta
190
+ tm = TransformerMatcher("roberta-large")
191
+ scores = tm.get_scores(reference_answer, candidate_answer, question)
192
+ match_result = tm.transformer_match(reference_answer, candidate_answer, question)
193
+ print("Score: %s; bert Match: %s" % (scores, match_result))
194
+ '''
195
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
196
+ '''
197
+ ```
198
+
199
+ ## Prompting LLM For Evaluation
200
+
201
+ Note: The prompting functions can be used for any prompting purpose, not just evaluation.
202
+
203
+ ###### OpenAI
204
+ ```python
205
+ from qa_metrics.prompt_llm import CloseLLM
206
+ model = CloseLLM()
207
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
208
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
209
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
210
+
211
+ '''
212
+ 'correct'
213
+ '''
214
+ ```
215
+
216
+ ###### Anthropic
217
+ ```python
218
+ model = CloseLLM()
219
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
220
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
221
+
222
+ '''
223
+ 'correct'
224
+ '''
225
+ ```
226
+
227
+ ###### deepinfra (See below for descriptions of more models)
228
+ ```python
229
+ from qa_metrics.prompt_open_llm import OpenLLM
230
+ model = OpenLLM()
231
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
232
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
233
+
234
+ '''
235
+ 'correct'
236
+ '''
237
+ ```
238
+
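To plug an LLM judge into the same boolean contract as the other metrics, the returned string can be mapped to True/False. A minimal sketch, assuming the prompt asks the model to answer with exactly "correct" or "incorrect" (the helper below is not part of the package):

```python
def llm_judgment_to_bool(judgment: str) -> bool:
    """Map a judge model's 'correct'/'incorrect' reply to a boolean.

    Anything other than a reply starting with 'correct' is treated as incorrect.
    """
    return judgment.strip().strip("'\"").lower().startswith("correct")

# Example (hypothetical): wrap any of the prompting calls shown above.
# is_correct = llm_judgment_to_bool(
#     model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo',
#                      temperature=0.1, max_tokens=10)
# )
```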
239
  If you find this repo helpful, please cite our paper:
240
  ```bibtex
241
  @misc{li2024panda,