ingyu committed on
Commit
0aed8fc
1 Parent(s): 8d368eb

Add KLUE-MRC metric (F1 and EM)

Files changed (4)
  1. README.md +97 -28
  2. compute_score.py +315 -0
  3. klue_mrc.py +104 -57
  4. tests.py +0 -17
README.md CHANGED
@@ -1,50 +1,119 @@
1
  ---
2
  title: KLUE MRC
3
- datasets:
4
- -
5
- tags:
6
- - evaluate
7
- - metric
8
- description: "TODO: add a description here"
9
  sdk: gradio
10
  sdk_version: 3.19.1
11
  app_file: app.py
12
  pinned: false
13
  ---
14
 
15
- # Metric Card for KLUE MRC
 
 
16
 
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
 
19
- ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
- ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
- ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
 
 
36
 
37
- #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
 
 
39
 
40
- ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
42
 
43
- ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
  ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
- ## Further References
50
- *Add any useful further references.*
1
  ---
2
  title: KLUE MRC
3
+ emoji: 🤗
4
+ colorFrom: blue
5
+ colorTo: red
 
 
 
6
  sdk: gradio
7
  sdk_version: 3.19.1
8
  app_file: app.py
9
  pinned: false
10
+ tags:
11
+ - evaluate
12
+ - metric
13
+ description: >-
14
+ This metric wraps the unofficial scoring script for the [Machine Reading Comprehension task of
15
+ Korean Language Understanding Evaluation (KLUE-MRC)](https://huggingface.co/datasets/klue/viewer/mrc/train).
16
+
17
+ KLUE-MRC is a Korean reading comprehension dataset consisting of questions where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
18
+
19
+ As KLUE-MRC has the same task format as SQuAD 2.0, this evaluation script uses the same metrics as SQuAD 2.0 (F1 and EM).
20
+
21
+ KLUE-MRC consists of 12,286 question paraphrasing, 7,931 multi-sentence reasoning, and 9,269 unanswerable questions. In total, 29,313 examples were created from 22,343 documents and 23,717 passages.
22
  ---
23
 
24
+ # Metric Card for KLUE-MRC
25
+
26
+ Please note that as KLUE-MRC has the same task format as SQuAD 2.0, this evaluation script follows almost the same format as the official evaluation script for SQuAD 2.0.
27
 
28
+ ## Metric description
29
 
30
+ This metric wraps the unofficial scoring script for the [Machine Reading Comprehension task of
31
+ Korean Language Understanding Evaluation (KLUE-MRC)](https://huggingface.co/datasets/klue/viewer/mrc/train).
32
 
33
+ KLUE-MRC is a Korean reading comprehension dataset consisting of questions where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
 
34
 
35
+ As KLUE-MRC has the same task format as SQuAD 2.0, this evaluation script uses the same metrics as SQuAD 2.0 (F1 and EM).
36
 
37
+ KLUE-MRC consists of 12,286 question paraphrasing, 7,931 multi-sentence reasoning, and 9,269 unanswerable questions. In total, 29,313 examples were created from 22,343 documents and 23,717 passages.
 
 
38
 
39
+ ## How to use
40
 
41
+ The metric takes two lists: one of model predictions and one of references to compare the predictions against.
42
 
43
+ *Predictions*: a list of question-answer dictionaries to score, each with the following key-value pairs:
44
+ * `'id'`: the identifier of the question-answer pair
45
+ * `'prediction_text'`: the text of the predicted answer
46
+ * `'no_answer_probability'`: the probability that the question has no answer
47

48
+ *References*: a list of question-answer dictionaries, each with the following key-value pairs:
49
+ * `'id'`: the id of the question-answer pair (see above)
50
+ * `'answers'`: a dictionary with a list of answer texts (`'text'`) and a list of their character start positions (`'answer_start'`)
51
+ * `'unanswerable'`: a boolean indicating whether the question is unanswerable
52
 
53
+ ```python
54
+ from evaluate import load
55
+ klue_mrc_metric = load("ingyu/klue_mrc")
56
+ results = klue_mrc_metric.compute(predictions=predictions, references=references)
57
+ ```
58
+
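+ The repository also provides `compute_score.py`, a standalone, file-based version of the same scoring logic (mirroring the official SQuAD 2.0 script). Below is a minimal sketch of invoking it from Python; the file names are placeholders, the gold file is assumed to follow the KLUE-MRC JSON format (`"data"` -> `"paragraphs"` -> `"qas"`), and the prediction file is assumed to map each question id to its predicted answer string:
+
+ ```python
+ import subprocess
+
+ # File names below are placeholders, not files shipped with this repository.
+ subprocess.run(
+     [
+         "python", "compute_score.py",
+         "klue_mrc_dev.json",               # gold data in KLUE-MRC JSON format
+         "predictions.json",                # {"question_id": "predicted answer", ...}
+         "--na-prob-file", "na_prob.json",  # optional per-question no-answer probabilities
+         "--out-file", "eval.json",         # optional: write metrics to a file instead of stdout
+     ],
+     check=True,
+ )
+ ```
+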
59
+ ## Output values
60
+
61
+ This metric outputs a dictionary with up to 13 values (the `HasAns_*` and `NoAns_*` entries are only present when the references contain answerable and unanswerable questions, respectively):
62
+ * `'exact'`: Exact match (the normalized answer exactly matches the gold answer) (see also the `exact_match` metric)
63
+ * `'f1'`: The average F1-score of predicted tokens versus the gold answer (see the [F1 score](https://huggingface.co/metrics/f1) metric)
64
+ * `'total'`: Number of scores considered
65
+ * `'HasAns_exact'`: Exact match over the answerable questions only
66
+ * `'HasAns_f1'`: The F-score of predicted tokens versus the gold answer, over the answerable questions only
67
+ * `'HasAns_total'`: How many of the questions have answers
68
+ * `'NoAns_exact'`: Exact match over the unanswerable questions only
69
+ * `'NoAns_f1'`: The F-score over the unanswerable questions only
70
+ * `'NoAns_total'`: How many of the questions have no answers
71
+ * `'best_exact'`: Best exact match (with varying threshold)
72
+ * `'best_exact_thresh'`: No-answer probability threshold associated with the best exact match
73
+ * `'best_f1'`: Best F1 score (with varying threshold)
74
+ * `'best_f1_thresh'`: No-answer probability threshold associated with the best F1 score
75
+
76
+
77
+ The range of `exact` is 0-100, where 0.0 means no answers were matched exactly and 100.0 means all answers were matched exactly.
78
+
79
+ The range of `f1` is also 0-100: its lowest possible value is 0, if either the precision or the recall is 0 for every question, and its highest possible value is 100.0, which means perfect precision and recall.
80
+
81
+ `total` is the number of question-answer pairs that were scored, so its maximal value is the number of questions in the predictions and references.
82
+
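+ `compute` also accepts a `no_answer_threshold` keyword argument (default `1.0`): any prediction whose `no_answer_probability` exceeds this threshold is scored as if the model had predicted "no answer". A small sketch, reusing `predictions` and `references` lists in the format described above and an arbitrary illustrative threshold of `0.5`:
+
+ ```python
+ # 0.5 is an arbitrary illustrative threshold, not a recommended value.
+ results = klue_mrc_metric.compute(
+     predictions=predictions,
+     references=references,
+     no_answer_threshold=0.5,
+ )
+ ```
+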
83
+ ## Example
84
+
85
+ ```python
86
+ from evaluate import load
87
+ klue_mrc_metric = load("ingyu/klue_mrc")
88
+ predictions = [{'prediction_text': '2020', 'id': 'klue-mrc-v1_train_12311', 'no_answer_probability': 0.}]
89
+ references = [{'answers': {'answer_start': [38], 'text': ['2020']}, 'id': 'klue-mrc-v1_train_12311', 'unanswerable': False}]
90
+ results = klue_mrc_metric.compute(predictions=predictions, references=references)
91
+ results
92
+ {'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
93
+ ```
94
+
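+ The example above scores an answerable question. The sketch below shows the unanswerable case (the question id and probability are made up for illustration): the reference sets `'unanswerable': True` with empty answer lists, and the prediction supplies an empty `prediction_text` together with its estimated `no_answer_probability`:
+
+ ```python
+ from evaluate import load
+ klue_mrc_metric = load("ingyu/klue_mrc")
+ # Hypothetical unanswerable example; the id and probability are illustrative only.
+ predictions = [{'prediction_text': '', 'id': 'klue-mrc-v1_train_00000', 'no_answer_probability': 0.9}]
+ references = [{'answers': {'answer_start': [], 'text': []}, 'id': 'klue-mrc-v1_train_00000', 'unanswerable': True}]
+ results = klue_mrc_metric.compute(predictions=predictions, references=references)
+ # With the empty prediction, 'NoAns_exact' and 'NoAns_f1' should both be 100.0 here.
+ ```
+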
95
+ ## Limitations
96
+ This metric works only with datasets in the same format as [KLUE-MRC](https://huggingface.co/datasets/klue/viewer/mrc/train).
97
 
 
 
98
 
99
  ## Citation
 
100
 
101
+ ```bibtex
102
+ @inproceedings{NEURIPS DATASETS AND BENCHMARKS2021_98dce83d,
103
+ author = {Park, Sungjoon and Moon, Jihyung and Kim, Sungdong and Cho, Won Ik and Han, Ji Yoon and Park, Jangwon and Song, Chisung and Kim, Junseong and Song, Youngsook and Oh, Taehwan and Lee, Joohong and Oh, Juhyun and Lyu, Sungwon and Jeong, Younghoon and Lee, Inkwon and Seo, Sangwoo and Lee, Dongjun and Kim, Hyunwoo and Lee, Myeonghwa and Jang, Seongbo and Do, Seungwon and Kim, Sunkyoung and Lim, Kyungtae and Lee, Jongwon and Park, Kyumin and Shin, Jamin and Kim, Seonghyun and Park, Lucy and Park, Lucy and Oh, Alice and Ha (NAVER AI Lab), Jung-Woo and Cho, Kyunghyun and Cho, Kyunghyun},
104
+ booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
105
+ editor = {J. Vanschoren and S. Yeung},
106
+ pages = {},
107
+ publisher = {Curran},
108
+ title = {KLUE: Korean Language Understanding Evaluation},
109
+ url = {https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/98dce83da57b0395e163467c9dae521b-Paper-round2.pdf},
110
+ volume = {1},
111
+ year = {2021}
112
+ }
113
+ ```
114
+
115
+ ## Further References
116
+
117
+ - [LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) leverages this scoring script for the evaluation of [KLUE-MRC](https://huggingface.co/datasets/klue/viewer/mrc/train).
118
+ - [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
119
+ - [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)
compute_score.py ADDED
@@ -0,0 +1,315 @@
1
+ """Unofficial evaluation script for KLUE-MRC.
2
+
3
+ Please note that as KLUE-MRC has the same task format as SQuAD 2.0,
4
+ this evaluation script follows almost the same format as the official evaluation script for SQuAD 2.0.
5
+ """
6
+ import argparse
7
+ import collections
8
+ import json
9
+ import os
10
+ import string
11
+ import sys
12
+
13
+ import numpy as np
14
+
15
+
16
+ OPTS = None
17
+
18
+
19
+ def parse_args():
20
+ parser = argparse.ArgumentParser("Unofficial evaluation script for KLUE-MRC.")
21
+ parser.add_argument("data_file", metavar="data.json", help="Input data JSON file.")
22
+ parser.add_argument("pred_file", metavar="pred.json", help="Model predictions.")
23
+ parser.add_argument(
24
+ "--out-file", "-o", metavar="eval.json", help="Write accuracy metrics to file (default is stdout)."
25
+ )
26
+ parser.add_argument(
27
+ "--na-prob-file", "-n", metavar="na_prob.json", help="Model estimates of probability of no answer."
28
+ )
29
+ parser.add_argument(
30
+ "--na-prob-thresh",
31
+ "-t",
32
+ type=float,
33
+ default=1.0,
34
+ help='Predict "" if no-answer probability exceeds this (default = 1.0).',
35
+ )
36
+ parser.add_argument(
37
+ "--out-image-dir", "-p", metavar="out_images", default=None, help="Save precision-recall curves to directory."
38
+ )
39
+ parser.add_argument("--verbose", "-v", action="store_true")
40
+ if len(sys.argv) == 1:
41
+ parser.print_help()
42
+ sys.exit(1)
43
+ return parser.parse_args()
44
+
45
+
46
+ def make_qid_to_has_ans(dataset):
47
+ qid_to_has_ans = {}
48
+ for article in dataset:
49
+ for p in article["paragraphs"]:
50
+ for qa in p["qas"]:
51
+ qid_to_has_ans[qa["id"]] = not bool(qa["unanswerable"])
52
+ return qid_to_has_ans
53
+
54
+
55
+ def normalize_answer(s):
56
+ """Lower text and remove punctuation, articles and extra whitespace."""
57
+
58
+ def white_space_fix(text):
59
+ return " ".join(text.split())
60
+
61
+ def remove_punc(text):
62
+ exclude = set(string.punctuation)
63
+ return "".join(ch for ch in text if ch not in exclude)
64
+
65
+ def lower(text):
66
+ return text.lower()
67
+
68
+ return white_space_fix(remove_punc(lower(s)))
69
+
70
+
71
+ def get_tokens(s):
72
+ if not s:
73
+ return []
74
+ return normalize_answer(s).split()
75
+
76
+
77
+ def compute_exact(a_gold, a_pred):
78
+ return int(normalize_answer(a_gold) == normalize_answer(a_pred))
79
+
80
+
81
+ def compute_f1(a_gold, a_pred):
82
+ gold_toks = get_tokens(a_gold)
83
+ pred_toks = get_tokens(a_pred)
84
+ common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
85
+ num_same = sum(common.values())
86
+ if len(gold_toks) == 0 or len(pred_toks) == 0:
87
+ # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
88
+ return int(gold_toks == pred_toks)
89
+ if num_same == 0:
90
+ return 0
91
+ precision = 1.0 * num_same / len(pred_toks)
92
+ recall = 1.0 * num_same / len(gold_toks)
93
+ f1 = (2 * precision * recall) / (precision + recall)
94
+ return f1
95
+
96
+
97
+ def get_raw_scores(dataset, preds):
98
+ exact_scores = {}
99
+ f1_scores = {}
100
+ for article in dataset:
101
+ for p in article["paragraphs"]:
102
+ for qa in p["qas"]:
103
+ qid = qa["id"]
104
+ gold_answers = [t for t in qa["answers"]["text"] if normalize_answer(t)]
105
+ if qa["unanswerable"]:
106
+ # For unanswerable questions, only correct answer is empty string
107
+ gold_answers = [""]
108
+ if qid not in preds:
109
+ print(f"Missing prediction for {qid}")
110
+ continue
111
+ a_pred = preds[qid]
112
+ # Take max over all gold answers
113
+ exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
114
+ f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
115
+ return exact_scores, f1_scores
116
+
117
+
118
+ def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
119
+ new_scores = {}
120
+ for qid, s in scores.items():
121
+ pred_na = na_probs[qid] > na_prob_thresh
122
+ if pred_na:
123
+ new_scores[qid] = float(not qid_to_has_ans[qid])
124
+ else:
125
+ new_scores[qid] = s
126
+ return new_scores
127
+
128
+
129
+ def make_eval_dict(exact_scores, f1_scores, qid_list=None):
130
+ if not qid_list:
131
+ total = len(exact_scores)
132
+ return collections.OrderedDict(
133
+ [
134
+ ("exact", 100.0 * sum(exact_scores.values()) / total),
135
+ ("f1", 100.0 * sum(f1_scores.values()) / total),
136
+ ("total", total),
137
+ ]
138
+ )
139
+ else:
140
+ total = len(qid_list)
141
+ return collections.OrderedDict(
142
+ [
143
+ ("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total),
144
+ ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
145
+ ("total", total),
146
+ ]
147
+ )
148
+
149
+
150
+ def merge_eval(main_eval, new_eval, prefix):
151
+ for k in new_eval:
152
+ main_eval[f"{prefix}_{k}"] = new_eval[k]
153
+
154
+
155
+ def plot_pr_curve(precisions, recalls, out_image, title):
156
+ plt.step(recalls, precisions, color="b", alpha=0.2, where="post")
157
+ plt.fill_between(recalls, precisions, step="post", alpha=0.2, color="b")
158
+ plt.xlabel("Recall")
159
+ plt.ylabel("Precision")
160
+ plt.xlim([0.0, 1.05])
161
+ plt.ylim([0.0, 1.05])
162
+ plt.title(title)
163
+ plt.savefig(out_image)
164
+ plt.clf()
165
+
166
+
167
+ def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans, out_image=None, title=None):
168
+ qid_list = sorted(na_probs, key=lambda k: na_probs[k])
169
+ true_pos = 0.0
170
+ cur_p = 1.0
171
+ cur_r = 0.0
172
+ precisions = [1.0]
173
+ recalls = [0.0]
174
+ avg_prec = 0.0
175
+ for i, qid in enumerate(qid_list):
176
+ if qid_to_has_ans[qid]:
177
+ true_pos += scores[qid]
178
+ cur_p = true_pos / float(i + 1)
179
+ cur_r = true_pos / float(num_true_pos)
180
+ if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
181
+ # i.e., if we can put a threshold after this point
182
+ avg_prec += cur_p * (cur_r - recalls[-1])
183
+ precisions.append(cur_p)
184
+ recalls.append(cur_r)
185
+ if out_image:
186
+ plot_pr_curve(precisions, recalls, out_image, title)
187
+ return {"ap": 100.0 * avg_prec}
188
+
189
+
190
+ def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, out_image_dir):
191
+ if out_image_dir and not os.path.exists(out_image_dir):
192
+ os.makedirs(out_image_dir)
193
+ num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
194
+ if num_true_pos == 0:
195
+ return
196
+ pr_exact = make_precision_recall_eval(
197
+ exact_raw,
198
+ na_probs,
199
+ num_true_pos,
200
+ qid_to_has_ans,
201
+ out_image=os.path.join(out_image_dir, "pr_exact.png"),
202
+ title="Precision-Recall curve for Exact Match score",
203
+ )
204
+ pr_f1 = make_precision_recall_eval(
205
+ f1_raw,
206
+ na_probs,
207
+ num_true_pos,
208
+ qid_to_has_ans,
209
+ out_image=os.path.join(out_image_dir, "pr_f1.png"),
210
+ title="Precision-Recall curve for F1 score",
211
+ )
212
+ oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
213
+ pr_oracle = make_precision_recall_eval(
214
+ oracle_scores,
215
+ na_probs,
216
+ num_true_pos,
217
+ qid_to_has_ans,
218
+ out_image=os.path.join(out_image_dir, "pr_oracle.png"),
219
+ title="Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)",
220
+ )
221
+ merge_eval(main_eval, pr_exact, "pr_exact")
222
+ merge_eval(main_eval, pr_f1, "pr_f1")
223
+ merge_eval(main_eval, pr_oracle, "pr_oracle")
224
+
225
+
226
+ def histogram_na_prob(na_probs, qid_list, image_dir, name):
227
+ if not qid_list:
228
+ return
229
+ x = [na_probs[k] for k in qid_list]
230
+ weights = np.ones_like(x) / float(len(x))
231
+ plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
232
+ plt.xlabel("Model probability of no-answer")
233
+ plt.ylabel("Proportion of dataset")
234
+ plt.title(f"Histogram of no-answer probability: {name}")
235
+ plt.savefig(os.path.join(image_dir, f"na_prob_hist_{name}.png"))
236
+ plt.clf()
237
+
238
+
239
+ def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
240
+ num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
241
+ cur_score = num_no_ans
242
+ best_score = cur_score
243
+ best_thresh = 0.0
244
+ qid_list = sorted(na_probs, key=lambda k: na_probs[k])
245
+ for i, qid in enumerate(qid_list):
246
+ if qid not in scores:
247
+ continue
248
+ if qid_to_has_ans[qid]:
249
+ diff = scores[qid]
250
+ else:
251
+ if preds[qid]:
252
+ diff = -1
253
+ else:
254
+ diff = 0
255
+ cur_score += diff
256
+ if cur_score > best_score:
257
+ best_score = cur_score
258
+ best_thresh = na_probs[qid]
259
+ return 100.0 * best_score / len(scores), best_thresh
260
+
261
+
262
+ def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
263
+ best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
264
+ best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
265
+ main_eval["best_exact"] = best_exact
266
+ main_eval["best_exact_thresh"] = exact_thresh
267
+ main_eval["best_f1"] = best_f1
268
+ main_eval["best_f1_thresh"] = f1_thresh
269
+
270
+
271
+ def main():
272
+ with open(OPTS.data_file) as f:
273
+ dataset_json = json.load(f)
274
+ dataset = dataset_json["data"]
275
+ with open(OPTS.pred_file) as f:
276
+ preds = json.load(f)
277
+ if OPTS.na_prob_file:
278
+ with open(OPTS.na_prob_file) as f:
279
+ na_probs = json.load(f)
280
+ else:
281
+ na_probs = {k: 0.0 for k in preds}
282
+ qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
283
+ has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
284
+ no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
285
+ exact_raw, f1_raw = get_raw_scores(dataset, preds)
286
+ exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
287
+ f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
288
+ out_eval = make_eval_dict(exact_thresh, f1_thresh)
289
+ if has_ans_qids:
290
+ has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
291
+ merge_eval(out_eval, has_ans_eval, "HasAns")
292
+ if no_ans_qids:
293
+ no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
294
+ merge_eval(out_eval, no_ans_eval, "NoAns")
295
+ if OPTS.na_prob_file:
296
+ find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans)
297
+ if OPTS.na_prob_file and OPTS.out_image_dir:
298
+ run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, OPTS.out_image_dir)
299
+ histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, "hasAns")
300
+ histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, "noAns")
301
+ if OPTS.out_file:
302
+ with open(OPTS.out_file, "w") as f:
303
+ json.dump(out_eval, f)
304
+ else:
305
+ print(json.dumps(out_eval, indent=2))
306
+
307
+
308
+ if __name__ == "__main__":
309
+ OPTS = parse_args()
310
+ if OPTS.out_image_dir:
311
+ import matplotlib
312
+
313
+ matplotlib.use("Agg")
314
+ import matplotlib.pyplot as plt
315
+ main()
klue_mrc.py CHANGED
@@ -11,85 +11,132 @@
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
- """TODO: Add a description here."""
15
 
16
- import evaluate
17
  import datasets
18
19
 
20
- # TODO: Add BibTeX citation
21
- _CITATION = """\
22
- @InProceedings{huggingface:module,
23
- title = {A great new module},
24
- authors={huggingface, Inc.},
25
- year={2020}
 
 
 
 
 
26
  }
27
  """
28
 
29
- # TODO: Add description of the module here
30
- _DESCRIPTION = """\
31
- This new module is designed to solve this great ML task and is crafted with a lot of care.
32
- """
 
 
 
 
 
33
 
 
 
 
34
 
35
- # TODO: Add description of the arguments of the module here
36
  _KWARGS_DESCRIPTION = """
37
- Calculates how good are predictions given some references, using certain scores
38
  Args:
39
- predictions: list of predictions to score. Each predictions
40
- should be a string with tokens separated by spaces.
41
- references: list of reference for each prediction. Each
42
- reference should be a string with tokens separated by spaces.
 
 
 
 
 
43
  Returns:
44
- accuracy: description of the first score,
45
- another_score: description of the second score,
46
  Examples:
47
- Examples should be written in doctest format, and should illustrate how
48
- to use the function.
49
 
50
- >>> my_new_module = evaluate.load("my_new_module")
51
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
 
 
52
  >>> print(results)
53
- {'accuracy': 1.0}
54
  """
55
 
56
- # TODO: Define external resources urls if needed
57
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
58
-
59
-
60
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
61
  class KLUEMRC(evaluate.Metric):
62
- """TODO: Short description of my evaluation module."""
63
-
64
  def _info(self):
65
- # TODO: Specifies the evaluate.EvaluationModuleInfo object
66
- return evaluate.MetricInfo(
67
- # This is the description that will appear on the modules page.
68
- module_type="metric",
69
  description=_DESCRIPTION,
70
  citation=_CITATION,
71
  inputs_description=_KWARGS_DESCRIPTION,
72
- # This defines the format of each prediction and reference
73
- features=datasets.Features({
74
- 'predictions': datasets.Value('int64'),
75
- 'references': datasets.Value('int64'),
76
- }),
77
- # Homepage of the module for documentation
78
- homepage="http://module.homepage",
79
- # Additional links to the codebase or references
80
- codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
81
- reference_urls=["http://path.to.reference.url/new_module"]
82
  )
83
 
84
- def _download_and_prepare(self, dl_manager):
85
- """Optional: download external resources useful to compute the scores"""
86
- # TODO: Download external resources if needed
87
- pass
88
-
89
- def _compute(self, predictions, references):
90
- """Returns the scores"""
91
- # TODO: Compute the different scores of the module
92
- accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
93
- return {
94
- "accuracy": accuracy,
95
- }
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
+ """ KLUE-MRC metric. """
15
 
 
16
  import datasets
17
 
18
+ import evaluate
19
+
20
+ from .compute_score import (
21
+ apply_no_ans_threshold,
22
+ find_all_best_thresh,
23
+ get_raw_scores,
24
+ make_eval_dict,
25
+ make_qid_to_has_ans,
26
+ merge_eval,
27
+ )
28
+
29
 
30
+ _CITATION = """
31
+ @inproceedings{NEURIPS DATASETS AND BENCHMARKS2021_98dce83d,
32
+ author = {Park, Sungjoon and Moon, Jihyung and Kim, Sungdong and Cho, Won Ik and Han, Ji Yoon and Park, Jangwon and Song, Chisung and Kim, Junseong and Song, Youngsook and Oh, Taehwan and Lee, Joohong and Oh, Juhyun and Lyu, Sungwon and Jeong, Younghoon and Lee, Inkwon and Seo, Sangwoo and Lee, Dongjun and Kim, Hyunwoo and Lee, Myeonghwa and Jang, Seongbo and Do, Seungwon and Kim, Sunkyoung and Lim, Kyungtae and Lee, Jongwon and Park, Kyumin and Shin, Jamin and Kim, Seonghyun and Park, Lucy and Park, Lucy and Oh, Alice and Ha (NAVER AI Lab), Jung-Woo and Cho, Kyunghyun and Cho, Kyunghyun},
33
+ booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
34
+ editor = {J. Vanschoren and S. Yeung},
35
+ pages = {},
36
+ publisher = {Curran},
37
+ title = {KLUE: Korean Language Understanding Evaluation},
38
+ url = {https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/98dce83da57b0395e163467c9dae521b-Paper-round2.pdf},
39
+ volume = {1},
40
+ year = {2021}
41
  }
42
  """
43
 
44
+ _DESCRIPTION = """
45
+ This metric wraps the unofficial scoring script for the Machine Reading Comprehension task of
46
+ Korean Language Understanding Evaluation (KLUE-MRC).
47
+
48
+ KLUE-MRC is a Korean reading comprehension dataset consisting of questions where the answer to every
49
+ question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
50
+
51
+ As KLUE-MRC has the same task format as SQuAD 2.0, this evaluation script uses
52
+ the same metrics as SQuAD 2.0 (F1 and EM).
53
 
54
+ KLUE-MRC consists of 12,286 question paraphrasing, 7,931 multi-sentence reasoning, and 9,269 unanswerable questions.
55
+ In total, 29,313 examples were created from 22,343 documents and 23,717 passages.
56
+ """
57
 
 
58
  _KWARGS_DESCRIPTION = """
59
+ Computes KLUE-MRC scores (F1 and EM).
60
  Args:
61
+ predictions: List of question-answer dictionaries to score, each with the following elements:
62
+ - 'id': the question-answer 'id' field as given in the references (see below)
63
+ - 'prediction_text': the text of the predicted answer
64
+ - 'no_answer_probability': the probability that the question has no answer
65
+ references: List of question-answer dictionaries with the following key-value pairs:
66
+ - 'id': id of the question-answer pair (see above)
67
+ - 'answers': a dictionary with a list of answer texts ('text') and a list of answer start positions ('answer_start')
+ - 'unanswerable': a boolean indicating whether the question is unanswerable
68
+ no_answer_threshold: float
69
+ Probability threshold to decide that a question has no answer.
70
  Returns:
71
+ 'exact': Exact match (the normalized answer exactly matches the gold answer)
72
+ 'f1': The F-score of predicted tokens versus the gold answer
73
+ 'total': Number of scores considered
74
+ 'HasAns_exact': Exact match over the answerable questions only
75
+ 'HasAns_f1': The F-score over the answerable questions only
76
+ 'HasAns_total': Number of answerable questions considered
77
+ 'NoAns_exact': Exact match over the unanswerable questions only
78
+ 'NoAns_f1': The F-score over the unanswerable questions only
79
+ 'NoAns_total': Number of unanswerable questions considered
80
+ 'best_exact': Best exact match (with varying threshold)
81
+ 'best_exact_thresh': No-answer probability threshold associated with the best exact match
82
+ 'best_f1': Best F1 (with varying threshold)
83
+ 'best_f1_thresh': No-answer probability threshold associated with the best F1
84
  Examples:
 
 
85
 
86
+ >>> predictions = [{'prediction_text': '2020', 'id': 'klue-mrc-v1_train_12311', 'no_answer_probability': 0.}]
87
+ >>> references = [{'id': 'klue-mrc-v1_train_12311', 'answers': { "answer_start": [ 38 ], "text": [ "2020" ] }, 'unanswerable': False}]
88
+ >>> klue_mrc_metric = evaluate.load("ingyu/klue_mrc")
89
+ >>> results = klue_mrc_metric.compute(predictions=predictions, references=references)
90
  >>> print(results)
91
+ {'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
92
  """
93
 
 
 
 
 
94
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
95
  class KLUEMRC(evaluate.Metric):
 
 
96
  def _info(self):
97
+ return datasets.MetricInfo(
 
 
 
98
  description=_DESCRIPTION,
99
  citation=_CITATION,
100
  inputs_description=_KWARGS_DESCRIPTION,
101
+ features=datasets.Features(
102
+ {
103
+ "predictions": {
104
+ "id": datasets.Value("string"),
105
+ "prediction_text": datasets.Value("string"),
106
+ "no_answer_probability": datasets.Value("float32"),
107
+ },
108
+ "references": {
109
+ "id": datasets.Value("string"),
110
+ "answers": datasets.features.Sequence(
111
+ {"text": datasets.Value("string"), "answer_start": datasets.Value("int32")}
112
+ ),
113
+ "unanswerable": datasets.Value("bool"),
114
+ },
115
+ }
116
+ ),
117
+ codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
118
+ reference_urls=["https://klue-benchmark.com/tasks/72/overview/description"],
119
  )
120
 
121
+ def _compute(self, predictions, references, no_answer_threshold=1.0):
122
+ no_answer_probabilities = {p["id"]: p["no_answer_probability"] for p in predictions}
123
+ dataset = [{"paragraphs": [{"qas": references}]}]
124
+ predictions = {p["id"]: p["prediction_text"] for p in predictions}
125
+
126
+ qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
127
+ has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
128
+ no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
129
+
130
+ exact_raw, f1_raw = get_raw_scores(dataset, predictions)
131
+ exact_thresh = apply_no_ans_threshold(exact_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
132
+ f1_thresh = apply_no_ans_threshold(f1_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
133
+ out_eval = make_eval_dict(exact_thresh, f1_thresh)
134
+
135
+ if has_ans_qids:
136
+ has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
137
+ merge_eval(out_eval, has_ans_eval, "HasAns")
138
+ if no_ans_qids:
139
+ no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
140
+ merge_eval(out_eval, no_ans_eval, "NoAns")
141
+ find_all_best_thresh(out_eval, predictions, exact_raw, f1_raw, no_answer_probabilities, qid_to_has_ans)
142
+ return dict(out_eval)
tests.py DELETED
@@ -1,17 +0,0 @@
1
- test_cases = [
2
- {
3
- "predictions": [0, 0],
4
- "references": [1, 1],
5
- "result": {"metric_score": 0}
6
- },
7
- {
8
- "predictions": [1, 1],
9
- "references": [1, 1],
10
- "result": {"metric_score": 1}
11
- },
12
- {
13
- "predictions": [1, 0],
14
- "references": [1, 1],
15
- "result": {"metric_score": 0.5}
16
- }
17
- ]