Commit 0918cc9 by lvwerra (HF staff)
Parent: 459cbf6

Update Space (evaluate main: 828c6327)

Files changed (5):
  1. README.md +124 -5
  2. app.py +6 -0
  3. compute_score.py +323 -0
  4. requirements.txt +3 -0
  5. squad_v2.py +137 -0
README.md CHANGED
@@ -1,12 +1,131 @@
  ---
- title: Squad_v2
- emoji: 🐨
- colorFrom: indigo
- colorTo: gray
+ title: SQuAD v2
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for SQuAD v2
+
+ ## Metric description
+ This metric wraps the official scoring script for version 2 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad_v2).
+
+ SQuAD is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
+
+ SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
+
+ ## How to use
+
+ The metric takes two lists of dictionaries - one holding the model predictions and the other the reference answers to compare them to.
+
+ *Predictions*: a list of dictionaries, one per question-answer pair to score, with the following key-value pairs:
+ * `'id'`: the identifier of the question-answer pair
+ * `'prediction_text'`: the text of the answer
+ * `'no_answer_probability'`: the probability that the question has no answer
+
+ *References*: a list of question-answer dictionaries with the following key-value pairs:
+ * `'id'`: the id of the question-answer pair (see above)
+ * `'answers'`: a dictionary with a `'text'` list containing the gold answer string(s) and an `'answer_start'` list with their character positions
+
+ The `compute` method also accepts an optional `no_answer_threshold` argument (default `1.0`): the no-answer probability above which a question is treated as unanswerable (see the example below).
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad_v2")
+ results = squad_metric.compute(predictions=predictions, references=references)
+ ```
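+
+ If your model also produces a meaningful `no_answer_probability` for each question, you can pass the optional `no_answer_threshold` argument to `compute`. The snippet below is only a sketch: the `predictions` and `references` variables are assumed to be formatted as described above, and the 0.5 threshold is an arbitrary illustrative choice.
+
+ ```python
+ from evaluate import load
+
+ squad_metric = load("squad_v2")
+ # Questions whose predicted no-answer probability exceeds the threshold
+ # are scored as abstentions by the underlying scoring script.
+ results = squad_metric.compute(
+     predictions=predictions,
+     references=references,
+     no_answer_threshold=0.5,
+ )
+ ```
+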
+ ## Output values
+
+ This metric outputs a dictionary with 13 values:
+ * `'exact'`: Exact match (the normalized answer exactly matches the gold answer) (see also the `exact_match` metric, forthcoming)
+ * `'f1'`: The average F1-score of predicted tokens versus the gold answer (see the [F1 score](https://huggingface.co/metrics/f1) metric)
+ * `'total'`: Number of scores considered
+ * `'HasAns_exact'`: Exact match (the normalized answer exactly matches the gold answer) over the answerable questions
+ * `'HasAns_f1'`: The F-score of predicted tokens versus the gold answer over the answerable questions
+ * `'HasAns_total'`: How many of the questions have answers
+ * `'NoAns_exact'`: Exact match (the normalized answer exactly matches the gold answer) over the unanswerable questions
+ * `'NoAns_f1'`: The F-score of predicted tokens versus the gold answer over the unanswerable questions
+ * `'NoAns_total'`: How many of the questions have no answers
+ * `'best_exact'`: Best exact match (with varying threshold)
+ * `'best_exact_thresh'`: No-answer probability threshold associated with the best exact match
+ * `'best_f1'`: Best F1 score (with varying threshold)
+ * `'best_f1_thresh'`: No-answer probability threshold associated with the best F1
+
+ The range of `exact` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
+
+ The range of `f1` is also 0-100: the token-level F1 score is averaged over all questions and scaled by 100, so 0.0 means no token overlap between predictions and gold answers and 100.0 means perfect precision and recall on every question.
+
+ `total` is simply the number of question-answer pairs considered, i.e. the length of the predictions and references lists.
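+
+ For intuition, the per-question F1 is a token-overlap score, computed by the bundled `compute_score.py` after lowercasing answers and stripping punctuation and articles. The snippet below is a simplified sketch of that computation (it skips the normalization and no-answer handling of the real script):
+
+ ```python
+ import collections
+
+ def token_f1(gold: str, pred: str) -> float:
+     # Count overlapping tokens, then combine precision and recall,
+     # mirroring compute_f1 in compute_score.py (minus normalization).
+     gold_toks, pred_toks = gold.split(), pred.split()
+     common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
+     num_same = sum(common.values())
+     if num_same == 0:
+         return 0.0
+     precision = num_same / len(pred_toks)
+     recall = num_same / len(gold_toks)
+     return 2 * precision * recall / (precision + recall)
+
+ print(token_f1("Beyoncé and Bruno Mars", "Bruno Mars"))  # ≈ 0.67
+ ```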
+
+ ### Values from popular papers
+ The [SQuAD v2 paper](https://arxiv.org/pdf/1806.03822.pdf) reported an F1 score of 66.3% and an Exact Match score of 63.4% for the best-performing system they evaluated.
+ They also report that human performance on the dataset corresponds to an F1 score of 89.5% and an Exact Match score of 86.9%.
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
+
+ ## Examples
+
+ Maximal values for both exact match and F1 (perfect match):
+
+ ```python
+ from evaluate import load
+ squad_v2_metric = load("squad_v2")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_v2_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
+ ```
+
+ Minimal values for both exact match and F1 (no match):
+
+ ```python
+ from evaluate import load
+ squad_v2_metric = load("squad_v2")
+ predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_v2_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact': 0.0, 'f1': 0.0, 'total': 1, 'HasAns_exact': 0.0, 'HasAns_f1': 0.0, 'HasAns_total': 1, 'best_exact': 0.0, 'best_exact_thresh': 0.0, 'best_f1': 0.0, 'best_f1_thresh': 0.0}
+ ```
+
+ Partial match (2 out of 3 answers correct):
+
+ ```python
+ from evaluate import load
+ squad_v2_metric = load("squad_v2")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
+ results = squad_v2_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact': 66.66666666666667, 'f1': 66.66666666666667, 'total': 3, 'HasAns_exact': 66.66666666666667, 'HasAns_f1': 66.66666666666667, 'HasAns_total': 3, 'best_exact': 66.66666666666667, 'best_exact_thresh': 0.0, 'best_f1': 66.66666666666667, 'best_f1_thresh': 0.0}
+ ```
+
+ ## Limitations and bias
+ This metric works only with datasets in the same format as the [SQuAD v2 dataset](https://huggingface.co/datasets/squad_v2).
+
+ The SQuAD datasets contain a certain amount of noise, such as duplicate questions and missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflects whether models do better on certain types of questions (e.g. "who" questions) or on questions covering a certain gender or geographical area; carrying out more in-depth error analysis, for instance by scoring subsets of the data separately as sketched below, can complement these numbers.
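+
+ One lightweight way to run such an analysis is to group the examples yourself and call `compute` once per group. The sketch below assumes you already have parallel `predictions` and `references` lists in the format above, plus a hypothetical `question_types` list with one label per example; none of these variables come from the metric itself:
+
+ ```python
+ from collections import defaultdict
+ from evaluate import load
+
+ squad_metric = load("squad_v2")
+
+ # Bucket the examples by a user-defined label (e.g. "who", "when", "why").
+ groups = defaultdict(lambda: {"predictions": [], "references": []})
+ for pred, ref, qtype in zip(predictions, references, question_types):
+     groups[qtype]["predictions"].append(pred)
+     groups[qtype]["references"].append(ref)
+
+ # Score each bucket separately to see where the model struggles.
+ for qtype, batch in groups.items():
+     scores = squad_metric.compute(**batch)
+     print(qtype, scores["exact"], scores["f1"])
+ ```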
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{Rajpurkar2018SQuAD2,
+   title={Know What You Don't Know: Unanswerable Questions for SQuAD},
+   author={Pranav Rajpurkar and Jian Zhang and Percy Liang},
+   booktitle={ACL 2018},
+   year={2018}
+ }
+ ```
+
+ ## Further References
+
+ - [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
+ - [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("squad_v2")
+ launch_gradio_widget(module)
compute_score.py ADDED
@@ -0,0 +1,323 @@
+ """Official evaluation script for SQuAD version 2.0.
+
+ In addition to basic functionality, we also compute additional statistics and
+ plot precision-recall curves if an additional na_prob.json file is provided.
+ This file is expected to map question ID's to the model's predicted probability
+ that a question is unanswerable.
+ """
+ import argparse
+ import collections
+ import json
+ import os
+ import re
+ import string
+ import sys
+
+ import numpy as np
+
+
+ ARTICLES_REGEX = re.compile(r"\b(a|an|the)\b", re.UNICODE)
+
+ OPTS = None
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser("Official evaluation script for SQuAD version 2.0.")
+     parser.add_argument("data_file", metavar="data.json", help="Input data JSON file.")
+     parser.add_argument("pred_file", metavar="pred.json", help="Model predictions.")
+     parser.add_argument(
+         "--out-file", "-o", metavar="eval.json", help="Write accuracy metrics to file (default is stdout)."
+     )
+     parser.add_argument(
+         "--na-prob-file", "-n", metavar="na_prob.json", help="Model estimates of probability of no answer."
+     )
+     parser.add_argument(
+         "--na-prob-thresh",
+         "-t",
+         type=float,
+         default=1.0,
+         help='Predict "" if no-answer probability exceeds this (default = 1.0).',
+     )
+     parser.add_argument(
+         "--out-image-dir", "-p", metavar="out_images", default=None, help="Save precision-recall curves to directory."
+     )
+     parser.add_argument("--verbose", "-v", action="store_true")
+     if len(sys.argv) == 1:
+         parser.print_help()
+         sys.exit(1)
+     return parser.parse_args()
+
+
+ def make_qid_to_has_ans(dataset):
+     qid_to_has_ans = {}
+     for article in dataset:
+         for p in article["paragraphs"]:
+             for qa in p["qas"]:
+                 qid_to_has_ans[qa["id"]] = bool(qa["answers"]["text"])
+     return qid_to_has_ans
+
+
+ def normalize_answer(s):
+     """Lower text and remove punctuation, articles and extra whitespace."""
+
+     def remove_articles(text):
+         return ARTICLES_REGEX.sub(" ", text)
+
+     def white_space_fix(text):
+         return " ".join(text.split())
+
+     def remove_punc(text):
+         exclude = set(string.punctuation)
+         return "".join(ch for ch in text if ch not in exclude)
+
+     def lower(text):
+         return text.lower()
+
+     return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+
+ def get_tokens(s):
+     if not s:
+         return []
+     return normalize_answer(s).split()
+
+
+ def compute_exact(a_gold, a_pred):
+     return int(normalize_answer(a_gold) == normalize_answer(a_pred))
+
+
+ def compute_f1(a_gold, a_pred):
+     gold_toks = get_tokens(a_gold)
+     pred_toks = get_tokens(a_pred)
+     common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
+     num_same = sum(common.values())
+     if len(gold_toks) == 0 or len(pred_toks) == 0:
+         # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
+         return int(gold_toks == pred_toks)
+     if num_same == 0:
+         return 0
+     precision = 1.0 * num_same / len(pred_toks)
+     recall = 1.0 * num_same / len(gold_toks)
+     f1 = (2 * precision * recall) / (precision + recall)
+     return f1
+
+
+ def get_raw_scores(dataset, preds):
+     exact_scores = {}
+     f1_scores = {}
+     for article in dataset:
+         for p in article["paragraphs"]:
+             for qa in p["qas"]:
+                 qid = qa["id"]
+                 gold_answers = [t for t in qa["answers"]["text"] if normalize_answer(t)]
+                 if not gold_answers:
+                     # For unanswerable questions, only correct answer is empty string
+                     gold_answers = [""]
+                 if qid not in preds:
+                     print(f"Missing prediction for {qid}")
+                     continue
+                 a_pred = preds[qid]
+                 # Take max over all gold answers
+                 exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
+                 f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
+     return exact_scores, f1_scores
+
+
+ def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
+     new_scores = {}
+     for qid, s in scores.items():
+         pred_na = na_probs[qid] > na_prob_thresh
+         if pred_na:
+             new_scores[qid] = float(not qid_to_has_ans[qid])
+         else:
+             new_scores[qid] = s
+     return new_scores
+
+
+ def make_eval_dict(exact_scores, f1_scores, qid_list=None):
+     if not qid_list:
+         total = len(exact_scores)
+         return collections.OrderedDict(
+             [
+                 ("exact", 100.0 * sum(exact_scores.values()) / total),
+                 ("f1", 100.0 * sum(f1_scores.values()) / total),
+                 ("total", total),
+             ]
+         )
+     else:
+         total = len(qid_list)
+         return collections.OrderedDict(
+             [
+                 ("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total),
+                 ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
+                 ("total", total),
+             ]
+         )
+
+
+ def merge_eval(main_eval, new_eval, prefix):
+     for k in new_eval:
+         main_eval[f"{prefix}_{k}"] = new_eval[k]
+
+
+ def plot_pr_curve(precisions, recalls, out_image, title):
+     plt.step(recalls, precisions, color="b", alpha=0.2, where="post")
+     plt.fill_between(recalls, precisions, step="post", alpha=0.2, color="b")
+     plt.xlabel("Recall")
+     plt.ylabel("Precision")
+     plt.xlim([0.0, 1.05])
+     plt.ylim([0.0, 1.05])
+     plt.title(title)
+     plt.savefig(out_image)
+     plt.clf()
+
+
+ def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans, out_image=None, title=None):
+     qid_list = sorted(na_probs, key=lambda k: na_probs[k])
+     true_pos = 0.0
+     cur_p = 1.0
+     cur_r = 0.0
+     precisions = [1.0]
+     recalls = [0.0]
+     avg_prec = 0.0
+     for i, qid in enumerate(qid_list):
+         if qid_to_has_ans[qid]:
+             true_pos += scores[qid]
+         cur_p = true_pos / float(i + 1)
+         cur_r = true_pos / float(num_true_pos)
+         if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
+             # i.e., if we can put a threshold after this point
+             avg_prec += cur_p * (cur_r - recalls[-1])
+             precisions.append(cur_p)
+             recalls.append(cur_r)
+     if out_image:
+         plot_pr_curve(precisions, recalls, out_image, title)
+     return {"ap": 100.0 * avg_prec}
+
+
+ def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, out_image_dir):
+     if out_image_dir and not os.path.exists(out_image_dir):
+         os.makedirs(out_image_dir)
+     num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
+     if num_true_pos == 0:
+         return
+     pr_exact = make_precision_recall_eval(
+         exact_raw,
+         na_probs,
+         num_true_pos,
+         qid_to_has_ans,
+         out_image=os.path.join(out_image_dir, "pr_exact.png"),
+         title="Precision-Recall curve for Exact Match score",
+     )
+     pr_f1 = make_precision_recall_eval(
+         f1_raw,
+         na_probs,
+         num_true_pos,
+         qid_to_has_ans,
+         out_image=os.path.join(out_image_dir, "pr_f1.png"),
+         title="Precision-Recall curve for F1 score",
+     )
+     oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
+     pr_oracle = make_precision_recall_eval(
+         oracle_scores,
+         na_probs,
+         num_true_pos,
+         qid_to_has_ans,
+         out_image=os.path.join(out_image_dir, "pr_oracle.png"),
+         title="Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)",
+     )
+     merge_eval(main_eval, pr_exact, "pr_exact")
+     merge_eval(main_eval, pr_f1, "pr_f1")
+     merge_eval(main_eval, pr_oracle, "pr_oracle")
+
+
+ def histogram_na_prob(na_probs, qid_list, image_dir, name):
+     if not qid_list:
+         return
+     x = [na_probs[k] for k in qid_list]
+     weights = np.ones_like(x) / float(len(x))
+     plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
+     plt.xlabel("Model probability of no-answer")
+     plt.ylabel("Proportion of dataset")
+     plt.title(f"Histogram of no-answer probability: {name}")
+     plt.savefig(os.path.join(image_dir, f"na_prob_hist_{name}.png"))
+     plt.clf()
+
+
+ def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
+     num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
+     cur_score = num_no_ans
+     best_score = cur_score
+     best_thresh = 0.0
+     qid_list = sorted(na_probs, key=lambda k: na_probs[k])
+     for i, qid in enumerate(qid_list):
+         if qid not in scores:
+             continue
+         if qid_to_has_ans[qid]:
+             diff = scores[qid]
+         else:
+             if preds[qid]:
+                 diff = -1
+             else:
+                 diff = 0
+         cur_score += diff
+         if cur_score > best_score:
+             best_score = cur_score
+             best_thresh = na_probs[qid]
+     return 100.0 * best_score / len(scores), best_thresh
+
+
+ def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
+     best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
+     best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
+     main_eval["best_exact"] = best_exact
+     main_eval["best_exact_thresh"] = exact_thresh
+     main_eval["best_f1"] = best_f1
+     main_eval["best_f1_thresh"] = f1_thresh
+
+
+ def main():
+     with open(OPTS.data_file) as f:
+         dataset_json = json.load(f)
+         dataset = dataset_json["data"]
+     with open(OPTS.pred_file) as f:
+         preds = json.load(f)
+     if OPTS.na_prob_file:
+         with open(OPTS.na_prob_file) as f:
+             na_probs = json.load(f)
+     else:
+         na_probs = {k: 0.0 for k in preds}
+     qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
+     has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
+     no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
+     exact_raw, f1_raw = get_raw_scores(dataset, preds)
+     exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
+     f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
+     out_eval = make_eval_dict(exact_thresh, f1_thresh)
+     if has_ans_qids:
+         has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
+         merge_eval(out_eval, has_ans_eval, "HasAns")
+     if no_ans_qids:
+         no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
+         merge_eval(out_eval, no_ans_eval, "NoAns")
+     if OPTS.na_prob_file:
+         find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans)
+     if OPTS.na_prob_file and OPTS.out_image_dir:
+         run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, OPTS.out_image_dir)
+         histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, "hasAns")
+         histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, "noAns")
+     if OPTS.out_file:
+         with open(OPTS.out_file, "w") as f:
+             json.dump(out_eval, f)
+     else:
+         print(json.dumps(out_eval, indent=2))
+
+
+ if __name__ == "__main__":
+     OPTS = parse_args()
+     if OPTS.out_image_dir:
+         import matplotlib
+
+         matplotlib.use("Agg")
+         import matplotlib.pyplot as plt
+     main()
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
squad_v2.py ADDED
@@ -0,0 +1,137 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ SQuAD v2 metric. """
+
+ import datasets
+
+ import evaluate
+
+ from .compute_score import (
+     apply_no_ans_threshold,
+     find_all_best_thresh,
+     get_raw_scores,
+     make_eval_dict,
+     make_qid_to_has_ans,
+     merge_eval,
+ )
+
+
+ _CITATION = """\
+ @inproceedings{Rajpurkar2016SQuAD10,
+   title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
+   author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
+   booktitle={EMNLP},
+   year={2016}
+ }
+ """
+
+ _DESCRIPTION = """
+ This metric wraps the official scoring script for version 2 of the Stanford Question
+ Answering Dataset (SQuAD).
+
+ Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
+ crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
+ from the corresponding reading passage, or the question might be unanswerable.
+
+ SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions
+ written adversarially by crowdworkers to look similar to answerable ones.
+ To do well on SQuAD2.0, systems must not only answer questions when possible, but also
+ determine when no answer is supported by the paragraph and abstain from answering.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes SQuAD v2 scores (F1 and EM).
+ Args:
+     predictions: List of dictionaries for the question-answer pairs to score, with the following elements:
+         - the question-answer 'id' field as given in the references (see below)
+         - the text of the answer
+         - the probability that the question has no answer
+     references: List of question-answer dictionaries with the following key-value pairs:
+         - 'id': id of the question-answer pair (see above),
+         - 'answers': a list of Dict {'text': text of the answer as a string}
+     no_answer_threshold: float
+         Probability threshold to decide that a question has no answer.
+ Returns:
+     'exact': Exact match (the normalized answer exactly matches the gold answer)
+     'f1': The F-score of predicted tokens versus the gold answer
+     'total': Number of scores considered
+     'HasAns_exact': Exact match (the normalized answer exactly matches the gold answer)
+     'HasAns_f1': The F-score of predicted tokens versus the gold answer
+     'HasAns_total': Number of scores considered
+     'NoAns_exact': Exact match (the normalized answer exactly matches the gold answer)
+     'NoAns_f1': The F-score of predicted tokens versus the gold answer
+     'NoAns_total': Number of scores considered
+     'best_exact': Best exact match (with varying threshold)
+     'best_exact_thresh': No-answer probability threshold associated with the best exact match
+     'best_f1': Best F1 (with varying threshold)
+     'best_f1_thresh': No-answer probability threshold associated with the best F1
+ Examples:
+
+     >>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}]
+     >>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+     >>> squad_v2_metric = evaluate.load("squad_v2")
+     >>> results = squad_v2_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class SquadV2(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": {
+                         "id": datasets.Value("string"),
+                         "prediction_text": datasets.Value("string"),
+                         "no_answer_probability": datasets.Value("float32"),
+                     },
+                     "references": {
+                         "id": datasets.Value("string"),
+                         "answers": datasets.features.Sequence(
+                             {"text": datasets.Value("string"), "answer_start": datasets.Value("int32")}
+                         ),
+                     },
+                 }
+             ),
+             codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+             reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+         )
+
+     def _compute(self, predictions, references, no_answer_threshold=1.0):
+         no_answer_probabilities = {p["id"]: p["no_answer_probability"] for p in predictions}
+         dataset = [{"paragraphs": [{"qas": references}]}]
+         predictions = {p["id"]: p["prediction_text"] for p in predictions}
+
+         qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
+         has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
+         no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
+
+         exact_raw, f1_raw = get_raw_scores(dataset, predictions)
+         exact_thresh = apply_no_ans_threshold(exact_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
+         f1_thresh = apply_no_ans_threshold(f1_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
+         out_eval = make_eval_dict(exact_thresh, f1_thresh)
+
+         if has_ans_qids:
+             has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
+             merge_eval(out_eval, has_ans_eval, "HasAns")
+         if no_ans_qids:
+             no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
+             merge_eval(out_eval, no_ans_eval, "NoAns")
+         find_all_best_thresh(out_eval, predictions, exact_raw, f1_raw, no_answer_probabilities, qid_to_has_ans)
+         return dict(out_eval)