Commit 63e7fe5 (0 parents)
Committed by omidf and julien-c (HF staff)

Duplicate from evaluate-metric/squad

Co-authored-by: Julien Chaumond <julien-c@users.noreply.huggingface.co>

Files changed (6)
  1. .gitattributes +27 -0
  2. README.md +113 -0
  3. app.py +6 -0
  4. compute_score.py +92 -0
  5. requirements.txt +1 -0
  6. squad.py +111 -0
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ title: SQuAD
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
+ sdk: gradio
+ sdk_version: 3.0.2
+ app_file: app.py
+ pinned: false
+ tags:
+ - evaluate
+ - metric
+ description: >-
+   This metric wraps the official scoring script for version 1 of the Stanford
+   Question Answering Dataset (SQuAD).
+
+   Stanford Question Answering Dataset (SQuAD) is a reading comprehension
+   dataset, consisting of questions posed by crowdworkers on a set of Wikipedia
+   articles, where the answer to every question is a segment of text, or span,
+   from the corresponding reading passage, or the question might be unanswerable.
+ duplicated_from: evaluate-metric/squad
+ ---
+
+ # Metric Card for SQuAD
+
+ ## Metric description
+ This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
+
+ SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
+
+ ## How to use
+
+ The metric takes two files or two lists of question-answer dictionaries as inputs: one with the model's predictions and the other with the references they are compared against:
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ results = squad_metric.compute(predictions=predictions, references=references)
+ ```
+ ## Output values
+
+ This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).
+
+ ```
+ {'exact_match': 100.0, 'f1': 100.0}
+ ```
+
+ The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
+
+ The range of `f1` is also 0-100: its lowest possible value is 0.0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
+
+ ### Values from popular papers
+ The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0%. It also reports that human performance on the dataset corresponds to an F1 score of 90.5% and an Exact Match score of 80.3%.
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
+
+ ## Examples
+
+ Maximal values for both exact match and F1 (perfect match):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 100.0, 'f1': 100.0}
+ ```
+
+ Minimal values for both exact match and F1 (no match):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 0.0, 'f1': 0.0}
+ ```
+
+ Partial match (2 out of 3 answers correct):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
+ ```
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [SQuAD v1 dataset](https://huggingface.co/datasets/squad).
+
+ The SQuAD dataset contains a certain amount of noise, such as duplicate questions and missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflects whether models do better on certain types of questions (e.g. "who" questions) or on those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
+
+
+ ## Citation
+
+ @inproceedings{Rajpurkar2016SQuAD10,
+   title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
+   author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
+   booktitle={EMNLP},
+   year={2016}
+ }
+
+ ## Further References
+
+ - [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
+ - [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)
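Note on the reported numbers: the `exact_match` and `f1` values above are averages of per-question scores. As a minimal, self-contained sketch of how a single per-question F1 falls out of the token-overlap formula -- it mirrors `normalize_answer` and `f1_score` from `compute_score.py` below, and the example strings are purely illustrative, not taken from the card:

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace (as in compute_score.py)."""
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_score(prediction, ground_truth):
    """Token-overlap F1 between a predicted span and one gold span."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings: 2 shared tokens, precision 2/2, recall 2/4, so F1 = 2/3.
print(f1_score("Denver Broncos", "the Denver Broncos of the NFL"))  # 0.666...
```

The scoring script then averages these per-question scores over all questions and multiplies by 100, which is why the card reports `f1` on a 0-100 scale.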
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("squad")
+ launch_gradio_widget(module)
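The Gradio widget launched here essentially exposes the same `compute` call through a web form; calling the loaded module directly gives the same result. A small sketch reusing the example inputs from the metric card above (it assumes the `evaluate` package is installed and the `squad` metric can be fetched from the Hub):

```python
import evaluate

# Same module that app.py hands to the Gradio widget.
squad_metric = evaluate.load("squad")

# Example inputs from the metric card above.
predictions = [{"prediction_text": "1976", "id": "56e10a3be3433e1400422b22"}]
references = [{"answers": {"answer_start": [97], "text": ["1976"]}, "id": "56e10a3be3433e1400422b22"}]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```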
compute_score.py ADDED
@@ -0,0 +1,92 @@
+ """ Official evaluation script for v1.1 of the SQuAD dataset. """
+
+ import argparse
+ import json
+ import re
+ import string
+ import sys
+ from collections import Counter
+
+
+ def normalize_answer(s):
+     """Lower text and remove punctuation, articles and extra whitespace."""
+
+     def remove_articles(text):
+         return re.sub(r"\b(a|an|the)\b", " ", text)
+
+     def white_space_fix(text):
+         return " ".join(text.split())
+
+     def remove_punc(text):
+         exclude = set(string.punctuation)
+         return "".join(ch for ch in text if ch not in exclude)
+
+     def lower(text):
+         return text.lower()
+
+     return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+
+ def f1_score(prediction, ground_truth):
+     prediction_tokens = normalize_answer(prediction).split()
+     ground_truth_tokens = normalize_answer(ground_truth).split()
+     common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
+     num_same = sum(common.values())
+     if num_same == 0:
+         return 0
+     precision = 1.0 * num_same / len(prediction_tokens)
+     recall = 1.0 * num_same / len(ground_truth_tokens)
+     f1 = (2 * precision * recall) / (precision + recall)
+     return f1
+
+
+ def exact_match_score(prediction, ground_truth):
+     return normalize_answer(prediction) == normalize_answer(ground_truth)
+
+
+ def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
+     scores_for_ground_truths = []
+     for ground_truth in ground_truths:
+         score = metric_fn(prediction, ground_truth)
+         scores_for_ground_truths.append(score)
+     return max(scores_for_ground_truths)
+
+
+ def compute_score(dataset, predictions):
+     f1 = exact_match = total = 0
+     for article in dataset:
+         for paragraph in article["paragraphs"]:
+             for qa in paragraph["qas"]:
+                 total += 1
+                 if qa["id"] not in predictions:
+                     message = "Unanswered question " + qa["id"] + " will receive score 0."
+                     print(message, file=sys.stderr)
+                     continue
+                 ground_truths = list(map(lambda x: x["text"], qa["answers"]))
+                 prediction = predictions[qa["id"]]
+                 exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
+                 f1 += metric_max_over_ground_truths(f1_score, prediction, ground_truths)
+
+     exact_match = 100.0 * exact_match / total
+     f1 = 100.0 * f1 / total
+
+     return {"exact_match": exact_match, "f1": f1}
+
+
+ if __name__ == "__main__":
+     expected_version = "1.1"
+     parser = argparse.ArgumentParser(description="Evaluation for SQuAD " + expected_version)
+     parser.add_argument("dataset_file", help="Dataset file")
+     parser.add_argument("prediction_file", help="Prediction File")
+     args = parser.parse_args()
+     with open(args.dataset_file) as dataset_file:
+         dataset_json = json.load(dataset_file)
+         if dataset_json["version"] != expected_version:
+             print(
+                 "Evaluation expects v-" + expected_version + ", but got dataset with v-" + dataset_json["version"],
+                 file=sys.stderr,
+             )
+         dataset = dataset_json["data"]
+     with open(args.prediction_file) as prediction_file:
+         predictions = json.load(prediction_file)
+     print(json.dumps(compute_score(dataset, predictions)))
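Besides the command-line entry point (`python compute_score.py <dataset_file> <prediction_file>`, where the first argument is a SQuAD v1.1 JSON file and the second a predictions JSON keyed by question id), `compute_score` can be called directly. A minimal sketch with hypothetical toy data shaped like the `data` list of a SQuAD v1.1 file, assuming it is run next to `compute_score.py` so the import resolves:

```python
from compute_score import compute_score

# Hypothetical toy example: one article, one paragraph, one question with two
# acceptable gold answers; predictions map question ids to predicted strings.
dataset = [
    {
        "paragraphs": [
            {
                "qas": [
                    {"id": "q1", "answers": [{"text": "1976"}, {"text": "the year 1976"}]},
                ]
            }
        ]
    }
]
predictions = {"q1": "1976"}

print(compute_score(dataset, predictions))
# {'exact_match': 100.0, 'f1': 100.0}  (the best-matching gold answer is an exact match)
```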
requirements.txt ADDED
@@ -0,0 +1 @@
+ git+https://github.com/huggingface/evaluate@6abb0d53b82b1e5efea5d683b91d7990a653c78d
squad.py ADDED
@@ -0,0 +1,111 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ SQuAD metric. """
+
+ import datasets
+
+ import evaluate
+
+ from .compute_score import compute_score
+
+
+ _CITATION = """\
+ @inproceedings{Rajpurkar2016SQuAD10,
+   title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
+   author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
+   booktitle={EMNLP},
+   year={2016}
+ }
+ """
+
+ _DESCRIPTION = """
+ This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).
+
+ Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
+ crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
+ from the corresponding reading passage, or the question might be unanswerable.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes SQuAD scores (F1 and EM).
+ Args:
+     predictions: List of question-answer dictionaries with the following key-values:
+         - 'id': id of the question-answer pair as given in the references (see below)
+         - 'prediction_text': the text of the answer
+     references: List of question-answer dictionaries with the following key-values:
+         - 'id': id of the question-answer pair (see above),
+         - 'answers': a Dict in the SQuAD dataset format
+             {
+                 'text': list of possible texts for the answer, as a list of strings
+                 'answer_start': list of start positions for the answer, as a list of ints
+             }
+             Note that answer_start values are not taken into account to compute the metric.
+ Returns:
+     'exact_match': Exact match (the normalized answer exactly matches the gold answer)
+     'f1': The F-score of predicted tokens versus the gold answer
+ Examples:
+
+     >>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
+     >>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+     >>> squad_metric = evaluate.load("squad")
+     >>> results = squad_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'exact_match': 100.0, 'f1': 100.0}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Squad(evaluate.Metric):
+     def _info(self):
+         return evaluate.MetricInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": {"id": datasets.Value("string"), "prediction_text": datasets.Value("string")},
+                     "references": {
+                         "id": datasets.Value("string"),
+                         "answers": datasets.features.Sequence(
+                             {
+                                 "text": datasets.Value("string"),
+                                 "answer_start": datasets.Value("int32"),
+                             }
+                         ),
+                     },
+                 }
+             ),
+             codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+             reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+         )
+
+     def _compute(self, predictions, references):
+         pred_dict = {prediction["id"]: prediction["prediction_text"] for prediction in predictions}
+         dataset = [
+             {
+                 "paragraphs": [
+                     {
+                         "qas": [
+                             {
+                                 "answers": [{"text": answer_text} for answer_text in ref["answers"]["text"]],
+                                 "id": ref["id"],
+                             }
+                             for ref in references
+                         ]
+                     }
+                 ]
+             }
+         ]
+         score = compute_score(dataset=dataset, predictions=pred_dict)
+         return score
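`_compute` repackages the references into the nested article/paragraph/qas structure that `compute_score` expects, and each question is scored against its best-matching gold answer via `metric_max_over_ground_truths`; the `answer_start` offsets are carried in the features but ignored by the score. A hedged sketch of that behavior, with hypothetical ids and character offsets:

```python
import evaluate

squad_metric = evaluate.load("squad")

# Hypothetical id and offsets; a reference may list several acceptable gold strings.
predictions = [{"prediction_text": "Denver Broncos", "id": "q1"}]
references = [
    {
        "id": "q1",
        "answers": {"text": ["Denver Broncos", "Broncos"], "answer_start": [177, 184]},
    }
]

# The prediction only partially overlaps "Broncos" but exactly matches the first
# gold string, and the per-question score takes the maximum over the gold answers.
print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```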