lvwerra HF staff committed on
Commit
22a9bb8
1 Parent(s): 2831791

Update Space (evaluate main: 828c6327)

Files changed (5)
  1. README.md +119 -5
  2. app.py +6 -0
  3. compute_score.py +205 -0
  4. cuad.py +117 -0
  5. requirements.txt +3 -0
README.md CHANGED
@@ -1,12 +1,126 @@
1
  ---
2
- title: Cuad
3
- emoji: 🌍
4
- colorFrom: purple
5
- colorTo: pink
6
  sdk: gradio
7
  sdk_version: 3.0.2
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
1
  ---
2
+ title: CUAD
3
+ emoji: 🤗
4
+ colorFrom: blue
5
+ colorTo: red
6
  sdk: gradio
7
  sdk_version: 3.0.2
8
  app_file: app.py
9
  pinned: false
10
+ tags:
11
+ - evaluate
12
+ - metric
13
  ---
14
 
15
+ # Metric Card for CUAD
16
+
17
+ ## Metric description
18
+
19
+ This metric wraps the official scoring script for version 1 of the [Contract Understanding Atticus Dataset (CUAD)](https://huggingface.co/datasets/cuad), which is a corpus of more than 13,000 labels in 510 commercial legal contracts that have been manually labeled to identify 41 categories of important clauses that lawyers look for when reviewing contracts in connection with corporate transactions.
20
+
21
+ The CUAD metric computes several scores: [Exact Match](https://huggingface.co/metrics/exact_match), [F1 score](https://huggingface.co/metrics/f1), Area Under the Precision-Recall Curve, [Precision](https://huggingface.co/metrics/precision) at 80% [recall](https://huggingface.co/metrics/recall) and Precision at 90% recall.
22
+
23
+ ## How to use
24
+
25
+ The CUAD metric takes two inputs:
26
+
27
+
28
+ `predictions`: a list of question-answer dictionaries with the following key-values:
29
+ - `id`: the id of the question-answer pair as given in the references.
30
+ - `prediction_text`: a list of candidate answer texts, as a list of strings. Which candidates are included depends on a threshold applied to the confidence probability of each prediction.
31
+
32
+
33
+ `references`: a list of question-answer dictionaries with the following key-values:
34
+ - `id`: the id of the question-answer pair (the same as above).
35
+ - `answers`: a dictionary *in the CUAD dataset format* with the following keys:
36
+ - `text`: a list of possible texts for the answer, as a list of strings.
37
+ - `answer_start`: a list of start positions for the answer, as a list of ints.
38
+
39
+ Note that `answer_start` values are not taken into account to compute the metric.
40
+
41
+ ```python
42
+ from evaluate import load
43
+ cuad_metric = load("cuad")
44
+ predictions = [{'prediction_text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
45
+ references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
46
+ results = cuad_metric.compute(predictions=predictions, references=references)
47
+ ```
48
+ ## Output values
49
+
50
+ The output of the CUAD metric is a dictionary that contains the following metrics:
51
+
52
+ `exact_match`: The percentage of examples for which a normalized predicted answer exactly matches a reference answer, with a range between 0.0 and 100.0 (see [exact match](https://huggingface.co/metrics/exact_match) for more information).
53
+
54
+ `f1`: The harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information), reported as a percentage. Its range is between 0.0 and 100.0: its lowest possible value is 0.0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
55
+
56
+ `aupr`: The Area Under the Precision-Recall curve, with a range between 0.0 and 1.0, with a higher value representing both high recall and high precision, and a low value representing low values for both. See the [Wikipedia article](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) for more information.
57
+
58
+ `prec_at_80_recall`: The fraction of true examples among the predicted examples at a recall rate of 80%. Its range is between 0.0 and 1.0. For more information, see [precision](https://huggingface.co/metrics/precision) and [recall](https://huggingface.co/metrics/recall).
59
+
60
+ `prec_at_90_recall`: The fraction of true examples among the predicted examples at a recall rate of 90%. Its range is between 0.0 and 1.0.
61
+
62
+
63
+ ### Values from popular papers
64
+ The [original CUAD paper](https://arxiv.org/pdf/2103.06268.pdf) reports that a [DeBERTa model](https://huggingface.co/microsoft/deberta-base) attains
65
+ an AUPR of 47.8%, a Precision at 80% Recall of 44.0%, and a Precision at 90% Recall of 17.8% (they do not report F1 or Exact Match separately).
66
+
67
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/cuad).
68
+
69
+ ## Examples
70
+
71
+ Maximal values:
72
+
73
+ ```python
74
+ from evaluate import load
75
+ cuad_metric = load("cuad")
76
+ predictions = [{'prediction_text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
77
+ references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
78
+ results = cuad_metric.compute(predictions=predictions, references=references)
79
+ print(results)
80
+ {'exact_match': 100.0, 'f1': 100.0, 'aupr': 0.0, 'prec_at_80_recall': 1.0, 'prec_at_90_recall': 1.0}
81
+ ```
82
+
83
+ Minimal values:
84
+
85
+ ```python
86
+ from evaluate import load
87
+ cuad_metric = load("cuad")
88
+ predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.'], 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}]
89
+ references = [{'answers': {'answer_start': [143], 'text': ['The seller']}, 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}]
90
+ results = cuad_metric.compute(predictions=predictions, references=references)
91
+ print(results)
92
+ {'exact_match': 0.0, 'f1': 0.0, 'aupr': 0.0, 'prec_at_80_recall': 0, 'prec_at_90_recall': 0}
93
+ ```
94
+
95
+ Partial match:
96
+
97
+ ```python
98
+ from evaluate import load
99
+ cuad_metric = load("cuad")
100
+ references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
101
+ predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
102
+ results = cuad_metric.compute(predictions=predictions, references=references)
103
+ print(results)
104
+ {'exact_match': 100.0, 'f1': 50.0, 'aupr': 0.0, 'prec_at_80_recall': 0, 'prec_at_90_recall': 0}
105
+ ```
106
+
107
+ ## Limitations and bias
108
+ This metric works only with datasets that have the same format as the [CUAD dataset](https://huggingface.co/datasets/cuad). The limitations and biases of this dataset are not discussed, but it could exhibit annotation bias given the homogeneity of its annotators.
109
+
110
+ As for the metric itself, the reliability of AUPR has been debated: its estimates are quite noisy, and reducing the Precision-Recall curve to a single number obscures the fact that the curve describes tradeoffs between the different systems or operating points plotted, not the performance of an individual system. Reporting the F1 and exact match scores alongside it therefore gives a more complete picture of system performance.
111
+
112
+
113
+ ## Citation
114
+
115
+ ```bibtex
116
+ @article{hendrycks2021cuad,
117
+ title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
118
+ author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
119
+ journal={arXiv preprint arXiv:2103.06268},
120
+ year={2021}
121
+ }
122
+ ```
123
+
124
+ ## Further References
125
+
126
+ - [CUAD dataset homepage](https://www.atticusprojectai.org/cuad-v1-performance-announcements)
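The per-example arithmetic behind the partial-match example in the README above can be sketched as follows. This is only an illustration: `f1_percent` is a hypothetical helper, not part of the `evaluate` API, and the counts (one true positive, one false positive, one false negative) are read off that example by hand. Unlike the scoring script, it returns 0.0 instead of propagating NaN when there are no true positives.

```python
def f1_percent(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, on a 0-100 scale.

    Hypothetical helper for illustration; simplified so that the
    no-true-positive case returns 0.0 rather than NaN.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# One reference matched, one unmatched prediction, one unmatched reference.
print(f1_percent(tp=1, fp=1, fn=1))  # 50.0
```

The 0-100 scale matches the `f1` and `exact_match` values printed in the examples, whereas `aupr` and the precision-at-recall values stay on a 0-1 scale.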
app.py ADDED
@@ -0,0 +1,6 @@
1
+ import evaluate
2
+ from evaluate.utils import launch_gradio_widget
3
+
4
+
5
+ module = evaluate.load("cuad")
6
+ launch_gradio_widget(module)
compute_score.py ADDED
@@ -0,0 +1,205 @@
1
+ """ Official evaluation script for CUAD dataset. """
2
+
3
+ import argparse
4
+ import json
5
+ import re
6
+ import string
7
+ import sys
8
+
9
+ import numpy as np
10
+
11
+
12
+ IOU_THRESH = 0.5
13
+
14
+
15
+ def get_jaccard(prediction, ground_truth):
16
+ remove_tokens = [".", ",", ";", ":"]
17
+
18
+ for token in remove_tokens:
19
+ ground_truth = ground_truth.replace(token, "")
20
+ prediction = prediction.replace(token, "")
21
+
22
+ ground_truth, prediction = ground_truth.lower(), prediction.lower()
23
+ ground_truth, prediction = ground_truth.replace("/", " "), prediction.replace("/", " ")
24
+ ground_truth, prediction = set(ground_truth.split(" ")), set(prediction.split(" "))
25
+
26
+ intersection = ground_truth.intersection(prediction)
27
+ union = ground_truth.union(prediction)
28
+ jaccard = len(intersection) / len(union)
29
+ return jaccard
30
+
31
+
32
+ def normalize_answer(s):
33
+ """Lower text and remove punctuation, articles and extra whitespace."""
34
+
35
+ def remove_articles(text):
36
+ return re.sub(r"\b(a|an|the)\b", " ", text)
37
+
38
+ def white_space_fix(text):
39
+ return " ".join(text.split())
40
+
41
+ def remove_punc(text):
42
+ exclude = set(string.punctuation)
43
+ return "".join(ch for ch in text if ch not in exclude)
44
+
45
+ def lower(text):
46
+ return text.lower()
47
+
48
+ return white_space_fix(remove_articles(remove_punc(lower(s))))
49
+
50
+
51
+ def compute_precision_recall(predictions, ground_truths, qa_id):
52
+ tp, fp, fn = 0, 0, 0
53
+
54
+ substr_ok = "Parties" in qa_id
55
+
56
+ # first check if ground truth is empty
57
+ if len(ground_truths) == 0:
58
+ if len(predictions) > 0:
59
+ fp += len(predictions) # false positive for each one
60
+ else:
61
+ for ground_truth in ground_truths:
62
+ assert len(ground_truth) > 0
63
+ # check if there is a match
64
+ match_found = False
65
+ for pred in predictions:
66
+ if substr_ok:
67
+ is_match = get_jaccard(pred, ground_truth) >= IOU_THRESH or ground_truth in pred
68
+ else:
69
+ is_match = get_jaccard(pred, ground_truth) >= IOU_THRESH
70
+ if is_match:
71
+ match_found = True
72
+
73
+ if match_found:
74
+ tp += 1
75
+ else:
76
+ fn += 1
77
+
78
+ # now also get any fps by looping through preds
79
+ for pred in predictions:
80
+ # Check if there's a match. if so, don't count (don't want to double count based on the above)
81
+ # but if there's no match, then this is a false positive.
82
+ # (Note: we get the true positives in the above loop instead of this loop so that we don't double count
83
+ # multiple predictions that are matched with the same answer.)
84
+ match_found = False
85
+ for ground_truth in ground_truths:
86
+ assert len(ground_truth) > 0
87
+ if substr_ok:
88
+ is_match = get_jaccard(pred, ground_truth) >= IOU_THRESH or ground_truth in pred
89
+ else:
90
+ is_match = get_jaccard(pred, ground_truth) >= IOU_THRESH
91
+ if is_match:
92
+ match_found = True
93
+
94
+ if not match_found:
95
+ fp += 1
96
+
97
+ precision = tp / (tp + fp) if tp + fp > 0 else np.nan
98
+ recall = tp / (tp + fn) if tp + fn > 0 else np.nan
99
+
100
+ return precision, recall
101
+
102
+
103
+ def process_precisions(precisions):
104
+ """
105
+ Processes precisions to ensure that precision and recall don't both get worse.
106
+ Assumes the list precision is sorted in order of recalls
107
+ """
108
+ precision_best = precisions[::-1]
109
+ for i in range(1, len(precision_best)):
110
+ precision_best[i] = max(precision_best[i - 1], precision_best[i])
111
+ precisions = precision_best[::-1]
112
+ return precisions
113
+
114
+
115
+ def get_aupr(precisions, recalls):
116
+ processed_precisions = process_precisions(precisions)
117
+ aupr = np.trapz(processed_precisions, recalls)
118
+ if np.isnan(aupr):
119
+ return 0
120
+ return aupr
121
+
122
+
123
+ def get_prec_at_recall(precisions, recalls, recall_thresh):
124
+ """Assumes recalls are sorted in increasing order"""
125
+ processed_precisions = process_precisions(precisions)
126
+ prec_at_recall = 0
127
+ for prec, recall in zip(processed_precisions, recalls):
128
+ if recall >= recall_thresh:
129
+ prec_at_recall = prec
130
+ break
131
+ return prec_at_recall
132
+
133
+
134
+ def exact_match_score(prediction, ground_truth):
135
+ return normalize_answer(prediction) == normalize_answer(ground_truth)
136
+
137
+
138
+ def metric_max_over_ground_truths(metric_fn, predictions, ground_truths):
139
+ score = 0
140
+ for pred in predictions:
141
+ for ground_truth in ground_truths:
142
+ score = metric_fn(pred, ground_truth)
143
+ if score == 1: # break the loop when one prediction matches the ground truth
144
+ break
145
+ if score == 1:
146
+ break
147
+ return score
148
+
149
+
150
+ def compute_score(dataset, predictions):
151
+ f1 = exact_match = total = 0
152
+ precisions = []
153
+ recalls = []
154
+ for article in dataset:
155
+ for paragraph in article["paragraphs"]:
156
+ for qa in paragraph["qas"]:
157
+ total += 1
158
+ if qa["id"] not in predictions:
159
+ message = "Unanswered question " + qa["id"] + " will receive score 0."
160
+ print(message, file=sys.stderr)
161
+ continue
162
+ ground_truths = list(map(lambda x: x["text"], qa["answers"]))
163
+ prediction = predictions[qa["id"]]
164
+ precision, recall = compute_precision_recall(prediction, ground_truths, qa["id"])
165
+
166
+ precisions.append(precision)
167
+ recalls.append(recall)
168
+
169
+ if precision == 0 and recall == 0:
170
+ f1 += 0
171
+ else:
172
+ f1 += 2 * (precision * recall) / (precision + recall)
173
+
174
+ exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
175
+
176
+ precisions = [x for _, x in sorted(zip(recalls, precisions))]
177
+ recalls.sort()
178
+
179
+ f1 = 100.0 * f1 / total
180
+ exact_match = 100.0 * exact_match / total
181
+ aupr = get_aupr(precisions, recalls)
182
+
183
+ prec_at_90_recall = get_prec_at_recall(precisions, recalls, recall_thresh=0.9)
184
+ prec_at_80_recall = get_prec_at_recall(precisions, recalls, recall_thresh=0.8)
185
+
186
+ return {
187
+ "exact_match": exact_match,
188
+ "f1": f1,
189
+ "aupr": aupr,
190
+ "prec_at_80_recall": prec_at_80_recall,
191
+ "prec_at_90_recall": prec_at_90_recall,
192
+ }
193
+
194
+
195
+ if __name__ == "__main__":
196
+ parser = argparse.ArgumentParser(description="Evaluation for CUAD")
197
+ parser.add_argument("dataset_file", help="Dataset file")
198
+ parser.add_argument("prediction_file", help="Prediction File")
199
+ args = parser.parse_args()
200
+ with open(args.dataset_file) as dataset_file:
201
+ dataset_json = json.load(dataset_file)
202
+ dataset = dataset_json["data"]
203
+ with open(args.prediction_file) as prediction_file:
204
+ predictions = json.load(prediction_file)
205
+ print(json.dumps(compute_score(dataset, predictions)))
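The precision post-processing in `process_precisions` and `get_prec_at_recall` above may be easier to follow on a toy curve. The sketch below is a standalone re-statement under the same assumption the script makes (points already sorted by increasing recall); the names `monotone_precisions` and `precision_at_recall` and the sample numbers are invented for illustration.

```python
def monotone_precisions(precisions):
    """Replace each precision by the best precision at equal or higher recall.

    Assumes the list is ordered by increasing recall.
    """
    best = list(precisions)
    # Sweep from the high-recall end, carrying the running maximum backwards.
    for i in range(len(best) - 2, -1, -1):
        best[i] = max(best[i], best[i + 1])
    return best


def precision_at_recall(precisions, recalls, recall_thresh):
    """Return the smoothed precision at the first point reaching the threshold."""
    for prec, rec in zip(monotone_precisions(precisions), recalls):
        if rec >= recall_thresh:
            return prec
    return 0  # the threshold is never reached


recalls = [0.2, 0.5, 0.8, 1.0]      # toy values, sorted ascending
precisions = [0.9, 0.4, 0.6, 0.3]   # toy values

print(monotone_precisions(precisions))                 # [0.9, 0.6, 0.6, 0.3]
print(precision_at_recall(precisions, recalls, 0.8))   # 0.6
print(precision_at_recall(precisions, recalls, 0.9))   # 0.3
```

The dip to 0.4 at recall 0.5 is lifted to 0.6 because a later point achieves higher recall with better precision, which is what the docstring means by precision and recall not both getting worse.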
cuad.py ADDED
@@ -0,0 +1,117 @@
1
+ # Copyright 2020 The HuggingFace Evaluate Authors.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """ CUAD metric. """
15
+
16
+ import datasets
17
+
18
+ import evaluate
19
+
20
+ from .compute_score import compute_score
21
+
22
+
23
+ _CITATION = """\
24
+ @article{hendrycks2021cuad,
25
+ title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
26
+ author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
27
+ journal={arXiv preprint arXiv:2103.06268},
28
+ year={2021}
29
+ }
30
+ """
31
+
32
+ _DESCRIPTION = """
33
+ This metric wraps the official scoring script for version 1 of the Contract
34
+ Understanding Atticus Dataset (CUAD).
35
+ Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of more than 13,000 labels in 510
36
+ commercial legal contracts that have been manually labeled to identify 41 categories of important
37
+ clauses that lawyers look for when reviewing contracts in connection with corporate transactions.
38
+ """
39
+
40
+ _KWARGS_DESCRIPTION = """
41
+ Computes CUAD scores (EM, F1, AUPR, Precision@80%Recall, and Precision@90%Recall).
42
+ Args:
43
+ predictions: List of question-answers dictionaries with the following key-values:
44
+ - 'id': id of the question-answer pair as given in the references (see below)
45
+ - 'prediction_text': list of possible texts for the answer, as a list of strings
46
+ depending on a threshold on the confidence probability of each prediction.
47
+ references: List of question-answers dictionaries with the following key-values:
48
+ - 'id': id of the question-answer pair (see above),
49
+ - 'answers': a Dict in the CUAD dataset format
50
+ {
51
+ 'text': list of possible texts for the answer, as a list of strings
52
+ 'answer_start': list of start positions for the answer, as a list of ints
53
+ }
54
+ Note that answer_start values are not taken into account to compute the metric.
55
+ Returns:
56
+ 'exact_match': Exact match (the normalized answer exactly matches the gold answer)
57
+ 'f1': The F-score of predicted tokens versus the gold answer
58
+ 'aupr': Area Under the Precision-Recall curve
59
+ 'prec_at_80_recall': Precision at 80% recall
60
+ 'prec_at_90_recall': Precision at 90% recall
61
+ Examples:
62
+ >>> predictions = [{'prediction_text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
63
+ >>> references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
64
+ >>> cuad_metric = evaluate.load("cuad")
65
+ >>> results = cuad_metric.compute(predictions=predictions, references=references)
66
+ >>> print(results)
67
+ {'exact_match': 100.0, 'f1': 100.0, 'aupr': 0.0, 'prec_at_80_recall': 1.0, 'prec_at_90_recall': 1.0}
68
+ """
69
+
70
+
71
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
72
+ class CUAD(evaluate.EvaluationModule):
73
+ def _info(self):
74
+ return evaluate.EvaluationModuleInfo(
75
+ description=_DESCRIPTION,
76
+ citation=_CITATION,
77
+ inputs_description=_KWARGS_DESCRIPTION,
78
+ features=datasets.Features(
79
+ {
80
+ "predictions": {
81
+ "id": datasets.Value("string"),
82
+ "prediction_text": datasets.features.Sequence(datasets.Value("string")),
83
+ },
84
+ "references": {
85
+ "id": datasets.Value("string"),
86
+ "answers": datasets.features.Sequence(
87
+ {
88
+ "text": datasets.Value("string"),
89
+ "answer_start": datasets.Value("int32"),
90
+ }
91
+ ),
92
+ },
93
+ }
94
+ ),
95
+ codebase_urls=["https://www.atticusprojectai.org/cuad"],
96
+ reference_urls=["https://www.atticusprojectai.org/cuad"],
97
+ )
98
+
99
+ def _compute(self, predictions, references):
100
+ pred_dict = {prediction["id"]: prediction["prediction_text"] for prediction in predictions}
101
+ dataset = [
102
+ {
103
+ "paragraphs": [
104
+ {
105
+ "qas": [
106
+ {
107
+ "answers": [{"text": answer_text} for answer_text in ref["answers"]["text"]],
108
+ "id": ref["id"],
109
+ }
110
+ for ref in references
111
+ ]
112
+ }
113
+ ]
114
+ }
115
+ ]
116
+ score = compute_score(dataset=dataset, predictions=pred_dict)
117
+ return score
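The reshaping that `_compute` performs before calling `compute_score` can be tried in isolation. The sketch below assumes plain Python dictionaries as input and uses a hypothetical function name, `to_squad_style`; it mirrors the nested comprehension above but is not part of the module.

```python
def to_squad_style(references):
    """Wrap flat CUAD-format references in one article with one paragraph."""
    qas = [
        {"id": ref["id"], "answers": [{"text": text} for text in ref["answers"]["text"]]}
        for ref in references
    ]
    return [{"paragraphs": [{"qas": qas}]}]


references = [{"id": "q1", "answers": {"text": ["The seller:"], "answer_start": [143]}}]
dataset = to_squad_style(references)
print(dataset[0]["paragraphs"][0]["qas"][0]["answers"])  # [{'text': 'The seller:'}]

# Pitfall: iterating over a bare string yields one "answer" per character.
print([{"text": t} for t in "ab"])  # [{'text': 'a'}, {'text': 'b'}]
```

Because the comprehension iterates over `ref["answers"]["text"]`, that field is best passed as a list of strings; `answer_start` is dropped entirely, consistent with the note that start positions do not affect the metric.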
requirements.txt ADDED
@@ -0,0 +1,3 @@
1
+ # TODO: fix github to release
2
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
3
+ datasets~=2.0