lvwerra HF staff committed on
Commit 2b55a7c · 1 Parent(s): 0718045

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +128 -5
  2. app.py +6 -0
  3. requirements.txt +6 -0
  4. rouge.py +131 -0
README.md CHANGED
@@ -1,12 +1,135 @@
  ---
- title: Rouge
- emoji: 👁
- colorFrom: indigo
- colorTo: green
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: ROUGE
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for ROUGE
+
+ ## Metric Description
+ ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
+
+ Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
+
+ This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge).
+
+ ## How to Use
+ At minimum, this metric takes as input a list of predictions and a list of references:
+ ```python
+ >>> rouge = evaluate.load('rouge')
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> results = rouge.compute(predictions=predictions,
+ ...                         references=references)
+ >>> print(list(results.keys()))
+ ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
+ >>> print(results["rouge1"])
+ AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
+ >>> print(results["rouge1"].mid.fmeasure)
+ 1.0
+ ```
+
+ ### Inputs
+ - **predictions** (`list`): list of predictions to score. Each prediction
+     should be a string with tokens separated by spaces.
+ - **references** (`list`): list of references, one for each prediction. Each
+     reference should be a string with tokens separated by spaces.
+ - **rouge_types** (`list`): A list of rouge types to calculate. Defaults to `['rouge1', 'rouge2', 'rougeL', 'rougeLsum']`.
+     - Valid rouge types:
+         - `"rouge1"`: unigram (1-gram) based scoring
+         - `"rouge2"`: bigram (2-gram) based scoring
+         - `"rougeL"`: Longest common subsequence based scoring.
+         - `"rougeLsum"`: like `"rougeL"`, but splits the text into sentences on `"\n"` before scoring
+         - See [here](https://github.com/huggingface/datasets/issues/617) for more information
+ - **use_aggregator** (`boolean`): If `True`, returns aggregates. Defaults to `True`.
+ - **use_stemmer** (`boolean`): If `True`, uses Porter stemmer to strip word suffixes. Defaults to `False`.
+
+ ### Output Values
+ The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of Score objects, with one score for each prediction-reference pair. Each Score object includes the `precision`, `recall`, and `fmeasure`. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output is:
+
+ ```python
+ {'rouge1': [Score(precision=1.0, recall=0.5, fmeasure=0.6666666666666666), Score(precision=1.0, recall=1.0, fmeasure=1.0)], 'rouge2': [Score(precision=0.0, recall=0.0, fmeasure=0.0), Score(precision=1.0, recall=1.0, fmeasure=1.0)]}
+ ```
+
+ If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
+ ```python
+ {'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)), 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
+ ```
+
+ The `precision`, `recall`, and `fmeasure` values all have a range of 0 to 1.
+
+
+ #### Values from Popular Papers
+
+
+ ### Examples
+ An example without aggregation:
+ ```python
+ >>> rouge = evaluate.load('rouge')
+ >>> predictions = ["hello goodbye", "ankh morpork"]
+ >>> references = ["goodbye", "general kenobi"]
+ >>> results = rouge.compute(predictions=predictions,
+ ...                         references=references,
+ ...                         use_aggregator=False)
+ >>> print(list(results.keys()))
+ ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
+ >>> print(results["rouge1"])
+ [Score(precision=0.5, recall=1.0, fmeasure=0.6666666666666666), Score(precision=0.0, recall=0.0, fmeasure=0.0)]
+ ```
+
+ The same example, but with aggregation:
+ ```python
+ >>> rouge = evaluate.load('rouge')
+ >>> predictions = ["hello goodbye", "ankh morpork"]
+ >>> references = ["goodbye", "general kenobi"]
+ >>> results = rouge.compute(predictions=predictions,
+ ...                         references=references,
+ ...                         use_aggregator=True)
+ >>> print(list(results.keys()))
+ ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
+ >>> print(results["rouge1"])
+ AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.25, recall=0.5, fmeasure=0.3333333333333333), high=Score(precision=0.5, recall=1.0, fmeasure=0.6666666666666666))
+ ```
+
+ The same example, but only calculating `rouge1`:
+ ```python
+ >>> rouge = evaluate.load('rouge')
+ >>> predictions = ["hello goodbye", "ankh morpork"]
+ >>> references = ["goodbye", "general kenobi"]
+ >>> results = rouge.compute(predictions=predictions,
+ ...                         references=references,
+ ...                         rouge_types=['rouge1'],
+ ...                         use_aggregator=True)
+ >>> print(list(results.keys()))
+ ['rouge1']
+ >>> print(results["rouge1"])
+ AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.25, recall=0.5, fmeasure=0.3333333333333333), high=Score(precision=0.5, recall=1.0, fmeasure=0.6666666666666666))
+ ```
+
+ ## Limitations and Bias
+ See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE's limitations.
+
+ ## Citation
+ ```bibtex
+ @inproceedings{lin-2004-rouge,
+     title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
+     author = "Lin, Chin-Yew",
+     booktitle = "Text Summarization Branches Out",
+     month = jul,
+     year = "2004",
+     address = "Barcelona, Spain",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/W04-1013",
+     pages = "74--81",
+ }
+ ```
+
+ ## Further References
+ - This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)
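
As a reading aid for the `precision`, `recall`, and `fmeasure` fields described in the metric card, here is a minimal sketch of ROUGE-N scoring in plain Python. It is not the code this commit ships: the helper names `ngrams` and `rouge_n` are hypothetical, and the tokenization (lowercase, split on whitespace) is a simplifying assumption. The wrapped `rouge_score` package, as far as I can tell, also strips non-alphanumeric characters and can optionally stem tokens.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens. Hypothetical helper."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    """Return (precision, recall, fmeasure) for n-gram overlap.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    ref_counts = ngrams(reference.lower().split(), n)
    pred_counts = ngrams(prediction.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# One of two predicted unigrams matches, and the single reference unigram is covered.
print(rouge_n("goodbye", "hello goodbye"))
```

Precision divides the overlap by the number of n-grams in the prediction, recall divides it by the number in the reference, and the F-measure is their harmonic mean, which is why all three values stay between 0 and 1.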
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("rouge")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ absl-py
+ nltk
+ rouge_score
rouge.py ADDED
@@ -0,0 +1,131 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ ROUGE metric from Google Research github repo. """
+
+ # The dependencies in https://github.com/google-research/google-research/blob/master/rouge/requirements.txt
+ import absl  # Here to have a nice missing dependency error message early on
+ import datasets
+ import nltk  # Here to have a nice missing dependency error message early on
+ import numpy  # Here to have a nice missing dependency error message early on
+ import six  # Here to have a nice missing dependency error message early on
+ from rouge_score import rouge_scorer, scoring
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{lin-2004-rouge,
+     title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
+     author = "Lin, Chin-Yew",
+     booktitle = "Text Summarization Branches Out",
+     month = jul,
+     year = "2004",
+     address = "Barcelona, Spain",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/W04-1013",
+     pages = "74--81",
+ }
+ """
+
+ _DESCRIPTION = """\
+ ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for
+ evaluating automatic summarization and machine translation software in natural language processing.
+ The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
+
+ Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
+
+ This metric is a wrapper around the Google Research reimplementation of ROUGE:
+ https://github.com/google-research/google-research/tree/master/rouge
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Calculates average rouge scores for a list of hypotheses and references.
+ Args:
+     predictions: list of predictions to score. Each prediction
+         should be a string with tokens separated by spaces.
+     references: list of references, one for each prediction. Each
+         reference should be a string with tokens separated by spaces.
+     rouge_types: A list of rouge types to calculate.
+         Valid names:
+         `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
+         `"rougeL"`: Longest common subsequence based scoring.
+         `"rougeLsum"`: rougeLsum splits text using `"\n"`.
+         See details in https://github.com/huggingface/datasets/issues/617
+     use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
+     use_aggregator: Return aggregates if this is set to True
+ Returns:
+     rouge1: rouge_1 (precision, recall, f1),
+     rouge2: rouge_2 (precision, recall, f1),
+     rougeL: rouge_l (precision, recall, f1),
+     rougeLsum: rouge_lsum (precision, recall, f1)
+ Examples:
+
+     >>> rouge = evaluate.load('rouge')
+     >>> predictions = ["hello there", "general kenobi"]
+     >>> references = ["hello there", "general kenobi"]
+     >>> results = rouge.compute(predictions=predictions, references=references)
+     >>> print(list(results.keys()))
+     ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
+     >>> print(results["rouge1"])
+     AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
+     >>> print(results["rouge1"].mid.fmeasure)
+     1.0
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Rouge(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/google-research/google-research/tree/master/rouge"],
+             reference_urls=[
+                 "https://en.wikipedia.org/wiki/ROUGE_(metric)",
+                 "https://github.com/google-research/google-research/tree/master/rouge",
+             ],
+         )
+
+     def _compute(self, predictions, references, rouge_types=None, use_aggregator=True, use_stemmer=False):
+         if rouge_types is None:
+             rouge_types = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
+
+         scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer)
+         if use_aggregator:
+             aggregator = scoring.BootstrapAggregator()
+         else:
+             scores = []
+
+         for ref, pred in zip(references, predictions):
+             score = scorer.score(ref, pred)
+             if use_aggregator:
+                 aggregator.add_scores(score)
+             else:
+                 scores.append(score)
+
+         if use_aggregator:
+             result = aggregator.aggregate()
+         else:
+             result = {}
+             for key in scores[0]:
+                 result[key] = list(score[key] for score in scores)
+
+         return result
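
When `use_aggregator=True`, `_compute` hands every per-pair score to `scoring.BootstrapAggregator` and returns `low`, `mid`, and `high` values. The sketch below illustrates the general idea of a percentile bootstrap over per-example scores using only the standard library. It is an illustration under stated assumptions, not the aggregator's actual code: the function names `percentile` and `bootstrap_interval`, the 1000 resamples, the 95% interval, and the fixed seed are all choices made for this example.

```python
import random


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def bootstrap_interval(values, n_samples=1000, confidence=0.95, seed=0):
    """Return (low, mid, high) percentiles of resampled means.

    Assumption: each resample draws len(values) items with replacement.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_samples)
    )
    tail = (1 - confidence) / 2 * 100
    return percentile(means, tail), percentile(means, 50), percentile(means, 100 - tail)


# Per-example precision values from the README example: 0.5 and 0.0.
low, mid, high = bootstrap_interval([0.5, 0.0])
print(low, mid, high)
```

With only two examples, every resampled mean is one of 0.0, 0.25, or 0.5, so the interval spans nearly the whole range. With more examples the `low` and `high` bounds tighten around `mid`.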