Update Space (evaluate main: 0b7ed95a)
README.md CHANGED
@@ -38,12 +38,8 @@ At minimum, this metric takes as input a list of predictions and a list of references
 >>> references = ["hello there", "general kenobi"]
 >>> results = rouge.compute(predictions=predictions,
 ...                         references=references)
->>> print(list(results.keys()))
-['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
->>> print(results["rouge1"])
-AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
->>> print(results["rouge1"].mid.fmeasure)
-1.0
+>>> print(results)
+{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
 ```
 
 ### Inputs
@@ -62,18 +58,18 @@ AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(pre
 - **use_stemmer** (`boolean`): If `True`, uses Porter stemmer to strip word suffixes. Defaults to `False`.
 
 ### Output Values
-The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of
+The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of scores, with one score for each sentence. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output is:
 
 ```python
-{'rouge1': [
+{'rouge1': [0.6666666666666666, 1.0], 'rouge2': [0.0, 1.0]}
 ```
 
 If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
 ```python
-{'rouge1':
+{'rouge1': 1.0, 'rouge2': 1.0}
 ```
 
-The
+The ROUGE values are in the range of 0 to 1.
 
 
 #### Values from Popular Papers
@@ -86,11 +82,12 @@ An example without aggregation:
 >>> predictions = ["hello goodbye", "ankh morpork"]
 >>> references = ["goodbye", "general kenobi"]
 >>> results = rouge.compute(predictions=predictions,
-...                         references=references
+...                         references=references,
+...                         use_aggregator=False)
 >>> print(list(results.keys()))
 ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
 >>> print(results["rouge1"])
-[
+[0.5, 0.0]
 ```
 
 The same example, but with aggregation:
@@ -104,7 +101,7 @@ The same example, but with aggregation:
 >>> print(list(results.keys()))
 ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
 >>> print(results["rouge1"])
-
+0.25
 ```
 
 The same example, but only calculating `rouge_1`:
@@ -119,7 +116,7 @@ The same example, but only calculating `rouge_1`:
 >>> print(list(results.keys()))
 ['rouge1']
 >>> print(results["rouge1"])
-
+0.25
 ```
 
 ## Limitations and Bias
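The hunks above replace the old `AggregateScore`/`Score` return values with plain floats (the mid f-measure of the aggregated score). A minimal migration sketch for calling code, based on the README example above (the printed value is the one the updated README reports):

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

results = rouge.compute(predictions=predictions, references=references)

# Before this change, the number had to be unpacked from an AggregateScore:
#   results["rouge1"].mid.fmeasure
# After this change, the dictionary value is already that number:
print(results["rouge1"])  # 1.0 in the README example above
```

With `use_aggregator=False`, each value is instead a list with one f-measure per prediction, as in the `[0.5, 0.0]` example.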
rouge.py CHANGED
@@ -65,22 +65,18 @@ Args:
 use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
 use_aggregator: Return aggregates if this is set to True
 Returns:
-rouge1: rouge_1 (
-rouge2: rouge_2 (
-rougeL: rouge_l (
-rougeLsum: rouge_lsum (
+rouge1: rouge_1 (f1),
+rouge2: rouge_2 (f1),
+rougeL: rouge_l (f1),
+rougeLsum: rouge_lsum (f1)
 Examples:
 
 >>> rouge = evaluate.load('rouge')
 >>> predictions = ["hello there", "general kenobi"]
 >>> references = ["hello there", "general kenobi"]
 >>> results = rouge.compute(predictions=predictions, references=references)
->>> print(list(results.keys()))
-['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
->>> print(results["rouge1"])
-AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
->>> print(results["rouge1"].mid.fmeasure)
-1.0
+>>> print(results)
+{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
 """
 
 
@@ -123,9 +119,12 @@ class Rouge(evaluate.EvaluationModule):
 
         if use_aggregator:
             result = aggregator.aggregate()
+            for key in result:
+                result[key] = result[key].mid.fmeasure
+
         else:
             result = {}
             for key in scores[0]:
-                result[key] = list(score[key] for score in scores)
+                result[key] = list(score[key].fmeasure for score in scores)
 
         return result
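For context on the second hunk: `aggregator` here is a `BootstrapAggregator` from Google's `rouge_score` package, which this metric builds on, and `aggregate()` returns `AggregateScore(low, mid, high)` named tuples of `Score(precision, recall, fmeasure)`. The sketch below reproduces the same reduction the added lines perform, using `rouge_score` directly; it is illustrative only and not part of the diff:

```python
from rouge_score import rouge_scorer, scoring

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

scorer = rouge_scorer.RougeScorer(rouge_types=["rouge1", "rouge2"], use_stemmer=False)
aggregator = scoring.BootstrapAggregator()

for ref, pred in zip(references, predictions):
    # score(target, prediction) -> {rouge_type: Score(precision, recall, fmeasure)}
    aggregator.add_scores(scorer.score(ref, pred))

result = aggregator.aggregate()  # {rouge_type: AggregateScore(low, mid, high)}

# Same reduction as the added lines above: keep only the mid f-measure.
result = {key: value.mid.fmeasure for key, value in result.items()}
print(result)  # expected: {'rouge1': 1.0, 'rouge2': 1.0}
```

Keeping only `mid.fmeasure` (and, in the non-aggregated branch, `score[key].fmeasure`) is what turns the nested score objects into the flat numbers shown in the updated README and docstring.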