lvwerra HF staff committed on
Commit
47e6046
1 Parent(s): 0b87fad

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +112 -5
  2. app.py +6 -0
  3. meteor.py +128 -0
  4. requirements.txt +4 -0
README.md CHANGED
@@ -1,12 +1,119 @@
  ---
- title: Meteor
- emoji: 📉
- colorFrom: purple
- colorTo: pink
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: METEOR
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
+ # Metric Card for METEOR
+
+ ## Metric description
+
+ METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric computed from the harmonic mean of unigram precision and recall, with recall weighted more heavily than precision.
+
+ METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
+
+
+ ## How to use
+
+ METEOR has two mandatory arguments:
+
+ `predictions`: a list of predictions to score. Each prediction should be a string with tokens separated by spaces.
+
+ `references`: a list of references, one per prediction. Each reference should be a string with tokens separated by spaces.
+
+ It also has several optional parameters:
+
+ `alpha`: Parameter for controlling relative weights of precision and recall. The default value is `0.9`.
+
+ `beta`: Parameter for controlling shape of penalty as a function of fragmentation. The default value is `3`.
+
+ `gamma`: The relative weight assigned to fragmentation penalty. The default value is `0.5`.
+
+ Refer to the [METEOR paper](https://aclanthology.org/W05-0909.pdf) for more information about parameter values and ranges.
+
+ ```python
+ >>> meteor = evaluate.load('meteor')
+ >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+ >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+ >>> results = meteor.compute(predictions=predictions, references=references)
+ ```
+
+ ## Output values
+
+ The metric outputs a dictionary containing the METEOR score. Its values range from 0 to 1.
+
+
+ ### Values from popular papers
+ The [METEOR paper](https://aclanthology.org/W05-0909.pdf) does not report METEOR score values for different models, but it does report that METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data.
+
+
+ ## Examples
+
+ Maximal values:
+
+ ```python
+ >>> meteor = evaluate.load('meteor')
+ >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+ >>> references = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+ >>> results = meteor.compute(predictions=predictions, references=references)
+ >>> print(round(results['meteor'], 2))
+ 1.0
+ ```
+
+ Minimal values:
+
+ ```python
+ >>> meteor = evaluate.load('meteor')
+ >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+ >>> references = ["Hello world"]
+ >>> results = meteor.compute(predictions=predictions, references=references)
+ >>> print(round(results['meteor'], 2))
+ 0.0
+ ```
+
+ Partial match:
+
+ ```python
+ >>> meteor = evaluate.load('meteor')
+ >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+ >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+ >>> results = meteor.compute(predictions=predictions, references=references)
+ >>> print(round(results['meteor'], 2))
+ 0.69
+ ```
+
+ ## Limitations and bias
+
+ While the correlation between METEOR and human judgments was measured for Chinese and Arabic and found to be significant, further experimentation is needed to check its correlation for other languages.
+
+ Furthermore, while the alignment and matching done in METEOR are based on unigrams, using multi-word units (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.
+
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{banarjee2005,
+   title = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
+   author = {Banerjee, Satanjeev and Lavie, Alon},
+   booktitle = {Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
+   month = jun,
+   year = {2005},
+   address = {Ann Arbor, Michigan},
+   publisher = {Association for Computational Linguistics},
+   url = {https://www.aclweb.org/anthology/W05-0909},
+   pages = {65--72},
+ }
+ ```
+
+ ## Further References
+ - [METEOR -- Wikipedia](https://en.wikipedia.org/wiki/METEOR)
+ - [METEOR score -- NLTK](https://www.nltk.org/_modules/nltk/translate/meteor_score.html)
+
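The README above describes `alpha`, `beta`, and `gamma` only in words. The following is a minimal sketch of how those three parameters could combine into a score, assuming the unigram alignment between prediction and reference has already been found. It is not part of this commit: the helper names `count_chunks` and `meteor_from_matches` are hypothetical, and the matching stages (exact, stem, synonym) that NLTK performs are left out.

```python
def count_chunks(matches):
    """Count runs of matches that are contiguous and in order in both strings.

    `matches` is a list of (prediction_index, reference_index) pairs,
    sorted by prediction_index. Fewer chunks means better word order.
    """
    if not matches:
        return 0
    chunks = 1
    for (p_prev, r_prev), (p_next, r_next) in zip(matches, matches[1:]):
        # A new chunk starts whenever adjacency breaks on either side.
        if p_next != p_prev + 1 or r_next != r_prev + 1:
            chunks += 1
    return chunks


def meteor_from_matches(matches, pred_len, ref_len, alpha=0.9, beta=3, gamma=0.5):
    """Combine precision, recall, and fragmentation into one score (sketch)."""
    m = len(matches)
    if m == 0:
        return 0.0
    precision = m / pred_len
    recall = m / ref_len
    # Weighted harmonic mean; alpha close to 1 favours recall.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # beta shapes the penalty curve, gamma caps its weight.
    penalty = gamma * (count_chunks(matches) / m) ** beta
    return fmean * (1 - penalty)


# An 18-token prediction identical to its reference: one chunk, 18 matches.
identical = [(i, i) for i in range(18)]
print(round(meteor_from_matches(identical, 18, 18), 2))
```

Under these assumptions an identical pair still carries a small penalty of `gamma * (1 / m) ** beta`, so the score is slightly below 1 before rounding, which is consistent with the README's "Maximal values" example printing a rounded result.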
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("meteor")
+ launch_gradio_widget(module)
meteor.py ADDED
@@ -0,0 +1,128 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ METEOR metric. """
+
+ import datasets
+ import numpy as np
+ from datasets.config import importlib_metadata, version
+ from nltk.translate import meteor_score
+
+ import evaluate
+
+
+ NLTK_VERSION = version.parse(importlib_metadata.version("nltk"))
+ if NLTK_VERSION >= version.Version("3.6.4"):
+     from nltk import word_tokenize
+
+
+ _CITATION = """\
+ @inproceedings{banarjee2005,
+   title = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
+   author = {Banerjee, Satanjeev and Lavie, Alon},
+   booktitle = {Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
+   month = jun,
+   year = {2005},
+   address = {Ann Arbor, Michigan},
+   publisher = {Association for Computational Linguistics},
+   url = {https://www.aclweb.org/anthology/W05-0909},
+   pages = {65--72},
+ }
+ """
+
+ _DESCRIPTION = """\
+ METEOR, an automatic metric for machine translation evaluation
+ that is based on a generalized concept of unigram matching between the
+ machine-produced translation and human-produced reference translations.
+ Unigrams can be matched based on their surface forms, stemmed forms,
+ and meanings; furthermore, METEOR can be easily extended to include more
+ advanced matching strategies. Once all generalized unigram matches
+ between the two strings have been found, METEOR computes a score for
+ this matching using a combination of unigram-precision, unigram-recall, and
+ a measure of fragmentation that is designed to directly capture how
+ well-ordered the matched words in the machine translation are in relation
+ to the reference.
+
+ METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic
+ data and 0.331 on the Chinese data. This is shown to be an improvement on
+ using simply unigram-precision, unigram-recall and their harmonic F1
+ combination.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes METEOR score of translated segments against one or more references.
+ Args:
+     predictions: list of predictions to score. Each prediction
+         should be a string with tokens separated by spaces.
+     references: list of references for each prediction. Each
+         reference should be a string with tokens separated by spaces.
+     alpha: Parameter for controlling relative weights of precision and recall. default: 0.9
+     beta: Parameter for controlling shape of penalty as a function of fragmentation. default: 3
+     gamma: Relative weight assigned to fragmentation penalty. default: 0.5
+ Returns:
+     'meteor': meteor score.
+ Examples:
+
+     >>> meteor = evaluate.load('meteor')
+     >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+     >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+     >>> results = meteor.compute(predictions=predictions, references=references)
+     >>> print(round(results["meteor"], 4))
+     0.6944
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Meteor(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/nltk/nltk/blob/develop/nltk/translate/meteor_score.py"],
+             reference_urls=[
+                 "https://www.nltk.org/api/nltk.translate.html#module-nltk.translate.meteor_score",
+                 "https://en.wikipedia.org/wiki/METEOR",
+             ],
+         )
+
+     def _download_and_prepare(self, dl_manager):
+         import nltk
+
+         nltk.download("wordnet")
+         if NLTK_VERSION >= version.Version("3.6.5"):
+             nltk.download("punkt")
+         if NLTK_VERSION >= version.Version("3.6.6"):
+             nltk.download("omw-1.4")
+
+     def _compute(self, predictions, references, alpha=0.9, beta=3, gamma=0.5):
+         if NLTK_VERSION >= version.Version("3.6.5"):
+             scores = [
+                 meteor_score.single_meteor_score(
+                     word_tokenize(ref), word_tokenize(pred), alpha=alpha, beta=beta, gamma=gamma
+                 )
+                 for ref, pred in zip(references, predictions)
+             ]
+         else:
+             scores = [
+                 meteor_score.single_meteor_score(ref, pred, alpha=alpha, beta=beta, gamma=gamma)
+                 for ref, pred in zip(references, predictions)
+             ]
+
+         return {"meteor": np.mean(scores)}
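The `_compute` method above scores each (reference, prediction) pair separately, passing the reference first, and returns the unweighted mean of the per-pair scores. The sketch below illustrates that aggregation pattern in isolation. It is not part of this commit: `unigram_overlap` is a hypothetical stand-in scorer used only so the example runs without NLTK, and the explicit length check is an added assumption (the `zip` in `_compute` would silently truncate instead).

```python
from statistics import mean


def unigram_overlap(reference, prediction):
    """Hypothetical stand-in scorer: share of reference tokens found in the prediction."""
    ref_tokens = reference.split()
    pred_tokens = set(prediction.split())
    if not ref_tokens:
        return 0.0
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)


def corpus_score(predictions, references, scorer=unigram_overlap):
    """Score each pair independently, then average (the pattern used in _compute)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # Reference first, prediction second, matching the argument order in _compute.
    scores = [scorer(ref, pred) for ref, pred in zip(references, predictions)]
    return {"score": mean(scores)}


print(corpus_score(["a b", "c"], ["a b", "d"]))
```

Because the mean is unweighted, a short sentence pair contributes as much to the final value as a long one; callers who need length-weighted aggregation would have to compute per-pair scores themselves.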
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ nltk