Space: evaluate-metric/meteor

lvwerra HF Staff committed on May 20, 2022

Commit 47e6046 (1 parent: 0b87fad)

Update Space (evaluate main: 828c6327)

Files changed (4)

README.md +112 -5
app.py +6 -0
meteor.py +128 -0
requirements.txt +4 -0

README.md CHANGED

@@ -1,12 +1,119 @@
 ---
-title: Meteor
-emoji: 📉
-colorFrom: purple
-colorTo: pink
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

 ---
+title: METEOR
+emoji: 🤗
+colorFrom: blue
+colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---
+# Metric Card for METEOR
+## Metric description
+METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric, which is calculated based on the harmonic mean of precision and recall, with recall weighted more than precision.
+METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
+## How to use
+METEOR has two mandatory arguments:
+`predictions`: a list of predictions to score. Each prediction should be a string with tokens separated by spaces.
+`references`: a list of references, one per prediction. Each reference should be a string with tokens separated by spaces.
+It also has several optional parameters:
+`alpha`: Parameter for controlling relative weights of precision and recall. The default value is `0.9`.
+`beta`: Parameter for controlling shape of penalty as a function of fragmentation. The default value is `3`.
+`gamma`: The relative weight assigned to fragmentation penalty. The default is `0.5`.
+Refer to the [METEOR paper](https://aclanthology.org/W05-0909.pdf) for more information about parameter values and ranges.
+```python
+>>> meteor = evaluate.load('meteor')
+>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+>>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+>>> results = meteor.compute(predictions=predictions, references=references)
+```
+## Output values
+The metric outputs a dictionary containing the METEOR score. Its values range from 0 to 1.
+### Values from popular papers
+The [METEOR paper](https://aclanthology.org/W05-0909.pdf) does not report METEOR score values for different models, but it does report that METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data.
+## Examples
+Maximal values:
+```python
+>>> meteor = evaluate.load('meteor')
+>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+>>> references = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+>>> results = meteor.compute(predictions=predictions, references=references)
+>>> print(round(results['meteor'], 2))
+1.0
+```
+Minimal values:
+```python
+>>> meteor = evaluate.load('meteor')
+>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+>>> references = ["Hello world"]
+>>> results = meteor.compute(predictions=predictions, references=references)
+>>> print(round(results['meteor'], 2))
+0.0
+```
+Partial match:
+```python
+>>> meteor = evaluate.load('meteor')
+>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+>>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+>>> results = meteor.compute(predictions=predictions, references=references)
+>>> print(round(results['meteor'], 2))
+0.69
+```
+## Limitations and bias
+While the correlation between METEOR and human judgments was measured for Chinese and Arabic and found to be significant, further experimentation is needed to check its correlation for other languages.
+Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multiple word entities (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.
+## Citation
+```bibtex
+@inproceedings{banarjee2005,
+  title     = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
+  author    = {Banerjee, Satanjeev  and Lavie, Alon},
+  booktitle = {Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
+  month     = jun,
+  year      = {2005},
+  address   = {Ann Arbor, Michigan},
+  publisher = {Association for Computational Linguistics},
+  url       = {https://www.aclweb.org/anthology/W05-0909},
+  pages     = {65--72},
+}
+```
+## Further References
+- [METEOR -- Wikipedia](https://en.wikipedia.org/wiki/METEOR)
+- [METEOR score -- NLTK](https://www.nltk.org/_modules/nltk/translate/meteor_score.html)
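
Note on the README's `alpha`, `beta` and `gamma` parameters: the sketch below shows one way they can combine into a score once the unigram alignment has been counted. It is not part of this commit. It assumes the formulation used in NLTK's `meteor_score` module (a recall-weighted harmonic mean multiplied by one minus a fragmentation penalty), and the helper name `meteor_from_counts` and its count arguments are hypothetical, introduced only for illustration.

```python
def meteor_from_counts(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine alignment counts into a METEOR-style score (illustrative helper).

    matches: number of aligned unigrams
    hyp_len: number of tokens in the prediction
    ref_len: number of tokens in the reference
    chunks:  number of contiguous runs of aligned unigrams
    """
    if matches == 0:
        # No aligned unigrams: the score is defined as 0.
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha = 0.9 puts most of the weight on recall.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Fewer, longer chunks mean better word order and a smaller penalty.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# An 18-token prediction identical to its reference aligns as a single chunk.
identical = meteor_from_counts(matches=18, hyp_len=18, ref_len=18, chunks=1)
print(identical, round(identical, 2))
```

Under these assumptions, even a perfect match keeps a small residual penalty, so the raw score sits just below 1 and only reaches `1.0` after rounding, as in the "Maximal values" example.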

app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("meteor")
+launch_gradio_widget(module)

meteor.py ADDED

	@@ -0,0 +1,128 @@

+# Copyright 2020 The HuggingFace Evaluate Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" METEOR metric. """
+import datasets
+import numpy as np
+from datasets.config import importlib_metadata, version
+from nltk.translate import meteor_score
+import evaluate
+NLTK_VERSION = version.parse(importlib_metadata.version("nltk"))
+if NLTK_VERSION >= version.Version("3.6.4"):
+    from nltk import word_tokenize
+_CITATION = """\
+@inproceedings{banarjee2005,
+  title     = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
+  author    = {Banerjee, Satanjeev  and Lavie, Alon},
+  booktitle = {Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
+  month     = jun,
+  year      = {2005},
+  address   = {Ann Arbor, Michigan},
+  publisher = {Association for Computational Linguistics},
+  url       = {https://www.aclweb.org/anthology/W05-0909},
+  pages     = {65--72},
+}
+"""
+_DESCRIPTION = """\
+METEOR, an automatic metric for machine translation evaluation
+that is based on a generalized concept of unigram matching between the
+machine-produced translation and human-produced reference translations.
+Unigrams can be matched based on their surface forms, stemmed forms,
+and meanings; furthermore, METEOR can be easily extended to include more
+advanced matching strategies. Once all generalized unigram matches
+between the two strings have been found, METEOR computes a score for
+this matching using a combination of unigram-precision, unigram-recall, and
+a measure of fragmentation that is designed to directly capture how
+well-ordered the matched words in the machine translation are in relation
+to the reference.
+METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic
+data and 0.331 on the Chinese data. This is shown to be an improvement on
+using simply unigram-precision, unigram-recall and their harmonic F1
+combination.
+"""
+_KWARGS_DESCRIPTION = """
+Computes METEOR score of translated segments against one or more references.
+Args:
+    predictions: list of predictions to score. Each prediction
+        should be a string with tokens separated by spaces.
+    references: list of references, one per prediction. Each
+        reference should be a string with tokens separated by spaces.
+    alpha: Parameter for controlling relative weights of precision and recall. default: 0.9
+    beta: Parameter for controlling shape of penalty as a function of fragmentation. default: 3
+    gamma: Relative weight assigned to fragmentation penalty. default: 0.5
+Returns:
+    'meteor': meteor score.
+Examples:
+    >>> meteor = evaluate.load('meteor')
+    >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+    >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+    >>> results = meteor.compute(predictions=predictions, references=references)
+    >>> print(round(results["meteor"], 4))
+    0.6944
+"""
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Meteor(evaluate.EvaluationModule):
+    def _info(self):
+        return evaluate.EvaluationModuleInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string", id="sequence"),
+                    "references": datasets.Value("string", id="sequence"),
+                }
+            ),
+            codebase_urls=["https://github.com/nltk/nltk/blob/develop/nltk/translate/meteor_score.py"],
+            reference_urls=[
+                "https://www.nltk.org/api/nltk.translate.html#module-nltk.translate.meteor_score",
+                "https://en.wikipedia.org/wiki/METEOR",
+            ],
+        )
+    def _download_and_prepare(self, dl_manager):
+        import nltk
+        nltk.download("wordnet")
+        if NLTK_VERSION >= version.Version("3.6.5"):
+            nltk.download("punkt")
+        if NLTK_VERSION >= version.Version("3.6.6"):
+            nltk.download("omw-1.4")
+    def _compute(self, predictions, references, alpha=0.9, beta=3, gamma=0.5):
+        if NLTK_VERSION >= version.Version("3.6.5"):
+            scores = [
+                meteor_score.single_meteor_score(
+                    word_tokenize(ref), word_tokenize(pred), alpha=alpha, beta=beta, gamma=gamma
+                )
+                for ref, pred in zip(references, predictions)
+            ]
+        else:
+            scores = [
+                meteor_score.single_meteor_score(ref, pred, alpha=alpha, beta=beta, gamma=gamma)
+                for ref, pred in zip(references, predictions)
+            ]
+        return {"meteor": np.mean(scores)}
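
Note on the version gates in `meteor.py`: the module imports `word_tokenize` from NLTK 3.6.4 onward, but only tokenizes inputs and downloads `punkt` from 3.6.5, and downloads `omw-1.4` from 3.6.6. The sketch below restates that gating with the standard library only. The helper names are hypothetical, and the tuple-based parser is a simplification that assumes plain dotted release numbers such as `3.6.5`; the module itself compares `version.Version` objects.

```python
def parse_release(text):
    """Turn a plain dotted release string like '3.6.5' into a tuple of ints.

    Simplification: pre-release or local suffixes (e.g. '3.6.5rc1') are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def required_nltk_resources(nltk_version):
    """List the NLTK resources the module would download for this version."""
    release = parse_release(nltk_version)
    resources = ["wordnet"]  # always needed for synonym matching
    if release >= (3, 6, 5):
        resources.append("punkt")  # tokenizer models used by word_tokenize
    if release >= (3, 6, 6):
        resources.append("omw-1.4")  # multilingual WordNet data
    return resources


def passes_token_lists(nltk_version):
    """True when inputs are tokenized before scoring, False when raw strings are passed."""
    return parse_release(nltk_version) >= (3, 6, 5)


for candidate in ("3.6.4", "3.6.5", "3.7"):
    print(candidate, required_nltk_resources(candidate), passes_token_lists(candidate))
```

Keeping the gates in small functions like these makes it easy to check a given NLTK version against the thresholds without installing it.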

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+# TODO: fix github to release
+git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+datasets~=2.0
+nltk