manueldeprada/beer
Manuel de Prada committed on Apr 29, 2023

Commit 80dcff0 • 1 Parent(s): 0f80edd

beer metric

Files changed (3)

README.md +58 -27
beer.py +85 -52
tests.py +7 -14

README.md CHANGED Viewed

@@ -1,50 +1,81 @@
 ---
 title: BEER
-datasets:
--
-tags:
-- evaluate
-- metric
-description: "TODO: add a description here"
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 pinned: false
 ---
 # Metric Card for BEER
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-## Metric Description
-*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
-## How to Use
-*Give general statement of how to use the metric*
-*Provide simplest possible example for using the metric*
-### Inputs
-*List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-### Output Values
-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-#### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-## Limitations and Bias
-*Note any known limitations or biases that the metric has, with links and references if possible.*
 ## Citation
-*Cite the source where this metric was introduced.*
-## Further References
-*Add any useful further references.*

 ---
 title: BEER
+emoji: 🤗
+colorFrom: blue
+colorTo: red
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
+description: >-
+   BEER 2.0 (BEtter Evaluation as Ranking) is a trained machine translation evaluation metric with high correlation with human judgment both on sentence and corpus level. It is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+   It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+   The model is trained on ranking similar translations using a vector of feature values for each system output.
+   BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+   Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.
 ---
 # Metric Card for BEER
+## Metric description
+BEER 2.0 (BEtter Evaluation as Ranking) is a trained machine translation evaluation metric with high correlation with human judgment both on sentence and corpus level. It is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+The model is trained on ranking similar translations using a vector of feature values for each system output.
+BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.
+## How to use
+BEER has two mandatory arguments:
+- `predictions`: a `list` of predictions to score. Each prediction should be a string with tokens separated by spaces.
+- `references`: a `list` of references (multiple `references` per `prediction` are not allowed). Each reference should be a string with tokens separated by spaces.
+## Prerequisites
+This module downloads and executes the original authors' BEER package. You must have Java installed to run it, and it will fail to load otherwise.
+Because the module is not pure Python and calls the BEER executable, it is much faster to pass a batch of predictions and references in a single call than to call the metric repeatedly with one prediction and reference at a time.
+```python
+>>> import evaluate
+>>> beer = evaluate.load('beer')
+>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party", "hello there general kenobi"]
+>>> references = ["It is a guide to action that ensures that the military will forever heed Party commands", "hello general kenobi"]
+>>> results = beer.compute(predictions=predictions, references=references)
+```
+## Output values
+The metric outputs a dictionary containing the BEER score and the scores of the individual sentences:
+```
+{'beer': 0.4557488704361114,
+ 'beer_scores': [0.6380935618609037, 0.7291530494474304]}
+```
 ## Citation
+```bibtex
+@inproceedings{stanojevic-simaan-2014-fitting,
+    title = "Fitting Sentence Level Translation Evaluation with Many Dense Features",
+    author = "Stanojevi{\'c}, Milo{\v{s}}  and
+      Sima{'}an, Khalil",
+    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
+    month = oct,
+    year = "2014",
+    address = "Doha, Qatar",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/D14-1025",
+    doi = "10.3115/v1/D14-1025",
+    pages = "202--206",
+}
+```
+## Further References
+- [BEER -- Official GitHub](https://github.com/stanojevic/beer)
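
The Prerequisites section of the new README says the module fails to load without Java. A minimal sketch of a check a caller could run before `evaluate.load` is shown below. The helper name `java_available` is hypothetical and is not part of this commit; it only uses the standard library.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs (hypothetical helper)."""
    if shutil.which("java") is None:
        return False
    try:
        # `java -version` writes to stderr; a zero exit code is enough here.
        subprocess.run(
            ["java", "-version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return True
```

Calling `java_available()` first lets a script print an installation hint instead of surfacing the exception raised in `_download_and_prepare`.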

beer.py CHANGED Viewed

@@ -1,4 +1,4 @@
-# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -11,85 +11,118 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""TODO: Add a description here."""
-import evaluate
 import datasets
-# TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{huggingface:module,
-title = {A great new module},
-authors={huggingface, Inc.},
-year={2020}
 }
 """
-# TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
-"""
-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
-Calculates how good are predictions given some references, using certain scores
 Args:
-    predictions: list of predictions to score. Each predictions
         should be a string with tokens separated by spaces.
     references: list of reference for each prediction. Each
         reference should be a string with tokens separated by spaces.
 Returns:
-    accuracy: description of the first score,
-    another_score: description of the second score,
 Examples:
-    Examples should be written in doctest format, and should illustrate how
-    to use the function.
-    >>> my_new_module = evaluate.load("my_new_module")
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-    >>> print(results)
-    {'accuracy': 1.0}
 """
-# TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
-class BEER(evaluate.Metric):
-    """TODO: Short description of my evaluation module."""
     def _info(self):
-        # TODO: Specifies the evaluate.EvaluationModuleInfo object
         return evaluate.MetricInfo(
-            # This is the description that will appear on the modules page.
-            module_type="metric",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            # This defines the format of each prediction and reference
-            features=datasets.Features({
-                'predictions': datasets.Value('int64'),
-                'references': datasets.Value('int64'),
-            }),
-            # Homepage of the module for documentation
-            homepage="http://module.homepage",
-            # Additional links to the codebase or references
-            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
         )
     def _download_and_prepare(self, dl_manager):
-        """Optional: download external resources useful to compute the scores"""
-        # TODO: Download external resources if needed
-        pass
     def _compute(self, predictions, references):
-        """Returns the scores"""
-        # TODO: Compute the different scores of the module
-        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-        return {
-            "accuracy": accuracy,
-        }

+# Copyright 2020 The HuggingFace Evaluate Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+""" BEER metric. """
+import os
+import re
 import datasets
+import evaluate
+import subprocess
+import tempfile
 _CITATION = """\
+@inproceedings{stanojevic-simaan-2014-fitting,
+  title     = {Fitting Sentence Level Translation Evaluation with Many Dense Features},
+  author    = {Stanojevi{\\'c}, Milo{\\v{s}}  and Sima{'}an, Khalil},
+  booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
+  month = oct,
+  year = "2014",
+  address = "Doha, Qatar",
+  publisher = "Association for Computational Linguistics",
+  url = "https://aclanthology.org/D14-1025",
+  doi = "10.3115/v1/D14-1025",
+  pages = "202--206",
 }
 """
 _DESCRIPTION = """\
+BEER is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+The model is trained on ranking similar translations using a vector of feature values for each system output.
+BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.
+"""
 _KWARGS_DESCRIPTION = """
+Computes the BEER score of translated segments against a single reference per segment.
 Args:
+    predictions: list of predictions to score. Each prediction
         should be a string with tokens separated by spaces.
     references: list of reference for each prediction. Each
         reference should be a string with tokens separated by spaces.
 Returns:
+    'beer': beer score.
+    'beer_scores': list of scores for each sentence.
 Examples:
+    >>> beer = evaluate.load('beer')
+    >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+    >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+    >>> results = beer.compute(predictions=predictions, references=references)
+    >>> print(round(results["beer"], 4))
+    0.319
 """
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Beer(evaluate.Metric):
     def _info(self):
         return evaluate.MetricInfo(
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
+            features=[
+                datasets.Features(
+                    {
+                        "predictions": datasets.Value("string", id="sequence"),
+                        "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                    }
+                ),
+                datasets.Features(
+                    {
+                        "predictions": datasets.Value("string", id="sequence"),
+                        "references": datasets.Value("string", id="sequence"),
+                    }
+                ),
+            ],
+            codebase_urls=["https://github.com/stanojevic/beer"],
+            reference_urls=[
+                "http://aclweb.org/anthology/D14-1025",
+            ],
         )
     def _download_and_prepare(self, dl_manager):
+        try:
+            subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
+        except Exception as e:
+            raise Exception("Java is not installed. Please install java and try again.") from e
+        dl_manager = datasets.download.DownloadManager()
+        _BEER_URL = "https://raw.githubusercontent.com/stanojevic/beer/master/packaged/beer_2.0.tar.gz"
+        paths = dl_manager.download_and_extract(_BEER_URL)
+        self.beer_path = os.path.join(paths, "beer_2.0/beer")
+        self.float_pattern = re.compile(r"\d+\.\d+")
     def _compute(self, predictions, references):
+        if references and isinstance(references[0], list):
+            raise ValueError("Beer metric does not support multiple references")
+        try:
+            with tempfile.NamedTemporaryFile(mode="w", delete=False) as pred_file:
+                pred_file.write("\n".join(predictions))
+                pred_file.flush()
+                pred_file.close()
+                with tempfile.NamedTemporaryFile(mode="w", delete=False) as ref_file:
+                    ref_file.write("\n".join(references))
+                    ref_file.flush()
+                    ref_file.close()
+                    cmd = [self.beer_path, "-r", ref_file.name, "-s", pred_file.name, "--printSentScores"]
+                    output = subprocess.check_output(cmd).decode("utf-8")
+                    assert output.startswith("sent 1 score is "), "Unexpected output: {}".format(output)
+                    output = output.strip().split("\n")
+                    total_score = float(output[-1][11:])
+                    scores = [float(self.float_pattern.findall(s)[0]) for s in output[:-1]]
+                    return {"beer": total_score, "beer_scores": scores}
+        except Exception as e:
+            raise Exception("Error while computing beer score: {}".format(e)) from e
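
The output handling in `_compute` relies on a fixed slice (`[11:]`) and on the first float found in each line. As a rough illustration, the same parsing could be written as a standalone function with explicit checks. The line shapes below (`sent <n> score is <float>` and a final `total BEER <float>`) are assumptions inferred from the code in this commit, not taken from the BEER documentation, and `parse_beer_output` is a hypothetical name.

```python
import re

# Assumed line shapes, inferred from the parsing code in this commit:
#   "sent <n> score is <value>"   one line per sentence
#   "total BEER <value>"          final line (11-character prefix)
_SENT_RE = re.compile(r"^sent \d+ score is (\S+)$")
_TOTAL_PREFIX = "total BEER "


def parse_beer_output(output: str) -> dict:
    """Parse sentence-level and total scores from BEER's stdout (sketch)."""
    lines = output.strip().splitlines()
    if not lines or not lines[-1].startswith(_TOTAL_PREFIX):
        raise ValueError(f"Unexpected BEER output: {output!r}")
    scores = []
    for line in lines[:-1]:
        match = _SENT_RE.match(line.strip())
        if match is None:
            raise ValueError(f"Unexpected sentence line: {line!r}")
        # float() raises ValueError itself if the captured token is not numeric.
        scores.append(float(match.group(1)))
    total = float(lines[-1][len(_TOTAL_PREFIX):])
    return {"beer": total, "beer_scores": scores}


# Sample text built from the scores quoted in the README; the layout is assumed.
sample = (
    "sent 1 score is 0.6380935618609037\n"
    "sent 2 score is 0.7291530494474304\n"
    "total BEER 0.4557488704361114\n"
)
result = parse_beer_output(sample)
```

Raising `ValueError` instead of using `assert` keeps the check active when Python runs with `-O`, and naming the prefix avoids the magic number.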

tests.py CHANGED Viewed

@@ -1,17 +1,10 @@
 test_cases = [
     {
-        "predictions": [0, 0],
-        "references": [1, 1],
-        "result": {"metric_score": 0}
     },
-    {
-        "predictions": [1, 1],
-        "references": [1, 1],
-        "result": {"metric_score": 1}
-    },
-    {
-        "predictions": [1, 0],
-        "references": [1, 1],
-        "result": {"metric_score": 0.5}
-    }
-]

 test_cases = [
     {
+        "predictions": [
+            "It is a guide to action which ensures that the military always obeys the commands of the party",
+            "hello there general kenobi"],
+        "references": ["It is a guide to action that ensures that the military will forever heed Party commands",
+                       "hello general kenobi"],
+        "result": {'beer': 0.4557488704361114, 'beer_scores': [0.6380935618609037, 0.7291530494474304]}
     },
+]
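
The expected result in `tests.py` lists floats to full precision. If the test harness compares them exactly, small numeric differences across Java versions or platforms could cause spurious failures. A tolerance-based comparison is sketched below; `results_match` is a hypothetical helper and the tolerance value is an arbitrary assumption.

```python
import math


def results_match(actual: dict, expected: dict, rel_tol: float = 1e-6) -> bool:
    """Compare two metric result dicts, allowing small float differences."""
    if actual.keys() != expected.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        if isinstance(want, list):
            # Per-sentence scores: same length and element-wise close.
            if not isinstance(got, list) or len(got) != len(want):
                return False
            if not all(math.isclose(g, w, rel_tol=rel_tol) for g, w in zip(got, want)):
                return False
        elif not math.isclose(got, want, rel_tol=rel_tol):
            return False
    return True


expected = {"beer": 0.4557488704361114, "beer_scores": [0.6380935618609037, 0.7291530494474304]}
rounded = {"beer": 0.45574887, "beer_scores": [0.63809356, 0.72915305]}
different = {"beer": 0.5, "beer_scores": [0.63809356, 0.72915305]}
```

Here `results_match(rounded, expected)` is true and `results_match(different, expected)` is false.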