Manuel de Prada committed on
Commit
80dcff0
1 Parent(s): 0f80edd

beer metric

Files changed (3)
  1. README.md +58 -27
  2. beer.py +85 -52
  3. tests.py +7 -14
README.md CHANGED
@@ -1,50 +1,81 @@
  ---
  title: BEER
- datasets:
- -
- tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---

  # Metric Card for BEER

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*

- ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

- ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

- ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

- ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

- ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

- ## Further References
- *Add any useful further references.*

  ---
  title: BEER
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ description: >-
+   BEER 2.0 (BEtter Evaluation as Ranking) is a trained machine translation evaluation metric with high correlation with human judgment on both the sentence and the corpus level. It is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+   It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+   The model is trained to rank similar translations using a vector of feature values for each system output.
+   BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+   Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.
  ---

  # Metric Card for BEER

+ ## Metric description
+
+ BEER 2.0 (BEtter Evaluation as Ranking) is a trained machine translation evaluation metric with high correlation with human judgment on both the sentence and the corpus level. It is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+ It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+ The model is trained to rank similar translations using a vector of feature values for each system output.
+ BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+ Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.

+ ## How to use

+ BEER has two mandatory arguments:

+ `predictions`: a `list` of predictions to score. Each prediction should be a string with tokens separated by spaces.

+ `references`: a `list` of references (multiple `references` per `prediction` are not allowed). Each reference should be a string with tokens separated by spaces.

+ ## Prerequisites
+ This module downloads and executes the original authors' BEER package. Java must be installed; the module will fail to load otherwise.
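
A quick way to check the Java requirement up front, a minimal sketch using only the standard library (the module itself runs `java -version` when it loads):

```python
import shutil

# Fail early with a clear message if no Java runtime is on PATH.
if shutil.which("java") is None:
    raise RuntimeError("No `java` executable found on PATH; the BEER module will not load.")
```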

+ Since the metric calls the BEER executable rather than running native Python code, it is much faster to score a batch of predictions and references in a single call than to invoke the metric repeatedly with one prediction and reference at a time.

+ ```python
+ >>> beer = evaluate.load('beer')
+ >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party", "hello there general kenobi"]
+ >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands", "hello general kenobi"]
+ >>> results = beer.compute(predictions=predictions, references=references)
+ ```
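
To make the batching advice concrete, a minimal sketch of the two calling patterns (the sentences are illustrative):

```python
import evaluate

beer = evaluate.load("beer")
predictions = ["hello there general kenobi", "foo bar foobar"]
references = ["hello general kenobi", "foo bar foobar"]

# Fast: one call, one launch of the BEER executable for the whole batch.
batch_results = beer.compute(predictions=predictions, references=references)

# Slow: each iteration launches the executable again (the pattern to avoid).
looped = [
    beer.compute(predictions=[p], references=[r])["beer"]
    for p, r in zip(predictions, references)
]
```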

+ ## Output values
+
+ The metric outputs a dictionary containing the BEER score and the scores of the individual sentences:
+ ```
+ {'beer': 0.4557488704361114,
+  'beer_scores': [0.6380935618609037, 0.7291530494474304]}
+ ```
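
Each entry of `beer_scores` lines up with the prediction at the same index, so per-sentence scores can be inspected like this (assuming the `results`, `predictions`, and `references` from the example above):

```python
# Pair every prediction with its sentence-level BEER score.
for sentence, score in zip(predictions, results["beer_scores"]):
    print(f"{score:.3f}\t{sentence}")
```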

  ## Citation

+ ```bibtex
+ @inproceedings{stanojevic-simaan-2014-fitting,
+     title = "Fitting Sentence Level Translation Evaluation with Many Dense Features",
+     author = "Stanojevi{\'c}, Milo{\v{s}} and
+       Sima{'}an, Khalil",
+     booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
+     month = oct,
+     year = "2014",
+     address = "Doha, Qatar",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/D14-1025",
+     doi = "10.3115/v1/D14-1025",
+     pages = "202--206",
+ }
+ ```
+
+ ## Further References
+ - [BEER -- Official GitHub](https://github.com/stanojevic/beer)
beer.py CHANGED
@@ -1,4 +1,4 @@
- # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
@@ -11,85 +11,118 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""
-
- import evaluate
  import datasets

-
- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
- """

- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
          should be a string with tokens separated by spaces.
      references: list of reference for each prediction. Each
          reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.

-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-     >>> print(results)
-     {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
- class BEER(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""
-
      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
-             module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
-             features=datasets.Features({
-                 'predictions': datasets.Value('int64'),
-                 'references': datasets.Value('int64'),
-             }),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"]
          )

      def _download_and_prepare(self, dl_manager):
-         """Optional: download external resources useful to compute the scores"""
-         # TODO: Download external resources if needed
-         pass

      def _compute(self, predictions, references):
-         """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }

+ # Copyright 2020 The HuggingFace Evaluate Authors.
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """BEER metric."""
+ import os
+ import re
  import datasets
+ import evaluate
+ import subprocess
+ import tempfile

  _CITATION = """\
+ @inproceedings{stanojevic-simaan-2014-fitting,
+     title = "Fitting Sentence Level Translation Evaluation with Many Dense Features",
+     author = "Stanojevi{\'c}, Milo{\v{s}} and Sima{'}an, Khalil",
+     booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
+     month = oct,
+     year = "2014",
+     address = "Doha, Qatar",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/D14-1025",
+     doi = "10.3115/v1/D14-1025",
+     pages = "202--206",
  }
  """

  _DESCRIPTION = """\
+ BEER is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features.
+
+ It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation.
+
+ The model is trained to rank similar translations using a vector of feature values for each system output.
+
+ BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results.
+
+ Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.
+ """

  _KWARGS_DESCRIPTION = """
+ Computes the BEER score of translated segments against their references.
  Args:
+     predictions: list of predictions to score. Each prediction
          should be a string with tokens separated by spaces.
      references: list of reference for each prediction. Each
          reference should be a string with tokens separated by spaces.
  Returns:
+     'beer': beer score.
+     'beer_scores': list of scores for each sentence.
  Examples:

+     >>> beer = evaluate.load('beer')
+     >>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
+     >>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
+     >>> results = beer.compute(predictions=predictions, references=references)
+     >>> print(round(results["beer"], 4))
+     0.3190
  """

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Beer(evaluate.Metric):
      def _info(self):
          return evaluate.MetricInfo(
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
+             features=[
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                     }
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Value("string", id="sequence"),
+                     }
+                 ),
+             ],
+             codebase_urls=["https://github.com/stanojevic/beer"],
+             reference_urls=[
+                 "http://aclweb.org/anthology/D14-1025",
+             ],
          )

      def _download_and_prepare(self, dl_manager):
+         # BEER ships as a Java program; refuse to load without a Java runtime.
+         try:
+             subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
+         except Exception:
+             raise Exception("Java is not installed. Please install Java and try again.")
+         dl_manager = datasets.download.DownloadManager()
+         _BEER_URL = "https://raw.githubusercontent.com/stanojevic/beer/master/packaged/beer_2.0.tar.gz"
+         paths = dl_manager.download_and_extract(_BEER_URL)
+         self.beer_path = os.path.join(paths, "beer_2.0/beer")
+         self.float_pattern = re.compile(r"\d+\.\d+")

      def _compute(self, predictions, references):
+         if isinstance(references[0], list):
+             raise ValueError("The BEER metric does not support multiple references")
+         try:
+             # Write predictions and references to temporary files, one sentence per line.
+             with tempfile.NamedTemporaryFile(mode="w", delete=False) as pred_file:
+                 pred_file.write("\n".join(predictions))
+             with tempfile.NamedTemporaryFile(mode="w", delete=False) as ref_file:
+                 ref_file.write("\n".join(references))
+             cmd = [self.beer_path, "-r", ref_file.name, "-s", pred_file.name, "--printSentScores"]
+             output = subprocess.check_output(cmd).decode("utf-8")
+             assert output.startswith("sent 1 score is "), "Unexpected output: {}".format(output)
+             output = output.strip().split("\n")
+             total_score = float(output[-1][11:])  # the last line reads "total BEER <score>"
+             scores = [float(self.float_pattern.findall(s)[0]) for s in output[:-1]]
+             return {"beer": total_score, "beer_scores": scores}
+         except Exception as e:
+             raise Exception("Error while computing beer score: {}".format(e))
tests.py CHANGED
@@ -1,17 +1,10 @@
  test_cases = [
      {
-         "predictions": [0, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0}
      },
-     {
-         "predictions": [1, 1],
-         "references": [1, 1],
-         "result": {"metric_score": 1}
-     },
-     {
-         "predictions": [1, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0.5}
-     }
- ]

  test_cases = [
      {
+         "predictions": [
+             "It is a guide to action which ensures that the military always obeys the commands of the party",
+             "hello there general kenobi",
+         ],
+         "references": [
+             "It is a guide to action that ensures that the military will forever heed Party commands",
+             "hello general kenobi",
+         ],
+         "result": {"beer": 0.4557488704361114, "beer_scores": [0.6380935618609037, 0.7291530494474304]},
      },
+ ]
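
A sketch of one way to exercise these cases, assuming `test_cases` from this file is in scope (the actual evaluate test harness may differ, and `rel_tol` is an illustrative choice):

```python
import math

import evaluate

beer = evaluate.load("beer")
for case in test_cases:
    results = beer.compute(predictions=case["predictions"], references=case["references"])
    # Compare the corpus-level score and each sentence-level score against the expected values.
    assert math.isclose(results["beer"], case["result"]["beer"], rel_tol=1e-6)
    for got, want in zip(results["beer_scores"], case["result"]["beer_scores"]):
        assert math.isclose(got, want, rel_tol=1e-6)
```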