lvwerra (HF staff) committed on
Commit
2bb7fde
1 Parent(s): dd7d14f

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +134 -4
  2. app.py +6 -0
  3. comet.py +145 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,142 @@
  ---
- title: Comet
- emoji: 🐢
- colorFrom: yellow
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: COMET
+ emoji: 🤗
+ colorFrom: blue
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
+ # Metric Card for COMET
+
+ ## Metric description
+
+ Crosslingual Optimized Metric for Evaluation of Translation (COMET) is an open-source framework used to train Machine Translation metrics that achieve high levels of correlation with different types of human judgments.
+
+ ## How to use
+
+ COMET takes three lists of strings as input: `sources` (a list of source sentences), `predictions` (a list of candidate translations), and `references` (a list of reference translations).
+
+ ```python
+ from evaluate import load
+ comet_metric = load('comet')
+ source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
+ reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
+ comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ ```
+
+ It has several configurations, named after the COMET model to be used. It defaults to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`). Alternate models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`; the model name is passed as the second argument to `load`, as sketched below.
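+
+ For instance, a minimal sketch of loading one of the alternate models (any name from the list above works the same way; `wmt21-comet-mqm` is used here only as an illustration):
+
+ ```python
+ from evaluate import load
+ # select an alternate COMET checkpoint by passing its name as the configuration
+ comet_metric = load('comet', 'wmt21-comet-mqm')
+ ```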
+
+ It also has several optional arguments, passed to `compute` (usage is sketched after this list):
+
+ `gpus`: optional; an integer (the number of GPUs to run on) or a list of integers (which GPU ids to run on). Set it to 0 to run on CPU. The default value is `None` (use one GPU if available, otherwise the CPU).
+
+ `progress_bar`: a boolean -- if set to `True`, progress updates will be printed out. The default value is `False`.
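+
+ A minimal sketch of passing both options, continuing the example above (the values here are illustrative):
+
+ ```python
+ # force CPU execution and print progress updates
+ results = comet_metric.compute(
+     predictions=hypothesis, references=reference, sources=source, gpus=0, progress_bar=True
+ )
+ ```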
+
+ More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/models.html).
+
+ ## Output values
+
+ The COMET metric outputs two values:
+
+ `scores`: a list of COMET scores, one for each input sentence, ranging from 0 to 1.
+
+ `mean_score`: the mean of `scores` over all the input sentences, also ranging from 0 to 1.
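+
+ A minimal sketch of reading both fields from the dictionary returned by `compute`:
+
+ ```python
+ results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ print(results["mean_score"])  # a single float: the corpus-level average
+ print(results["scores"])      # a list of floats: one score per input sentence
+ ```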
+
+ ### Values from popular papers
+
+ The [original COMET paper](https://arxiv.org/pdf/2009.09025.pdf) reported average COMET scores ranging from 0.4 to 0.6, depending on the language pairs used for evaluating translation models. It also showed that COMET correlates better with human judgments than other metrics such as [BLEU](https://huggingface.co/metrics/bleu) and [CHRF](https://huggingface.co/metrics/chrf).
+
+ ## Examples
+
+ Full match:
+
+ ```python
+ from evaluate import load
+ comet_metric = load('comet')
+ source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ hypothesis = ["They were able to control the fire.", "Schools and kindergartens opened"]
+ reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
+ results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ print([round(v, 1) for v in results["scores"]])
+ [1.0, 1.0]
+ ```
+
+ Partial match:
+
+ ```python
+ from evaluate import load
+ comet_metric = load('comet')
+ source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
+ reference = ["They were able to control the fire", "Schools and kindergartens opened"]
+ results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ print([round(v, 2) for v in results["scores"]])
+ [0.19, 0.92]
+ ```
+
+ No match:
+
+ ```python
+ from evaluate import load
+ comet_metric = load('comet')
+ source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ hypothesis = ["The girl went for a walk", "The boy was sleeping"]
+ reference = ["They were able to control the fire", "Schools and kindergartens opened"]
+ results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ print([round(v, 2) for v in results["scores"]])
+ [0.00, 0.00]
+ ```
+
+ ## Limitations and bias
+
+ The models provided for calculating the COMET metric are built on top of XLM-R and cover the following languages:
+
+ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
+
+ Thus, results for language pairs containing uncovered languages are unreliable, as per the [COMET website](https://github.com/Unbabel/COMET).
+
+ Also, calculating the COMET metric involves downloading the model from which features are obtained: the default model, `wmt20-comet-da`, takes up over 1.79 GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `wmt21-cometinho-da` is 344 MB.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{rei-EtAl:2020:WMT,
+   author    = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
+   title     = {Unbabel's Participation in the WMT20 Metrics Shared Task},
+   booktitle = {Proceedings of the Fifth Conference on Machine Translation},
+   month     = {November},
+   year      = {2020},
+   address   = {Online},
+   publisher = {Association for Computational Linguistics},
+   pages     = {909--918},
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{rei-etal-2020-comet,
+   title     = "{COMET}: A Neural Framework for {MT} Evaluation",
+   author    = "Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon",
+   booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
+   month     = nov,
+   year      = "2020",
+   address   = "Online",
+   publisher = "Association for Computational Linguistics",
+   url       = "https://www.aclweb.org/anthology/2020.emnlp-main.213",
+   pages     = "2685--2702",
+ }
+ ```
+
+ ## Further References
+
+ - [COMET website](https://unbabel.github.io/COMET/html/index.html)
+ - [Hugging Face Tasks - Machine Translation](https://huggingface.co/tasks/translation)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
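+ # load the COMET metric module and expose it through evaluate's standard Gradio widget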
+ module = evaluate.load("comet")
+ launch_gradio_widget(module)
comet.py ADDED
@@ -0,0 +1,145 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """COMET metric.
+
+ Requirements:
+ pip install unbabel-comet
+
+ Usage:
+
+ ```python
+ from evaluate import load
+ comet_metric = load('metrics/comet/comet.py')
+ # comet_metric = load('comet')
+ # comet_metric = load('comet', 'wmt-large-hter-estimator')
+
+ source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
+ reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
+
+ predictions = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ predictions['scores']
+ ```
+ """
+
+ import comet  # From: unbabel-comet
+ import datasets
+ import torch
+
+ import evaluate
+
+
+ logger = evaluate.logging.get_logger(__name__)
+
+ _CITATION = """\
+ @inproceedings{rei-EtAl:2020:WMT,
+   author    = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
+   title     = {Unbabel's Participation in the WMT20 Metrics Shared Task},
+   booktitle = {Proceedings of the Fifth Conference on Machine Translation},
+   month     = {November},
+   year      = {2020},
+   address   = {Online},
+   publisher = {Association for Computational Linguistics},
+   pages     = {909--918},
+ }
+ @inproceedings{rei-etal-2020-comet,
+   title     = "{COMET}: A Neural Framework for {MT} Evaluation",
+   author    = "Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon",
+   booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
+   month     = nov,
+   year      = "2020",
+   address   = "Online",
+   publisher = "Association for Computational Linguistics",
+   url       = "https://www.aclweb.org/anthology/2020.emnlp-main.213",
+   pages     = "2685--2702",
+ }
+ """
+
+ _DESCRIPTION = """\
+ Crosslingual Optimized Metric for Evaluation of Translation (COMET) is an open-source framework used to train Machine Translation metrics that achieve high levels of correlation with different types of human judgments (HTER, DAs or MQM).
+ With the release of the framework, the authors also released fully trained models that were used to compete in the WMT20 Metrics Shared Task, achieving SOTA in that year's competition.
+
+ See https://unbabel.github.io/COMET/html/models.html for more information.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ COMET score.
+
+ Args:
+
+ `sources` (list of str): Source sentences.
+ `predictions` (list of str): Candidate translations.
+ `references` (list of str): Reference translations.
+ `gpus` (int or list of int): GPUs to run on; set to 0 to run on CPU. Defaults to one GPU if available, otherwise CPU.
+ `progress_bar` (bool): If set to True, shows a progress bar. Defaults to False.
+
+ The COMET model is selected via the configuration name passed to `load`; it defaults to `wmt20-comet-da`.
+
+ Returns:
+ `mean_score`: Mean COMET score over all the input sentences.
+ `scores`: List of COMET scores, one per input sentence.
+
+ Examples:
+
+ >>> comet_metric = evaluate.load('comet')
+ >>> # comet_metric = load('comet', 'wmt20-comet-da') # you can also choose which model to use
+ >>> source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
+ >>> hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
+ >>> reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
+ >>> results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
+ >>> print([round(v, 2) for v in results["scores"]])
+ [0.19, 0.92]
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class COMET(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://unbabel.github.io/COMET/html/index.html",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "sources": datasets.Value("string", id="sequence"),
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/Unbabel/COMET"],
+             reference_urls=[
+                 "https://github.com/Unbabel/COMET",
+                 "https://www.aclweb.org/anthology/2020.emnlp-main.213/",
+                 "http://www.statmt.org/wmt20/pdf/2020.wmt-1.101.pdf",
+             ],
+         )
+
+     def _download_and_prepare(self, dl_manager):
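+         # download the requested COMET checkpoint and load it; the "default"
+         # configuration maps to the wmt20-comet-da model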
+         if self.config_name == "default":
+             self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
+         else:
+             self.scorer = comet.load_from_checkpoint(comet.download_model(self.config_name))
+
+     def _compute(self, sources, predictions, references, gpus=None, progress_bar=False):
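+         # default to a single GPU when one is available, otherwise run on the CPU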
+         if gpus is None:
+             gpus = 1 if torch.cuda.is_available() else 0
+         data = {"src": sources, "mt": predictions, "ref": references}
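+         # convert the dict of parallel lists into a list of per-example dicts,
+         # the record format expected by scorer.predict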
+         data = [dict(zip(data, t)) for t in zip(*data.values())]
+         scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+         return {"mean_score": mean_score, "scores": scores}
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ unbabel-comet
+ torch