Update Space (evaluate main: 828c6327)
- README.md +106 -4
- app.py +6 -0
- frugalscore.py +117 -0
- requirements.txt +5 -0
README.md
CHANGED
@@ -1,12 +1,114 @@
---
title:
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---

## Metric Description

FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that learns a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.

The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
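
For intuition, here is a minimal sketch of that kind of distillation setup: a small student model with a regression head is fine-tuned to predict the scores produced by an expensive teacher metric on sentence pairs. The toy data, the `prajjwal1/bert-tiny` student checkpoint and all hyperparameters below are placeholders, not the recipe used for the released FrugalScore models:

```python
import datasets
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Toy synthetic data: sentence pairs already scored by an expensive teacher
# metric (e.g. BERTScore). In practice this would be a large synthetic corpus.
pairs = [("hello there", "hello world"), ("huggingface", "hugging face")]
teacher_scores = [0.63, 0.64]

student_checkpoint = "prajjwal1/bert-tiny"  # any small encoder could serve as the student
tokenizer = AutoTokenizer.from_pretrained(student_checkpoint)
# num_labels=1 turns the classification head into a single regression output.
model = AutoModelForSequenceClassification.from_pretrained(student_checkpoint, num_labels=1)

train_set = datasets.Dataset.from_dict(
    {
        "sentence1": [p[0] for p in pairs],
        "sentence2": [p[1] for p in pairs],
        "label": teacher_scores,
    }
)

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

train_set = train_set.map(tokenize, batched=True)

# Regression fine-tuning: the student learns to reproduce the teacher's scores,
# so that at evaluation time only the cheap student has to be run.
trainer = Trainer(
    model=model,
    args=TrainingArguments("frugal-student", num_train_epochs=1, report_to="none"),
    train_dataset=train_set,
    tokenizer=tokenizer,
)
trainer.train()
```
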
## How to use

When loading FrugalScore, you can indicate the model you wish to use to compute the score. The default model is `moussaKam/frugalscore_tiny_bert-base_bert-score`, and a full list of models can be found in the [Limitations and bias](#Limitations-and-bias) section.

```python
>>> import evaluate
>>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_medium_bert-base_mover-score")
```

FrugalScore calculates how good the predictions are given some references, returning one score per prediction-reference pair.

The inputs it takes are:

`predictions`: a list of strings representing the predictions to score.

`references`: a list of strings representing the references for each prediction.

Its optional arguments are:

`batch_size`: the batch size for predictions (default value is `32`).

`max_length`: the maximum sequence length (default value is `128`).

`device`: either "gpu" or "cpu" (default value is `None`, which selects "gpu" when a GPU is available and "cpu" otherwise).

```python
>>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
```

## Output values

The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:
```python
{'scores': [0.6307541, 0.6449357]}
```
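
The scores are not aggregated by the metric itself. If you want a single corpus-level number, one simple choice (an illustration, not something prescribed by FrugalScore) is to average them; using the example values shown above:

```python
>>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'])
>>> round(sum(results['scores']) / len(results['scores']), 2)
0.64
```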

### Values from popular papers

The [original FrugalScore paper](https://arxiv.org/abs/2110.08559) reported that FrugalScore-Tiny retains 97.7/94.7% of the original performance compared to [BertScore](https://huggingface.co/metrics/bertscore), while running 54 times faster and having 84 times fewer parameters.

## Examples

Maximal values (exact match between `references` and `predictions`):

```python
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hello world'])
>>> print(results)
{'scores': [0.9891098]}
```

Partial values:

```python
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hugging face'])
>>> print(results)
{'scores': [0.42482382]}
```

## Limitations and bias

FrugalScore is based on [BertScore](https://huggingface.co/metrics/bertscore) and [MoverScore](https://arxiv.org/abs/1909.02622), and the models used are based on the original models used for these scores.

The full list of available models for FrugalScore is:

| FrugalScore | Student | Teacher | Method |
|----------------------------------------------------|-------------|----------------|------------|
| [moussaKam/frugalscore_tiny_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_bert-score) | BERT-tiny | BERT-Base | BERTScore |
| [moussaKam/frugalscore_small_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_bert-score) | BERT-small | BERT-Base | BERTScore |
| [moussaKam/frugalscore_medium_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_bert-score) | BERT-medium | BERT-Base | BERTScore |
| [moussaKam/frugalscore_tiny_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_roberta_bert-score) | BERT-tiny | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_small_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_roberta_bert-score) | BERT-small | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_medium_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_roberta_bert-score) | BERT-medium | RoBERTa-Large | BERTScore |
| [moussaKam/frugalscore_tiny_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_deberta_bert-score) | BERT-tiny | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_small_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_deberta_bert-score) | BERT-small | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_medium_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_deberta_bert-score) | BERT-medium | DeBERTa-XLarge | BERTScore |
| [moussaKam/frugalscore_tiny_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_mover-score) | BERT-tiny | BERT-Base | MoverScore |
| [moussaKam/frugalscore_small_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_mover-score) | BERT-small | BERT-Base | MoverScore |
| [moussaKam/frugalscore_medium_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_mover-score) | BERT-medium | BERT-Base | MoverScore |

Depending on the size of the model picked, the loading time will vary: the `tiny` models load very quickly, whereas the `medium` ones can take several minutes, depending on your Internet connection.

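Any checkpoint from the table above can be passed as the second argument to `evaluate.load`, exactly as in the [How to use](#How-to-use) section; for example, with a DeBERTa-teacher variant:

```python
>>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_tiny_deberta_bert-score")
>>> results = frugalscore.compute(predictions=['hello there'], references=['hello world'])
```
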
## Citation
```bibtex
@article{eddine2021frugalscore,
  title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
  author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2110.08559},
  year={2021}
}
```

## Further References

- [Original FrugalScore code](https://github.com/moussaKam/FrugalScore)
- [FrugalScore paper](https://arxiv.org/abs/2110.08559)

app.py
ADDED
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("frugalscore")
launch_gradio_widget(module)

frugalscore.py
ADDED
@@ -0,0 +1,117 @@
# Copyright 2022 The HuggingFace Datasets Authors and the current metric script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""FrugalScore metric."""

import datasets
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

import evaluate


_CITATION = """\
@article{eddine2021frugalscore,
  title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
  author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2110.08559},
  year={2021}
}
"""

_DESCRIPTION = """\
FrugalScore is a reference-based metric for NLG model evaluation. It is based on a distillation approach that learns a fixed, low-cost version of any expensive NLG metric, while retaining most of its original performance.
"""


_KWARGS_DESCRIPTION = """
Calculates how good predictions are given some references, using certain scores.
Args:
    predictions (list of str): list of predictions to score. Each prediction
        should be a string.
    references (list of str): list of references for each prediction. Each
        reference should be a string.
    batch_size (int): the batch size for predictions.
    max_length (int): maximum sequence length.
    device (str): either gpu or cpu
Returns:
    scores (list of float): list of scores.
Examples:
    >>> frugalscore = evaluate.load("frugalscore")
    >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'])
    >>> print([round(s, 3) for s in results["scores"]])
    [0.631, 0.645]
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class FRUGALSCORE(evaluate.EvaluationModule):
    def _info(self):
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
            homepage="https://github.com/moussaKam/FrugalScore",
        )

    def _download_and_prepare(self, dl_manager):
        if self.config_name == "default":
            checkpoint = "moussaKam/frugalscore_tiny_bert-base_bert-score"
        else:
            checkpoint = self.config_name
        self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def _compute(
        self,
        predictions,
        references,
        batch_size=32,
        max_length=128,
        device=None,
    ):
        """Returns the scores"""
        assert len(predictions) == len(
            references
        ), "predictions and references should have the same number of sentences."
        if device is not None:
            assert device in ["gpu", "cpu"], "device should be either gpu or cpu."
        else:
            device = "gpu" if torch.cuda.is_available() else "cpu"
        training_args = TrainingArguments(
            "trainer",
            fp16=(device == "gpu"),
            per_device_eval_batch_size=batch_size,
            report_to="all",
            no_cuda=(device == "cpu"),
            log_level="warning",
        )
        dataset = {"sentence1": predictions, "sentence2": references}
        raw_datasets = datasets.Dataset.from_dict(dataset)

        def tokenize_function(data):
            return self.tokenizer(
                data["sentence1"], data["sentence2"], max_length=max_length, truncation=True, padding=True
            )

        tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
        tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2"])
        trainer = Trainer(self.model, training_args, tokenizer=self.tokenizer)
        predictions = trainer.predict(tokenized_datasets)
        return {"scores": list(predictions.predictions.squeeze(-1))}

requirements.txt
ADDED
@@ -0,0 +1,5 @@
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
torch
transformers