lvwerra committed
Commit f1d7582
1 Parent(s): c44d64a

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +106 -4
  2. app.py +6 -0
  3. frugalscore.py +117 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,114 @@
  ---
- title: Frugalscore
- emoji: 📉
  colorFrom: blue
- colorTo: yellow
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title:
+ emoji: 🤗
  colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+
+ ## Metric Description
+ FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.
+
+ The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
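+
+ To make the distillation recipe concrete, here is a minimal sketch of the idea (an illustration, not the authors' training code; the checkpoint `prajjwal1/bert-tiny` and the toy sentence pairs are placeholders): an expensive teacher metric labels sentence pairs, and a small student model with a regression head is trained to reproduce those labels.
+
+ ```python
+ import evaluate
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ pairs = [("the cat sat", "a cat was sitting"), ("hello world", "hi earth")]
+
+ # 1) The expensive teacher metric (here BERTScore) produces the training targets.
+ bertscore = evaluate.load("bertscore")
+ targets = bertscore.compute(
+     predictions=[p for p, _ in pairs],
+     references=[r for _, r in pairs],
+     lang="en",
+ )["f1"]
+
+ # 2) A small student with a single regression head learns to map
+ #    (sentence1, sentence2) pairs directly to the teacher's score.
+ tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
+ student = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=1)
+ # ... train `student` on (pairs, targets) with a standard regression objective.
+ ```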
+
+ ## How to use
+
+ When loading FrugalScore, you can indicate the model you wish to use to compute the score. The default model is `moussaKam/frugalscore_tiny_bert-base_bert-score`, and a full list of models can be found in the [Limitations and bias](#limitations-and-bias) section.
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_medium_bert-base_mover-score")
+ ```
+
+ FrugalScore calculates how good the predictions are, given some references, based on a set of scores.
+
+ The inputs it takes are:
+
+ `predictions`: a list of strings representing the predictions to score.
+
+ `references`: a list of strings representing the references for each prediction.
+
+ Its optional arguments are:
+
+ `batch_size`: the batch size for predictions (default value is `32`).
+
+ `max_length`: the maximum sequence length (default value is `128`).
+
+ `device`: either `"gpu"` or `"cpu"` (default value is `None`, which picks `"gpu"` when CUDA is available and `"cpu"` otherwise).
+
+ ```python
+ >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
+ ```
+
+ ## Output values
+
+ The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:
+ ```python
+ {'scores': [0.6307541, 0.6449357]}
+ ```
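+
+ The scores are returned per prediction-reference pair. If you need a single system-level number, averaging them is a common convention (an aggregation choice on your side, not something the metric returns):
+
+ ```python
+ >>> import statistics
+ >>> statistics.mean(results['scores'])
+ 0.6378449
+ ```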
+
+ ### Values from popular papers
+ The [original FrugalScore paper](https://arxiv.org/abs/2110.08559) reported that FrugalScore-Tiny retains 97.7%/94.7% of the original performance compared to [BertScore](https://huggingface.co/metrics/bertscore) while running 54 times faster and having 84 times fewer parameters.
+
+ ## Examples
+
+ Maximal values (exact match between `references` and `predictions`):
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore")
+ >>> results = frugalscore.compute(predictions=['hello world'], references=['hello world'])
+ >>> print(results)
+ {'scores': [0.9891098]}
+ ```
+
+ Partial values:
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore")
+ >>> results = frugalscore.compute(predictions=['hello world'], references=['hugging face'])
+ >>> print(results)
+ {'scores': [0.42482382]}
+ ```
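+
+ Both cases can also be scored in a single batched call; the expected values below are inferred from the single-pair examples above (rounded, so minor padding-related differences are absorbed):
+
+ ```python
+ >>> results = frugalscore.compute(predictions=['hello world', 'hello world'], references=['hello world', 'hugging face'])
+ >>> print([round(s, 2) for s in results['scores']])
+ [0.99, 0.42]
+ ```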
+
+ ## Limitations and bias
+
+ FrugalScore is based on [BertScore](https://huggingface.co/metrics/bertscore) and [MoverScore](https://arxiv.org/abs/1909.02622), and the available models are distilled from the original models used for these scores.
+
+ The full list of available models for FrugalScore is:
+
+ | FrugalScore | Student | Teacher | Method |
+ |----------------------------------------------------|-------------|----------------|------------|
+ | [moussaKam/frugalscore_tiny_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_bert-score) | BERT-tiny | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_small_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_bert-score) | BERT-small | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_medium_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_bert-score) | BERT-medium | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_tiny_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_roberta_bert-score) | BERT-tiny | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_small_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_roberta_bert-score) | BERT-small | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_medium_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_roberta_bert-score) | BERT-medium | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_tiny_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_deberta_bert-score) | BERT-tiny | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_small_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_deberta_bert-score) | BERT-small | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_medium_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_deberta_bert-score) | BERT-medium | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_tiny_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_mover-score) | BERT-tiny | BERT-Base | MoverScore |
+ | [moussaKam/frugalscore_small_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_mover-score) | BERT-small | BERT-Base | MoverScore |
+ | [moussaKam/frugalscore_medium_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_mover-score) | BERT-medium | BERT-Base | MoverScore |
+
+ Loading time varies with the size of the model picked: the `tiny` models load very quickly, whereas the `medium` ones can take several minutes, depending on your Internet connection.
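+
+ Putting it together, a minimal sketch of picking one of the models above and letting the available hardware decide the device (the example sentences are placeholders):
+
+ ```python
+ >>> import torch
+ >>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_tiny_roberta_bert-score")
+ >>> device = "gpu" if torch.cuda.is_available() else "cpu"
+ >>> results = frugalscore.compute(predictions=['hello there'], references=['hello world'], device=device)
+ ```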
+
+ ## Citation
+ ```bibtex
+ @article{eddine2021frugalscore,
+   title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
+   author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
+   journal={arXiv preprint arXiv:2110.08559},
+   year={2021}
+ }
+ ```
+
+ ## Further References
+ - [Original FrugalScore code](https://github.com/moussaKam/FrugalScore)
+ - [FrugalScore paper](https://arxiv.org/abs/2110.08559)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("frugalscore")
+ launch_gradio_widget(module)
frugalscore.py ADDED
@@ -0,0 +1,117 @@
+ # Copyright 2022 The HuggingFace Datasets Authors and the current metric script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """FrugalScore metric."""
+
+ import datasets
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
+
+ import evaluate
+
+
+ _CITATION = """\
+ @article{eddine2021frugalscore,
+     title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
+     author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
+     journal={arXiv preprint arXiv:2110.08559},
+     year={2021}
+ }
+ """
+
+ _DESCRIPTION = """\
+ FrugalScore is a reference-based metric for NLG model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Calculates how good the predictions are, given some references, using certain scores.
+ Args:
+     predictions (list of str): list of predictions to score. Each prediction
+         should be a string.
+     references (list of str): list of references, one for each prediction. Each
+         reference should be a string.
+     batch_size (int): the batch size for predictions.
+     max_length (int): maximum sequence length.
+     device (str): either "gpu" or "cpu".
+ Returns:
+     scores (list of float): list of scores.
+ Examples:
+     >>> frugalscore = evaluate.load("frugalscore")
+     >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'])
+     >>> print([round(s, 3) for s in results["scores"]])
+     [0.631, 0.645]
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class FRUGALSCORE(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
+             homepage="https://github.com/moussaKam/FrugalScore",
+         )
+
+     def _download_and_prepare(self, dl_manager):
+         # The config name selects which distilled student checkpoint to load.
+         if self.config_name == "default":
+             checkpoint = "moussaKam/frugalscore_tiny_bert-base_bert-score"
+         else:
+             checkpoint = self.config_name
+         self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
+         self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+
+     def _compute(
+         self,
+         predictions,
+         references,
+         batch_size=32,
+         max_length=128,
+         device=None,
+     ):
+         """Returns the scores"""
+         assert len(predictions) == len(
+             references
+         ), "predictions and references should have the same number of sentences."
+         if device is not None:
+             assert device in ["gpu", "cpu"], "device should be either gpu or cpu."
+         else:
+             device = "gpu" if torch.cuda.is_available() else "cpu"
+         # A Trainer is used purely for batched inference over the sentence pairs.
+         training_args = TrainingArguments(
+             "trainer",
+             fp16=(device == "gpu"),
+             per_device_eval_batch_size=batch_size,
+             report_to="all",
+             no_cuda=(device == "cpu"),
+             log_level="warning",
+         )
+         dataset = {"sentence1": predictions, "sentence2": references}
+         raw_datasets = datasets.Dataset.from_dict(dataset)
+
+         def tokenize_function(data):
+             return self.tokenizer(
+                 data["sentence1"], data["sentence2"], max_length=max_length, truncation=True, padding=True
+             )
+
+         tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
+         # `remove_columns` returns a new dataset rather than mutating in place,
+         # so the result has to be reassigned.
+         tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2"])
+         trainer = Trainer(self.model, training_args, tokenizer=self.tokenizer)
+         predictions = trainer.predict(tokenized_datasets)
+         return {"scores": list(predictions.predictions.squeeze(-1))}
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ torch
+ transformers