lvwerra committed
Commit f1d7582
1 Parent(s): c44d64a

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +106 -4
  2. app.py +6 -0
  3. frugalscore.py +117 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,114 @@
  ---
- title: Frugalscore
- emoji: 📉
  colorFrom: blue
- colorTo: yellow
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title:
+ emoji: 🤗
  colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+
+ ## Metric Description
+ FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.
+
+ The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
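+
+ To make the distillation recipe concrete, here is a minimal sketch of the idea (an illustration, not the authors' training code; the checkpoint `prajjwal1/bert-tiny` and the toy sentence pairs are placeholders): an expensive teacher metric labels sentence pairs, and a small student model with a regression head is trained to reproduce those labels.
+
+ ```python
+ import evaluate
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ pairs = [("the cat sat", "a cat was sitting"), ("hello world", "hi earth")]
+
+ # 1) The expensive teacher metric (here BERTScore) produces the training targets.
+ bertscore = evaluate.load("bertscore")
+ targets = bertscore.compute(
+     predictions=[p for p, _ in pairs],
+     references=[r for _, r in pairs],
+     lang="en",
+ )["f1"]
+
+ # 2) A small student with a single regression head learns to map
+ #    (sentence1, sentence2) pairs directly to the teacher's score.
+ tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
+ student = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=1)
+ # ... train `student` on (pairs, targets) with a standard regression objective.
+ ```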
+
+ ## How to use
+
+ When loading FrugalScore, you can indicate the model you wish to use to compute the score. The default model is `moussaKam/frugalscore_tiny_bert-base_bert-score`, and a full list of models can be found in the [Limitations and bias](#limitations-and-bias) section.
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_medium_bert-base_mover-score")
+ ```
+
+ FrugalScore calculates how good the predictions are, given some references, based on a set of scores.
+
+ The inputs it takes are:
+
+ `predictions`: a list of strings representing the predictions to score.
+
+ `references`: a list of strings representing the references for each prediction.
+
+ Its optional arguments are:
+
+ `batch_size`: the batch size for predictions (default value is `32`).
+
+ `max_length`: the maximum sequence length (default value is `128`).
+
+ `device`: either `"gpu"` or `"cpu"` (default value is `None`, which picks `"gpu"` when CUDA is available and `"cpu"` otherwise).
+
+ ```python
+ >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
+ ```
+
+ ## Output values
+
+ The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:
+ ```python
+ {'scores': [0.6307541, 0.6449357]}
+ ```
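+
+ The scores are returned per prediction-reference pair. If you need a single system-level number, averaging them is a common convention (an aggregation choice on your side, not something the metric returns):
+
+ ```python
+ >>> import statistics
+ >>> statistics.mean(results['scores'])
+ 0.6378449
+ ```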
+
+ ### Values from popular papers
+ The [original FrugalScore paper](https://arxiv.org/abs/2110.08559) reported that FrugalScore-Tiny retains 97.7%/94.7% of the original performance compared to [BertScore](https://huggingface.co/metrics/bertscore) while running 54 times faster and having 84 times fewer parameters.
+
+ ## Examples
+
+ Maximal values (exact match between `references` and `predictions`):
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore")
+ >>> results = frugalscore.compute(predictions=['hello world'], references=['hello world'])
+ >>> print(results)
+ {'scores': [0.9891098]}
+ ```
+
+ Partial values:
+
+ ```python
+ >>> frugalscore = evaluate.load("frugalscore")
+ >>> results = frugalscore.compute(predictions=['hello world'], references=['hugging face'])
+ >>> print(results)
+ {'scores': [0.42482382]}
+ ```
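+
+ Both cases can also be scored in a single batched call; the expected values below are inferred from the single-pair examples above (rounded, so minor padding-related differences are absorbed):
+
+ ```python
+ >>> results = frugalscore.compute(predictions=['hello world', 'hello world'], references=['hello world', 'hugging face'])
+ >>> print([round(s, 2) for s in results['scores']])
+ [0.99, 0.42]
+ ```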
+
+ ## Limitations and bias
+
+ FrugalScore is based on [BertScore](https://huggingface.co/metrics/bertscore) and [MoverScore](https://arxiv.org/abs/1909.02622), and the available models are distilled from the original models used for these scores.
+
+ The full list of available models for FrugalScore is:
+
+ | FrugalScore | Student | Teacher | Method |
+ |----------------------------------------------------|-------------|----------------|------------|
+ | [moussaKam/frugalscore_tiny_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_bert-score) | BERT-tiny | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_small_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_bert-score) | BERT-small | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_medium_bert-base_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_bert-score) | BERT-medium | BERT-Base | BERTScore |
+ | [moussaKam/frugalscore_tiny_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_roberta_bert-score) | BERT-tiny | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_small_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_roberta_bert-score) | BERT-small | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_medium_roberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_roberta_bert-score) | BERT-medium | RoBERTa-Large | BERTScore |
+ | [moussaKam/frugalscore_tiny_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_tiny_deberta_bert-score) | BERT-tiny | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_small_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_small_deberta_bert-score) | BERT-small | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_medium_deberta_bert-score](https://huggingface.co/moussaKam/frugalscore_medium_deberta_bert-score) | BERT-medium | DeBERTa-XLarge | BERTScore |
+ | [moussaKam/frugalscore_tiny_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_tiny_bert-base_mover-score) | BERT-tiny | BERT-Base | MoverScore |
+ | [moussaKam/frugalscore_small_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_small_bert-base_mover-score) | BERT-small | BERT-Base | MoverScore |
+ | [moussaKam/frugalscore_medium_bert-base_mover-score](https://huggingface.co/moussaKam/frugalscore_medium_bert-base_mover-score) | BERT-medium | BERT-Base | MoverScore |
+
+ Loading time varies with the size of the model picked: the `tiny` models load very quickly, whereas the `medium` ones can take several minutes, depending on your Internet connection.
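+
+ Putting it together, a minimal sketch of picking one of the models above and letting the available hardware decide the device (the example sentences are placeholders):
+
+ ```python
+ >>> import torch
+ >>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_tiny_roberta_bert-score")
+ >>> device = "gpu" if torch.cuda.is_available() else "cpu"
+ >>> results = frugalscore.compute(predictions=['hello there'], references=['hello world'], device=device)
+ ```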
+
+ ## Citation
+ ```bibtex
+ @article{eddine2021frugalscore,
+   title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
+   author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
+   journal={arXiv preprint arXiv:2110.08559},
+   year={2021}
+ }
+ ```
+
+ ## Further References
+ - [Original FrugalScore code](https://github.com/moussaKam/FrugalScore)
+ - [FrugalScore paper](https://arxiv.org/abs/2110.08559)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("frugalscore")
+ launch_gradio_widget(module)
frugalscore.py ADDED
@@ -0,0 +1,117 @@
+ # Copyright 2022 The HuggingFace Datasets Authors and the current metric script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """FrugalScore metric."""
+
+ import datasets
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
+
+ import evaluate
+
+
+ _CITATION = """\
+ @article{eddine2021frugalscore,
+     title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
+     author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
+     journal={arXiv preprint arXiv:2110.08559},
+     year={2021}
+ }
+ """
+
+ _DESCRIPTION = """\
+ FrugalScore is a reference-based metric for NLG model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Calculates how good the predictions are, given some references, using certain scores.
+ Args:
+     predictions (list of str): list of predictions to score. Each prediction
+         should be a string.
+     references (list of str): list of references, one for each prediction. Each
+         reference should be a string.
+     batch_size (int): the batch size for predictions.
+     max_length (int): maximum sequence length.
+     device (str): either "gpu" or "cpu".
+ Returns:
+     scores (list of float): list of scores.
+ Examples:
+     >>> frugalscore = evaluate.load("frugalscore")
+     >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'])
+     >>> print([round(s, 3) for s in results["scores"]])
+     [0.631, 0.645]
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class FRUGALSCORE(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
+             homepage="https://github.com/moussaKam/FrugalScore",
+         )
+
+     def _download_and_prepare(self, dl_manager):
+         # The config name selects which distilled student checkpoint to load.
+         if self.config_name == "default":
+             checkpoint = "moussaKam/frugalscore_tiny_bert-base_bert-score"
+         else:
+             checkpoint = self.config_name
+         self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
+         self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+
+     def _compute(
+         self,
+         predictions,
+         references,
+         batch_size=32,
+         max_length=128,
+         device=None,
+     ):
+         """Returns the scores"""
+         assert len(predictions) == len(
+             references
+         ), "predictions and references should have the same number of sentences."
+         if device is not None:
+             assert device in ["gpu", "cpu"], "device should be either gpu or cpu."
+         else:
+             device = "gpu" if torch.cuda.is_available() else "cpu"
+         # A Trainer is used purely for batched inference over the sentence pairs.
+         training_args = TrainingArguments(
+             "trainer",
+             fp16=(device == "gpu"),
+             per_device_eval_batch_size=batch_size,
+             report_to="all",
+             no_cuda=(device == "cpu"),
+             log_level="warning",
+         )
+         dataset = {"sentence1": predictions, "sentence2": references}
+         raw_datasets = datasets.Dataset.from_dict(dataset)
+
+         def tokenize_function(data):
+             return self.tokenizer(
+                 data["sentence1"], data["sentence2"], max_length=max_length, truncation=True, padding=True
+             )
+
+         tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
+         # `remove_columns` returns a new dataset rather than mutating in place,
+         # so the result has to be reassigned.
+         tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2"])
+         trainer = Trainer(self.model, training_args, tokenizer=self.tokenizer)
+         predictions = trainer.predict(tokenized_datasets)
+         return {"scores": list(predictions.predictions.squeeze(-1))}
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ torch
+ transformers