README.md · evaluate-metric/frugalscore at main

metadata

title: null
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  FrugalScore is a reference-based metric for NLG model evaluation. It is based
  on a distillation approach that makes it possible to learn a fixed, low-cost
  version of any expensive NLG metric while retaining most of its original
  performance.

Metric Description

FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.

The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation, and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
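The distillation idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "teacher" is a simple token-overlap F1 standing in for an expensive metric, and the "student" is a two-weight linear model fitted by gradient descent on a mean squared error. The actual FrugalScore students are small BERT models trained on a large synthetic dataset, so this shows only the shape of the training loop, not the authors' implementation.

```python
def token_sets(reference, candidate):
    return set(reference.split()), set(candidate.split())


def teacher_metric(reference, candidate):
    """Stand-in for the expensive metric: token-overlap F1 (not BERTScore)."""
    ref, cand = token_sets(reference, candidate)
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def student_features(reference, candidate):
    """Hypothetical cheap student inputs: a bias term and the Jaccard overlap."""
    ref, cand = token_sets(reference, candidate)
    return [1.0, len(ref & cand) / len(ref | cand)]


def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def mse(weights, data):
    return sum((predict(weights, f) - y) ** 2 for f, y in data) / len(data)


def distill(pairs, epochs=500, lr=0.1):
    # Label every pair with the teacher's score, then fit the student to those labels.
    data = [(student_features(r, c), teacher_metric(r, c)) for r, c in pairs]
    weights = [0.0, 0.0]
    for _ in range(epochs):
        grads = [0.0, 0.0]
        for features, target in data:
            error = predict(weights, features) - target
            for i, x in enumerate(features):
                grads[i] += 2 * error * x / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, data


# Tiny hand-written dataset (illustrative only).
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("the cat sat on the mat", "a cat sat on a mat"),
    ("the cat sat on the mat", "dogs bark loudly"),
    ("hello world", "hello there"),
]
weights, data = distill(pairs)
print("MSE before:", mse([0.0, 0.0], data), "after:", mse(weights, data))
```

Once fitted, the student can score new pairs without calling the teacher again, which is where the cost saving comes from.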

How to use

When loading FrugalScore, you can specify which model to use to compute the score. The default model is moussaKam/frugalscore_tiny_bert-base_bert-score, and a full list of models can be found in the Limitations and bias section.

>>> import evaluate
>>> frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_medium_bert-base_mover-score")

FrugalScore computes one score per prediction, indicating how well that prediction matches its reference.

The inputs it takes are:

predictions: a list of strings representing the predictions to score.

references: a list of strings representing the references, one for each prediction.

Its optional arguments are:

batch_size: the batch size for predictions (default value is 32).

max_length: the maximum sequence length (default value is 128).

device: either "gpu" or "cpu" (default value is None).

>>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
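Since predictions and references are paired by position, it can help to check the list lengths before scoring and to keep each score next to the pair it belongs to. The following is a minimal sketch: `score_pairs` and `StubMetric` are hypothetical names introduced here, and the stub only exists so the snippet runs without downloading a model.

```python
def score_pairs(metric, predictions, references, **compute_kwargs):
    """Score aligned lists and return (prediction, reference, score) rows."""
    if len(predictions) != len(references):
        raise ValueError(
            f"got {len(predictions)} predictions but {len(references)} references"
        )
    results = metric.compute(
        predictions=predictions, references=references, **compute_kwargs
    )
    return list(zip(predictions, references, results["scores"]))


class StubMetric:
    """Offline stand-in for the loaded metric; returns a constant score per pair."""

    def compute(self, predictions, references, **kwargs):
        return {"scores": [0.5 for _ in predictions]}


rows = score_pairs(
    StubMetric(),
    ["hello there", "huggingface"],
    ["hello world", "hugging face"],
    batch_size=16,
)
for prediction, reference, score in rows:
    print(f"{prediction!r} vs {reference!r}: {score}")
```

To use it with the real metric, pass the object returned by `evaluate.load("frugalscore")` in place of `StubMetric()`; any optional arguments such as `batch_size`, `max_length`, or `device` are forwarded unchanged.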

Output values

The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:

{'scores': [0.6307541, 0.6449357]}
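The metric returns one score per pair and no corpus-level figure. If a single number is needed, one common choice is to average the per-pair scores. This is an assumption about how you might aggregate, not something FrugalScore prescribes; `corpus_score` is a hypothetical helper name.

```python
from statistics import mean


def corpus_score(results):
    """Average the per-pair scores in a FrugalScore-style result dictionary."""
    scores = results["scores"]
    if not scores:
        raise ValueError("no scores to aggregate")
    return mean(scores)


# The example output shown above.
example = {"scores": [0.6307541, 0.6449357]}
print(round(corpus_score(example), 4))  # 0.6378
```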

Values from popular papers

The original FrugalScore paper reported that FrugalScore-Tiny retains 97.7/94.7% of the performance of the original BERTScore while running 54 times faster and having 84 times fewer parameters.

Examples

Maximal values (exact match between references and predictions):

>>> import evaluate
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hello world'])
>>> print(results)
{'scores': [0.9891098]}

Partial values:

>>> import evaluate
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello world'], references=['hugging face'])
>>> print(results)
{'scores': [0.42482382]}

Limitations and bias

FrugalScore is based on BERTScore and MoverScore, and the models used are based on the original models used for these scores.

The full list of available models for FrugalScore is:

| FrugalScore | Student | Teacher | Method |
|---|---|---|---|
| moussaKam/frugalscore_tiny_bert-base_bert-score | BERT-tiny | BERT-Base | BERTScore |
| moussaKam/frugalscore_small_bert-base_bert-score | BERT-small | BERT-Base | BERTScore |
| moussaKam/frugalscore_medium_bert-base_bert-score | BERT-medium | BERT-Base | BERTScore |
| moussaKam/frugalscore_tiny_roberta_bert-score | BERT-tiny | RoBERTa-Large | BERTScore |
| moussaKam/frugalscore_small_roberta_bert-score | BERT-small | RoBERTa-Large | BERTScore |
| moussaKam/frugalscore_medium_roberta_bert-score | BERT-medium | RoBERTa-Large | BERTScore |
| moussaKam/frugalscore_tiny_deberta_bert-score | BERT-tiny | DeBERTa-XLarge | BERTScore |
| moussaKam/frugalscore_small_deberta_bert-score | BERT-small | DeBERTa-XLarge | BERTScore |
| moussaKam/frugalscore_medium_deberta_bert-score | BERT-medium | DeBERTa-XLarge | BERTScore |
| moussaKam/frugalscore_tiny_bert-base_mover-score | BERT-tiny | BERT-Base | MoverScore |
| moussaKam/frugalscore_small_bert-base_mover-score | BERT-small | BERT-Base | MoverScore |
| moussaKam/frugalscore_medium_bert-base_mover-score | BERT-medium | BERT-Base | MoverScore |
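The model names listed above follow a regular pattern: student size, then a teacher tag, then a method tag. A small helper can build a name from those three parts and reject combinations that are not in the list. This is a sketch based only on the names shown here; `checkpoint_name` and `all_checkpoints` are hypothetical helpers, not part of the `evaluate` library.

```python
STUDENT_SIZES = ("tiny", "small", "medium")

# Teacher tag -> method tags, as they appear in the listed model names.
TEACHER_METHODS = {
    "bert-base": ("bert-score", "mover-score"),
    "roberta": ("bert-score",),
    "deberta": ("bert-score",),
}


def checkpoint_name(size, teacher, method):
    """Build a FrugalScore model name, rejecting combinations not in the list."""
    if size not in STUDENT_SIZES:
        raise ValueError(f"unknown student size: {size!r}")
    if method not in TEACHER_METHODS.get(teacher, ()):
        raise ValueError(f"no listed model for teacher {teacher!r} and method {method!r}")
    return f"moussaKam/frugalscore_{size}_{teacher}_{method}"


def all_checkpoints():
    """Enumerate every listed combination."""
    return [
        checkpoint_name(size, teacher, method)
        for teacher, methods in TEACHER_METHODS.items()
        for method in methods
        for size in STUDENT_SIZES
    ]


print(checkpoint_name("medium", "bert-base", "mover-score"))
print(len(all_checkpoints()), "models")
```

The resulting string can be passed as the second argument to `evaluate.load("frugalscore", ...)`.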

Depending on the size of the model picked, the loading time will vary: the tiny models will load very quickly, whereas the medium ones can take several minutes, depending on your Internet connection.

Citation

@article{eddine2021frugalscore,
  title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
  author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2110.08559},
  year={2021}
}
