lvwerra (HF staff) committed
Commit 981697b
1 Parent(s): aa5e3a7

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +94 -4
  2. app.py +6 -0
  3. bleurt.py +125 -0
  4. requirements.txt +4 -0
README.md CHANGED
@@ -1,12 +1,102 @@
  ---
- title: Bleurt
- emoji: 🐨
- colorFrom: green
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

  ---
+ title: BLEURT
+ emoji: 🤗
+ colorFrom: blue
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
+ # Metric Card for BLEURT
+
+
+ ## Metric Description
+ BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning, starting from a pretrained BERT model [Devlin et al. 2018](https://arxiv.org/abs/1810.04805), followed by another pre-training phase on synthetic data, and finally trained on WMT human annotations.
+
+ It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better).
+ See the project's [README](https://github.com/google-research/bleurt#readme) for more information.
+
+ ## Intended Uses
+ BLEURT is intended to be used for evaluating text produced by language models.
+
+ ## How to Use
+
+ This metric takes as input lists of predicted sentences and reference sentences:
+
+ ```python
+ >>> from evaluate import load
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> bleurt = load("bleurt", module_type="metric")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ ```
+
+ ### Inputs
+ - **predictions** (`list` of `str`s): List of generated sentences to score.
+ - **references** (`list` of `str`s): List of references to compare to.
+ - **checkpoint** (`str`): BLEURT checkpoint to use. Defaults to `bleurt-base-128` if not specified. Available checkpoints are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"` (see also the sketch below).
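+
+ The checkpoint can also be selected by passing its name as the configuration name when loading the module; this is the form suggested by the warning message the module itself emits. A minimal sketch (note that the larger checkpoints are sizeable downloads):
+
+ ```python
+ >>> from evaluate import load
+ >>> bleurt = load("bleurt", "bleurt-large-512")  # checkpoint passed as the config name
+ >>> results = bleurt.compute(predictions=["hello there"], references=["hello there"])
+ ```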
+
+ ### Output Values
+ - **scores**: a `list` of scores, one per prediction.
+
+ Output Example:
+ ```python
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+ ```
+
+ The scores usually lie between 0 and 1, though they are not strictly bounded to this range (note the values slightly above 1 in the example above). They indicate how similar the generated text is to the reference texts, with higher values representing more similar texts.
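+
+ Because higher scores indicate closer agreement with the reference, the raw scores can be used directly to rank alternative generations for the same reference. A minimal illustrative sketch (the candidate sentences here are made up for the example):
+
+ ```python
+ >>> from evaluate import load
+ >>> bleurt = load("bleurt", module_type="metric")
+ >>> candidates = ["the cat sat on the mat", "a cat was sitting on a mat", "dogs bark loudly"]
+ >>> reference = "the cat sat on the mat"
+ >>> results = bleurt.compute(predictions=candidates, references=[reference] * len(candidates))
+ >>> best = max(zip(results["scores"], candidates))[1]  # candidate with the highest BLEURT score
+ ```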
+
+ #### Values from Popular Papers
+
+ The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric correlates better with human judgment than comparable metrics such as BLEU and BERTScore.
+
+ BLEURT is used to compare models across different tasks (e.g. [table-to-text generation](https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT)).
+
+ ### Examples
+
+ Example with the default model:
+ ```python
+ >>> from evaluate import load
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> bleurt = load("bleurt", module_type="metric")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+ ```
+
+ Example with the `"bleurt-base-128"` model checkpoint:
+ ```python
+ >>> from evaluate import load
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> bleurt = load("bleurt", module_type="metric", checkpoint="bleurt-base-128")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+ ```
+
+ ## Limitations and Bias
+ The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, although this depends on the model and language pair selected.
+
+ Furthermore, BLEURT currently only supports English-language scoring, given that it leverages models trained on English corpora. It may also reflect, to a certain extent, biases and correlations that were present in the model training data.
+
+ Finally, computing the BLEURT metric involves downloading the checkpoint used to calculate the score, which can take a significant amount of time depending on the model chosen. Starting with a small checkpoint such as `bleurt-tiny-128` and moving to larger models only if necessary can be a useful approach if memory or internet speed is an issue.
+
+
+ ## Citation
+ ```bibtex
+ @inproceedings{bleurt,
+   title={BLEURT: Learning Robust Metrics for Text Generation},
+   author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
+   booktitle={ACL},
+   year={2020},
+   url={https://arxiv.org/abs/2004.04696}
+ }
+ ```
+
+ ## Further References
+ - The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("bleurt")
+ launch_gradio_widget(module)
bleurt.py ADDED
@@ -0,0 +1,125 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """BLEURT metric."""
+
+ import os
+
+ import datasets
+ from bleurt import score  # From: git+https://github.com/google-research/bleurt.git
+
+ import evaluate
+
+
+ logger = evaluate.logging.get_logger(__name__)
+
+
+ _CITATION = """\
+ @inproceedings{bleurt,
+   title={BLEURT: Learning Robust Metrics for Text Generation},
+   author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
+   booktitle={ACL},
+   year={2020},
+   url={https://arxiv.org/abs/2004.04696}
+ }
+ """
+
+ _DESCRIPTION = """\
+ BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018)
+ and then employing another pre-training phase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune
+ it for your specific application (the latter is expected to perform better).
+
+ See the project's README at https://github.com/google-research/bleurt#readme for more information.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ BLEURT score.
+
+ Args:
+     `predictions` (list of str): prediction/candidate sentences
+     `references` (list of str): reference sentences
+     `checkpoint` BLEURT checkpoint. Will default to bleurt-base-128 if None.
+
+ Returns:
+     'scores': List of scores.
+ Examples:
+
+     >>> predictions = ["hello there", "general kenobi"]
+     >>> references = ["hello there", "general kenobi"]
+     >>> bleurt = evaluate.load("bleurt")
+     >>> results = bleurt.compute(predictions=predictions, references=references)
+     >>> print([round(v, 2) for v in results["scores"]])
+     [1.03, 1.04]
+ """
+
+ CHECKPOINT_URLS = {
+     "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip",
+     "bleurt-tiny-512": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-512.zip",
+     "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
+     "bleurt-base-512": "https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip",
+     "bleurt-large-128": "https://storage.googleapis.com/bleurt-oss/bleurt-large-128.zip",
+     "bleurt-large-512": "https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip",
+     "BLEURT-20-D3": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D3.zip",
+     "BLEURT-20-D6": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D6.zip",
+     "BLEURT-20-D12": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip",
+     "BLEURT-20": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip",
+ }
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class BLEURT(evaluate.EvaluationModule):
+     def _info(self):
+
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://github.com/google-research/bleurt",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/google-research/bleurt"],
+             reference_urls=["https://github.com/google-research/bleurt", "https://arxiv.org/abs/2004.04696"],
+         )
+
+     def _download_and_prepare(self, dl_manager):
+
+         # check that config name specifies a valid BLEURT model
+         if self.config_name == "default":
+             logger.warning(
+                 "Using default BLEURT-Base checkpoint for sequence maximum length 128. "
+                 "You can use a bigger model for better results with e.g.: evaluate.load('bleurt', 'bleurt-large-512')."
+             )
+             self.config_name = "bleurt-base-128"
+
+         if self.config_name.lower() in CHECKPOINT_URLS:
+             checkpoint_name = self.config_name.lower()
+
+         elif self.config_name.upper() in CHECKPOINT_URLS:
+             checkpoint_name = self.config_name.upper()
+
+         else:
+             raise KeyError(
+                 f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}"
+             )
+
+         # download the model checkpoint specified by self.config_name and set up the scorer
+         model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[checkpoint_name])
+         self.scorer = score.BleurtScorer(os.path.join(model_path, checkpoint_name))
+
+     def _compute(self, predictions, references):
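+         # BleurtScorer.score takes parallel lists of candidates and references and returns one float per pair.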
+         scores = self.scorer.score(references=references, candidates=predictions)
+         return {"scores": scores}
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ git+https://github.com/google-research/bleurt.git