lvwerra committed
Commit 9f064dc
1 Parent(s): 1cd4639

Update Space (evaluate main: 544f1e8a)

Files changed (4)
  1. README.md +100 -6
  2. app.py +6 -0
  3. character.py +169 -0
  4. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,106 @@
  ---
- title: Character
- emoji: 😻
- colorFrom: green
- colorTo: blue
  sdk: gradio
- sdk_version: 3.12.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: CharacTER
+ emoji: 🔤
+ colorFrom: orange
+ colorTo: red
  sdk: gradio
+ sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ - machine-translation
+ description: >-
+   CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).
  ---

+ # Metric Card for CharacTER
+
+ ## Metric Description
+ CharacTer is a character-level metric inspired by the translation edit rate (TER) metric. It is defined as the
+ minimum number of character edits required to adjust a hypothesis until it completely matches the reference,
+ normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit distance while
+ performing the shift edit on the word level. Unlike the strict matching criterion in TER, a hypothesis word is
+ considered to match a reference word, and can be shifted, if the edit distance between them is below a threshold
+ value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
+ character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+ normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve
+ lower TER.
+
+ ## Intended Uses
+ CharacTER was developed for machine translation evaluation.
+
+ ## How to Use
+
+ ```python
+ import evaluate
+ character = evaluate.load("character")
+
+ # Single hyp/ref
+ preds = ["this week the saudis denied information published in the new york times"]
+ refs = ["saudi arabia denied this week information published in the american new york times"]
+ results = character.compute(references=refs, predictions=preds)
+
+ # Corpus example
+ preds = ["this week the saudis denied information published in the new york times",
+          "this is in fact an estimate"]
+ refs = ["saudi arabia denied this week information published in the american new york times",
+         "this is actually an estimate"]
+ results = character.compute(references=refs, predictions=preds)
+ ```
+
+ ### Inputs
+ - **predictions**: a single prediction or a list of predictions to score. Each prediction should be a string with
+   tokens separated by spaces.
+ - **references**: a single reference or a list of references, one for each prediction. Each reference should be a
+   string with tokens separated by spaces.
+
+ ### Output Values
+
+ (*) = only returned when a list of references/hypotheses is given
+
+ - **count** (*): how many parallel sentences were processed
+ - **mean** (*): the mean CharacTER score
+ - **median** (*): the median score
+ - **std** (*): standard deviation of the scores
+ - **min** (*): smallest score
+ - **max** (*): largest score
+ - **cer_scores**: all scores, one per ref/hyp pair
+
+ ### Output Example
+ ```python
+ {
+     'count': 2,
+     'mean': 0.3127282211789254,
+     'median': 0.3127282211789254,
+     'std': 0.07561653111280243,
+     'min': 0.25925925925925924,
+     'max': 0.36619718309859156,
+     'cer_scores': [0.36619718309859156, 0.25925925925925924]
+ }
+ ```
+
+ ## Citation
+ ```bibtex
+ @inproceedings{wang-etal-2016-character,
+     title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+     author = "Wang, Weiyue and
+       Peter, Jan-Thorsten and
+       Rosendahl, Hendrik and
+       Ney, Hermann",
+     booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+     month = aug,
+     year = "2016",
+     address = "Berlin, Germany",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/W16-2342",
+     doi = "10.18653/v1/W16-2342",
+     pages = "505--510",
+ }
+ ```
+
+ ## Further References
+ - Repackaged version that is used in this HF implementation: [https://github.com/bramvanroy/CharacTER](https://github.com/bramvanroy/CharacTER)
+ - Original version: [https://github.com/rwth-i6/CharacTER](https://github.com/rwth-i6/CharacTER)
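
The metric card above highlights CharacTER's key design choice: edit distance is computed at the character level but normalized by the length of the hypothesis rather than the reference. As a rough, shift-free illustration of that normalization only (the actual module delegates the full shift-aware computation to the `cer` package; the function names below are purely illustrative), a minimal sketch could look like this:

```python
def char_edit_distance(hyp: str, ref: str) -> int:
    """Plain character-level Levenshtein distance (no word-level shift edits)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,               # delete a hypothesis character
                           cur[j - 1] + 1,            # insert a reference character
                           prev[j - 1] + (h != r)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def simplified_character_score(hyp: str, ref: str) -> float:
    # Normalizing by the *hypothesis* length is what counters the bias towards
    # overly short translations that the card describes for plain TER.
    return char_edit_distance(hyp, ref) / max(len(hyp), 1)


print(simplified_character_score("this is in fact an estimate",
                                 "this is actually an estimate"))
```

Because the word-level shift step is omitted, this sketch will generally not reproduce the exact scores returned by the metric itself; it only illustrates the normalization.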
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("character")
+ launch_gradio_widget(module)
character.py ADDED
@@ -0,0 +1,169 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """CharacTER metric, a character-based TER variant, for machine translation."""
+ import math
+ from statistics import mean, median
+ from typing import Iterable, List, Union
+
+ import cer
+ import datasets
+ from cer import calculate_cer
+ from datasets import Sequence, Value
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{wang-etal-2016-character,
+     title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+     author = "Wang, Weiyue and
+       Peter, Jan-Thorsten and
+       Rosendahl, Hendrik and
+       Ney, Hermann",
+     booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+     month = aug,
+     year = "2016",
+     address = "Berlin, Germany",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/W16-2342",
+     doi = "10.18653/v1/W16-2342",
+     pages = "505--510",
+ }
+ """
+
+ _DESCRIPTION = """\
+ CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is
+ defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the
+ reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character level edit
+ distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis
+ word is considered to match a reference word and could be shifted, if the edit distance between them is below a
+ threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
+ character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+ normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
+ TER."""
+
+ _KWARGS_DESCRIPTION = """
+ Calculates how good the predictions are in terms of the CharacTER metric given some references.
+ Args:
+     predictions: a list of predictions to score. Each prediction should be a string with
+         tokens separated by spaces.
+     references: a list of references for each prediction. You can also pass multiple references for each prediction,
+         so a list and in that list a sublist for each prediction for its related references. When multiple references are
+         given, the lowest (best) score is returned for that prediction-references pair.
+         Each reference should be a string with tokens separated by spaces.
+     aggregate: one of "mean", "sum", "median" to indicate how the scores of individual sentences should be
+         aggregated
+     return_all_scores: a boolean, indicating whether in addition to the aggregated score, also all individual
+         scores should be returned
+ Returns:
+     cer_score: an aggregated score across all the items, based on 'aggregate'
+     cer_scores: (optionally, if 'return_all_scores' evaluates to True) a list of all scores, one per ref/hyp pair
+ Examples:
+     >>> character_mt = evaluate.load("character")
+     >>> preds = ["this week the saudis denied information published in the new york times"]
+     >>> refs = ["saudi arabia denied this week information published in the american new york times"]
+     >>> character_mt.compute(references=refs, predictions=preds)
+     {'cer_score': 0.36619718309859156}
+     >>> preds = ["this week the saudis denied information published in the new york times",
+     ...          "this is in fact an estimate"]
+     >>> refs = ["saudi arabia denied this week information published in the american new york times",
+     ...         "this is actually an estimate"]
+     >>> character_mt.compute(references=refs, predictions=preds, aggregate="sum", return_all_scores=True)
+     {'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
+     >>> preds = ["this week the saudis denied information published in the new york times"]
+     >>> refs = [["saudi arabia denied this week information published in the american new york times",
+     ...          "the saudis have denied new information published in the ny times"]]
+     >>> character_mt.compute(references=refs, predictions=preds, aggregate="median", return_all_scores=True)
+     {'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Character(evaluate.Metric):
+     """CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER)."""
+
+     def _info(self):
+         return evaluate.MetricInfo(
+             module_type="metric",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=[
+                 datasets.Features(
+                     {"predictions": Value("string", id="prediction"), "references": Value("string", id="reference")}
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": Value("string", id="prediction"),
+                         "references": Sequence(Value("string", id="reference"), id="references"),
+                     }
+                 ),
+             ],
+             homepage="https://github.com/bramvanroy/CharacTER",
+             codebase_urls=["https://github.com/bramvanroy/CharacTER", "https://github.com/rwth-i6/CharacTER"],
+         )
+
+     def _compute(
+         self,
+         predictions: Iterable[str],
+         references: Union[Iterable[str], Iterable[Iterable[str]]],
+         aggregate: str = "mean",
+         return_all_scores: bool = False,
+     ):
+         if aggregate not in ("mean", "sum", "median"):
+             raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")
+
+         predictions = [p.split() for p in predictions]
+         # Predictions and references have the same internal types (both lists of strings),
+         # so only one reference per prediction
+         if isinstance(references[0], str):
+             references = [r.split() for r in references]
+
+             scores_d = cer.calculate_cer_corpus(predictions, references)
+             cer_scores: List[float] = scores_d["cer_scores"]
+
+             if aggregate == "sum":
+                 score = sum(cer_scores)
+             elif aggregate == "mean":
+                 score = scores_d["mean"]
+             else:
+                 score = scores_d["median"]
+         else:
+             # In the case of multiple references, we just find the "best score",
+             # i.e., the reference that the prediction is closest to, i.e. the lowest characTER score
+             references = [[r.split() for r in refs] for refs in references]
+
+             cer_scores = []
+             for pred, refs in zip(predictions, references):
+                 min_score = math.inf
+                 for ref in refs:
+                     score = calculate_cer(pred, ref)
+
+                     if score < min_score:
+                         min_score = score
+
+                 cer_scores.append(min_score)
+
+             if aggregate == "sum":
+                 score = sum(cer_scores)
+             elif aggregate == "mean":
+                 score = mean(cer_scores)
+             else:
+                 score = median(cer_scores)
+
+         # Return scores
+         if return_all_scores:
+             return {"cer_score": score, "cer_scores": cer_scores}
+         else:
+             return {"cer_score": score}
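
The `_KWARGS_DESCRIPTION` docstring above also documents the optional `aggregate` and `return_all_scores` arguments, as well as multiple references per prediction, none of which appear in the metric card's usage example. A minimal sketch of that fuller call, assuming the module is loaded under the name `character` as in app.py:

```python
import evaluate

character = evaluate.load("character")  # same module name as in app.py

preds = [
    "this week the saudis denied information published in the new york times",
    "this is in fact an estimate",
]
refs = [
    "saudi arabia denied this week information published in the american new york times",
    "this is actually an estimate",
]

# 'aggregate' chooses how per-sentence scores are combined ("mean", "sum" or "median");
# 'return_all_scores' additionally returns the individual scores (see the docstring above).
results = character.compute(
    references=refs,
    predictions=preds,
    aggregate="median",
    return_all_scores=True,
)
print(results["cer_score"], results["cer_scores"])
```

Per the docstring, passing a list of reference lists (one sublist per prediction) instead scores each prediction against its closest reference.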
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ git+https://github.com/huggingface/evaluate@544f1e8a5f30663d59ed6ba94b2b7380e8b4c309
+ cer>=1.2.0
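
The requirements pin evaluate to the exact commit named in the commit message (544f1e8a) plus the `cer` backend that character.py imports. A hedged sketch for reproducing that environment from Python and smoke-testing the module (assumes pip is usable from the running interpreter):

```python
import subprocess
import sys

# Install the same pins as requirements.txt above into the current environment.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "git+https://github.com/huggingface/evaluate@544f1e8a5f30663d59ed6ba94b2b7380e8b4c309",
    "cer>=1.2.0",
])

import evaluate  # imported after the install on purpose

module = evaluate.load("character")  # same call as app.py
print(module.compute(predictions=["an estimate"], references=["an estimator"]))
```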