lvwerra HF staff committed on
Commit 53fc5a5
1 Parent(s): eaeed82

Update Space (evaluate main: 828c6327)

Files changed (5)
  1. README.md +131 -5
  2. app.py +6 -0
  3. cer.py +159 -0
  4. requirements.txt +4 -0
  5. test_cer.py +128 -0
README.md CHANGED
@@ -1,12 +1,138 @@
  ---
- title: Cer
- emoji: 💻
- colorFrom: purple
- colorTo: blue
+ title: CER
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for CER
+
+ ## Metric description
+
+ Character error rate (CER) is a common metric of the performance of an automatic speech recognition (ASR) system. CER is similar to Word Error Rate (WER), but operates on characters instead of words.
+
+ Character error rate can be computed as:
+
+ `CER = (S + D + I) / N = (S + D + I) / (S + D + C)`
+
+ where
+
+ `S` is the number of substitutions,
+
+ `D` is the number of deletions,
+
+ `I` is the number of insertions,
+
+ `C` is the number of correct characters,
+
+ `N` is the number of characters in the reference (`N=S+D+C`).
+
+
+ ## How to use
+
+ The metric takes two inputs: references (a list of references for each speech input) and predictions (a list of transcriptions to score).
+
+ ```python
+ from evaluate import load
+ cer = load("cer")
+ cer_score = cer.compute(predictions=predictions, references=references)
+ ```
+ ## Output values
+
+ This metric outputs a float representing the character error rate.
+
+ ```python
+ print(cer_score)
+ 0.34146341463414637
+ ```
+
+ The **lower** the CER value, the **better** the performance of the ASR system, with a CER of 0 being a perfect score.
+
+ However, CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions (see [Examples](#examples) below).
+
+ ### Values from popular papers
+
+ This metric is highly dependent on the content and quality of the dataset, so the same model can yield very different values on different datasets.
+
+ Multilingual datasets such as [Common Voice](https://huggingface.co/datasets/common_voice) report different CERs depending on the language, ranging from 0.02-0.03 for languages such as French and Italian, to 0.05-0.07 for English (see [here](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) for more values).
+
+ ## Examples
+
+ Perfect match between prediction and reference:
+
+ ```python
+ from evaluate import load
+ cer = load("cer")
+ predictions = ["hello world", "good night moon"]
+ references = ["hello world", "good night moon"]
+ cer_score = cer.compute(predictions=predictions, references=references)
+ print(cer_score)
+ 0.0
+ ```
+
+ Partial match between prediction and reference:
+
+ ```python
+ from evaluate import load
+ cer = load("cer")
+ predictions = ["this is the prediction", "there is an other sample"]
+ references = ["this is the reference", "there is another one"]
+ cer_score = cer.compute(predictions=predictions, references=references)
+ print(cer_score)
+ 0.34146341463414637
+ ```
+
+ No match between prediction and reference:
+
+ ```python
+ from evaluate import load
+ cer = load("cer")
+ predictions = ["hello"]
+ references = ["gracias"]
+ cer_score = cer.compute(predictions=predictions, references=references)
+ print(cer_score)
+ 1.0
+ ```
+
+ CER above 1 due to insertion errors:
+
+ ```python
+ from evaluate import load
+ cer = load("cer")
+ predictions = ["hello world"]
+ references = ["hello"]
+ cer_score = cer.compute(predictions=predictions, references=references)
+ print(cer_score)
+ 1.2
+ ```
+
+ ## Limitations and bias
+
+ CER is useful for comparing different models for tasks such as automatic speech recognition (ASR) and optical character recognition (OCR), especially for multilingual datasets where WER is not suitable given the diversity of languages. However, CER provides no details on the nature of transcription errors, so further work is required to identify the main source(s) of error and to focus any research effort.
+
+ Also, in some cases, instead of reporting the raw CER, a normalized CER is reported where the number of mistakes is divided by the sum of the number of edit operations (`I` + `S` + `D`) and `C` (the number of correct characters), which results in CER values that fall within the range of 0–100%.
+
+
+ ## Citation
+
+
+ ```bibtex
+ @inproceedings{morris2004,
+ author = {Morris, Andrew and Maier, Viktoria and Green, Phil},
+ year = {2004},
+ month = {01},
+ pages = {},
+ title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
+ }
+ ```
+
+ ## Further References
+
+ - [Hugging Face Tasks -- Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition)
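The two formulas in the metric card (raw CER and the normalized variant mentioned under "Limitations and bias") can be checked by hand from the edit-operation counts. The snippet below is a minimal sketch, not part of the commit: the helper name `cer_from_counts` is hypothetical, and the counts are those of the card's "hello world" vs. "hello" example (six inserted characters, five correct ones).

```python
def cer_from_counts(substitutions, deletions, insertions, hits):
    """Return (raw CER, normalized CER) from edit-operation counts.

    Hypothetical helper for illustration; `hits` is C in the metric card.
    """
    errors = substitutions + deletions + insertions      # S + D + I
    reference_length = substitutions + deletions + hits  # N = S + D + C
    if reference_length == 0:
        raise ValueError("the reference must contain at least one character")
    raw = errors / reference_length
    normalized = errors / (errors + hits)                # always within [0, 1]
    return raw, normalized


# prediction "hello world" against reference "hello": S=0, D=0, I=6, C=5
raw, normalized = cer_from_counts(substitutions=0, deletions=0, insertions=6, hits=5)
print(raw)         # 1.2
print(normalized)  # 6 / 11, roughly 0.545
```

The raw value exceeds 1 because insertions add to the numerator without enlarging the reference length, whereas the normalized value stays bounded because the insertions also appear in its denominator.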
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("cer")
+ launch_gradio_widget(module)
cer.py ADDED
@@ -0,0 +1,159 @@
+ # Copyright 2021 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ Character Error Rate (CER) metric. """
+
+ from typing import List
+
+ import datasets
+ import jiwer
+ import jiwer.transforms as tr
+ from datasets.config import PY_VERSION
+ from packaging import version
+
+ import evaluate
+
+
+ if PY_VERSION < version.parse("3.8"):
+     import importlib_metadata
+ else:
+     import importlib.metadata as importlib_metadata
+
+
+ SENTENCE_DELIMITER = ""
+
+
+ if version.parse(importlib_metadata.version("jiwer")) < version.parse("2.3.0"):
+
+     class SentencesToListOfCharacters(tr.AbstractTransform):
+         def __init__(self, sentence_delimiter: str = " "):
+             self.sentence_delimiter = sentence_delimiter
+
+         def process_string(self, s: str):
+             return list(s)
+
+         def process_list(self, inp: List[str]):
+             chars = []
+             for sent_idx, sentence in enumerate(inp):
+                 chars.extend(self.process_string(sentence))
+                 if self.sentence_delimiter is not None and self.sentence_delimiter != "" and sent_idx < len(inp) - 1:
+                     chars.append(self.sentence_delimiter)
+             return chars
+
+     cer_transform = tr.Compose(
+         [tr.RemoveMultipleSpaces(), tr.Strip(), SentencesToListOfCharacters(SENTENCE_DELIMITER)]
+     )
+ else:
+     cer_transform = tr.Compose(
+         [
+             tr.RemoveMultipleSpaces(),
+             tr.Strip(),
+             tr.ReduceToSingleSentence(SENTENCE_DELIMITER),
+             tr.ReduceToListOfListOfChars(),
+         ]
+     )
+
+
+ _CITATION = """\
+ @inproceedings{inproceedings,
+     author = {Morris, Andrew and Maier, Viktoria and Green, Phil},
+     year = {2004},
+     month = {01},
+     pages = {},
+     title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
+ }
+ """
+
+ _DESCRIPTION = """\
+ Character error rate (CER) is a common metric of the performance of an automatic speech recognition system.
+
+ CER is similar to Word Error Rate (WER), but operates on characters instead of words. Please refer to the WER docs for further information.
+
+ Character error rate can be computed as:
+
+ CER = (S + D + I) / N = (S + D + I) / (S + D + C)
+
+ where
+
+ S is the number of substitutions,
+ D is the number of deletions,
+ I is the number of insertions,
+ C is the number of correct characters,
+ N is the number of characters in the reference (N=S+D+C).
+
+ CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated with the percentage of characters that were incorrectly predicted. The lower the value, the better the
+ performance of the ASR system, with a CER of 0 being a perfect score.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes CER score of transcribed segments against references.
+ Args:
+     references: list of references for each speech input.
+     predictions: list of transcriptions to score.
+     concatenate_texts: Whether or not to concatenate sentences before evaluation, set to True for more accurate result.
+ Returns:
+     (float): the character error rate
+
+ Examples:
+
+     >>> predictions = ["this is the prediction", "there is an other sample"]
+     >>> references = ["this is the reference", "there is another one"]
+     >>> cer = evaluate.load("cer")
+     >>> cer_score = cer.compute(predictions=predictions, references=references)
+     >>> print(cer_score)
+     0.34146341463414637
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class CER(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/jitsi/jiwer/"],
+             reference_urls=[
+                 "https://en.wikipedia.org/wiki/Word_error_rate",
+                 "https://sites.google.com/site/textdigitisation/qualitymeasures/computingerrorrates",
+             ],
+         )
+
+     def _compute(self, predictions, references, concatenate_texts=False):
+         if concatenate_texts:
+             return jiwer.compute_measures(
+                 references,
+                 predictions,
+                 truth_transform=cer_transform,
+                 hypothesis_transform=cer_transform,
+             )["wer"]
+
+         incorrect = 0
+         total = 0
+         for prediction, reference in zip(predictions, references):
+             measures = jiwer.compute_measures(
+                 reference,
+                 prediction,
+                 truth_transform=cer_transform,
+                 hypothesis_transform=cer_transform,
+             )
+             incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
+             total += measures["substitutions"] + measures["deletions"] + measures["hits"]
+
+         return incorrect / total
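When `concatenate_texts` is false, `_compute` pools the error counts and the reference lengths over all pairs before dividing, which is not the same as averaging per-sentence rates. The following is a rough, dependency-free sketch of that difference. It does not use jiwer: `normalize`, `edit_distance`, `pooled_cer` and `mean_cer` are hypothetical names, the whitespace handling only approximates the module's transform, and a plain Levenshtein distance stands in for `S + D + I`.

```python
import re


def normalize(text):
    """Collapse whitespace runs and strip the ends (approximation of the module's transform)."""
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance, i.e. the minimal S + D + I."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pooled_cer(predictions, references):
    """Sum errors and reference lengths over all pairs, then divide once."""
    errors = total = 0
    for pred, ref in zip(predictions, references):
        ref, pred = normalize(ref), normalize(pred)
        errors += edit_distance(ref, pred)
        total += len(ref)  # equals S + D + hits for any alignment
    return errors / total


def mean_cer(predictions, references):
    """Average of per-sentence rates, shown only for contrast."""
    rates = [
        edit_distance(normalize(r), normalize(p)) / len(normalize(r))
        for p, r in zip(predictions, references)
    ]
    return sum(rates) / len(rates)


refs = ["werewolf", "I am your father", "doge"]
preds = ["werxwolf", "I am your father", "doge"]
pooled = pooled_cer(preds, refs)  # 1 error over 28 reference characters
mean = mean_cer(preds, refs)      # (1/8 + 0 + 0) / 3
print(pooled, mean)
```

Pooling weights every reference character equally, so a single error in a short sentence does not dominate the score the way it would in a per-sentence average.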
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: switch from the GitHub commit pin to a released version
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ jiwer
test_cer.py ADDED
@@ -0,0 +1,128 @@
+ # Copyright 2021 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import unittest
+
+ from cer import CER
+
+
+ cer = CER()
+
+
+ class TestCER(unittest.TestCase):
+     def test_cer_case_sensitive(self):
+         refs = ["White House"]
+         preds = ["white house"]
+         # S = 2, D = 0, I = 0, N = 11, CER = 2 / 11
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.1818181818) < 1e-6)
+
+     def test_cer_whitespace(self):
+         refs = ["were wolf"]
+         preds = ["werewolf"]
+         # S = 0, D = 1, I = 0, N = 9, CER = 1 / 9
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.1111111) < 1e-6)
+
+         refs = ["werewolf"]
+         preds = ["weae wolf"]
+         # S = 1, D = 0, I = 1, N = 8, CER = 0.25
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
+
+         # consecutive whitespaces case 1
+         refs = ["were wolf"]
+         preds = ["were  wolf"]
+         # S = 0, D = 0, I = 0, N = 9, CER = 0
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.0) < 1e-6)
+
+         # consecutive whitespaces case 2
+         refs = ["were wolf"]
+         preds = ["were   wolf"]
+         # S = 0, D = 0, I = 0, N = 9, CER = 0
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.0) < 1e-6)
+
+     def test_cer_sub(self):
+         refs = ["werewolf"]
+         preds = ["weaewolf"]
+         # S = 1, D = 0, I = 0, N = 8, CER = 0.125
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
+
+     def test_cer_del(self):
+         refs = ["werewolf"]
+         preds = ["wereolf"]
+         # S = 0, D = 1, I = 0, N = 8, CER = 0.125
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
+
+     def test_cer_insert(self):
+         refs = ["werewolf"]
+         preds = ["wereawolf"]
+         # S = 0, D = 0, I = 1, N = 8, CER = 0.125
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
+
+     def test_cer_equal(self):
+         refs = ["werewolf"]
+         char_error_rate = cer.compute(predictions=refs, references=refs)
+         self.assertEqual(char_error_rate, 0.0)
+
+     def test_cer_list_of_seqs(self):
+         refs = ["werewolf", "I am your father"]
+         char_error_rate = cer.compute(predictions=refs, references=refs)
+         self.assertEqual(char_error_rate, 0.0)
+
+         refs = ["werewolf", "I am your father", "doge"]
+         preds = ["werxwolf", "I am your father", "doge"]
+         # S = 1, D = 0, I = 0, N = 28, CER = 1 / 28
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.03571428) < 1e-6)
+
+     def test_correlated_sentences(self):
+         refs = ["My hovercraft", "is full of eels"]
+         preds = ["My hovercraft is full", " of eels"]
+         # S = 0, D = 1, I = 1, N = 28, CER = 2 / 28
+         # the whitespace at the front of " of eels" is stripped during preprocessing,
+         # so one whitespace insertion and one whitespace deletion are needed
+         char_error_rate = cer.compute(predictions=preds, references=refs, concatenate_texts=True)
+         self.assertTrue(abs(char_error_rate - 0.071428) < 1e-6)
+
+     def test_cer_unicode(self):
+         refs = ["我能吞下玻璃而不伤身体"]
+         preds = [" 能吞虾玻璃而 不霜身体啦"]
+         # S = 2, D = 1, I = 2, N = 11, CER = 5 / 11
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.4545454545) < 1e-6)
+
+         refs = ["我能吞下玻璃", "而不伤身体"]
+         preds = ["我 能 吞 下 玻 璃", "而不伤身体"]
+         # S = 0, D = 0, I = 5, N = 11, CER = 5 / 11
+         char_error_rate = cer.compute(predictions=preds, references=refs)
+         self.assertTrue(abs(char_error_rate - 0.454545454545) < 1e-6)
+
+         refs = ["我能吞下玻璃而不伤身体"]
+         char_error_rate = cer.compute(predictions=refs, references=refs)
+         self.assertEqual(char_error_rate, 0.0)
+
+     def test_cer_empty(self):
+         refs = [""]
+         preds = ["Hypothesis"]
+         with self.assertRaises(ValueError):
+             cer.compute(predictions=preds, references=refs)
+
+
+ if __name__ == "__main__":
+     unittest.main()
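The tolerance checks in this test file are written as `assertTrue(abs(value - expected) < 1e-6)`, which reports only `False is not true` on failure. One possible alternative, sketched below under stated assumptions, is `assertAlmostEqual` with `delta` combined with `subTest`, so a failing case prints both values. The sketch is self-contained and does not import the `cer` module: `char_error_rate` is a hypothetical stand-in built on a plain Levenshtein distance, and the three cases are taken from the tests and README examples in this commit.

```python
import unittest


def char_error_rate(reference, prediction):
    """Hypothetical stand-in: Levenshtein distance divided by the reference length."""
    prev = list(range(len(prediction) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, p in enumerate(prediction, start=1):
            cost = 0 if r == p else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


class TestCERSketch(unittest.TestCase):
    CASES = [
        ("White House", "white house", 2 / 11),  # two case substitutions
        ("werewolf", "weaewolf", 1 / 8),         # one substitution
        ("hello", "hello world", 6 / 5),         # six insertions, CER above 1
    ]

    def test_expected_rates(self):
        for reference, prediction, expected in self.CASES:
            # subTest labels each failing case with its inputs
            with self.subTest(reference=reference, prediction=prediction):
                self.assertAlmostEqual(char_error_rate(reference, prediction), expected, delta=1e-6)


# Run the suite explicitly so the snippet also works when pasted into a session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCERSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```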