File size: 5,159 Bytes
eaeed82
53fc5a5
b5095e1
53fc5a5
 
eaeed82
49b38c8
eaeed82
 
53fc5a5
b5095e1
 
7ceaf30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2e8c299
b5095e1
53fc5a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: CER
emoji: 🤗 
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  Character error rate (CER) is a common metric of the performance of an automatic speech recognition system.
  
  CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information.
  
  Character error rate can be computed as:
  
  CER = (S + D + I) / N = (S + D + I) / (S + D + C)
  
  where
  
  S is the number of substitutions,
  D is the number of deletions,
  I is the number of insertions,
  C is the number of correct characters,
  N is the number of characters in the reference (N=S+D+C).
  
  CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the
  performance of the ASR system with a CER of 0 being a perfect score.
---

# Metric Card for CER

## Metric description

Character error rate (CER) is a common metric of the performance of an automatic speech recognition (ASR) system. CER is similar to Word Error Rate (WER), but operates on character instead of word. 

Character error rate can be computed as: 

`CER = (S + D + I) / N = (S + D + I) / (S + D + C)`

where

`S` is the number of substitutions, 

`D` is the number of deletions, 

`I` is the number of insertions, 

`C` is the number of correct characters, 

`N` is the number of characters in the reference (`N=S+D+C`). 


## How to use 

The metric takes two inputs: references (a list of references for each speech input) and predictions (a list of transcriptions to score).

```python
from evaluate import load
cer = load("cer")
cer_score = cer.compute(predictions=predictions, references=references)
```
## Output values

This metric outputs a float representing the character error rate.

```
print(cer_score)
0.34146341463414637
```

The **lower** the CER value, the **better** the performance of the ASR system, with a CER of 0 being a perfect score. 

However, CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions (see [Examples](#Examples) below).

### Values from popular papers

This metric is highly dependent on the content and quality of the dataset, and therefore users can expect very different values for the same model but on different datasets.

Multilingual datasets such as [Common Voice](https://huggingface.co/datasets/common_voice) report different CERs depending on the language, ranging from 0.02-0.03 for languages such as French and Italian, to 0.05-0.07 for English (see [here](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) for more values).

## Examples 

Perfect match between prediction and reference:

```python
from evaluate import load
cer = load("cer")
predictions = ["hello world", "good night moon"]
references = ["hello world", "good night moon"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
0.0
```

Partial match between prediction and reference:

```python
from evaluate import load
cer = load("cer")
predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
0.34146341463414637
```

No match between prediction and reference:

```python
from evaluate import load
cer = load("cer")
predictions = ["hello"]
references = ["gracias"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
1.0
```

CER above 1 due to insertion errors:

```python
from evaluate import load
cer = load("cer")
predictions = ["hello world"]
references = ["hello"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
1.2
```

## Limitations and bias

CER is useful for comparing different models for tasks such as automatic speech recognition (ASR) and optic character recognition (OCR), especially for multilingual datasets where WER is not suitable given the diversity of languages. However, CER provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort.

Also, in some cases, instead of reporting the raw CER, a normalized CER is reported where the number of mistakes is divided by the sum of the number of edit operations (`I` + `S` + `D`) and `C` (the number of correct characters), which results in CER values that fall within the range of 0–100%.


## Citation


```bibtex
@inproceedings{morris2004,
author = {Morris, Andrew and Maier, Viktoria and Green, Phil},
year = {2004},
month = {01},
pages = {},
title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
}
```

## Further References 

- [Hugging Face Tasks -- Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition)