from pathlib import Path

# Directories where model evaluation requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")

##########################
# Text definitions       #
##########################

banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'

TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> πŸ€— Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"

INTRODUCTION_TEXT = "πŸ“ The πŸ€— Open ASR Leaderboard ranks and evaluates speech recognition models \
    on the Hugging Face Hub. \
    \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. Check the πŸ“ˆ Metrics tab to understand how the models are evaluated. \
    \nIf you want results for a model that is not listed here, you can submit a request for it to be included βœ‰οΈβœ¨. \
    \nThe leaderboard currently focuses on English speech recognition, and will be expanded to multilingual evaluation in later versions."

CITATION_TEXT = """@misc{open-asr-leaderboard,
	title        = {Open Automatic Speech Recognition Leaderboard},
	author       = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and others},
	year         = 2023,
	publisher    = {Hugging Face},
	howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}"
}
"""

METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.

## Metrics

Models are evaluated jointly using the Word Error Rate (WER) and Inverse Real Time Factor (RTFx) metrics. The WER metric
is used to assess the accuracy of a system, and the RTFx the inference speed. Models are ranked in the leaderboard based 
on their WER, lowest to highest.

Crucially, the WER and RTFx values are computed for the same inference run using a single script, as sketched below. The implication of this is two-fold:
1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows the proposer to trade off lower WER for higher RTFx should they wish.
2. The WER and RTFx values are averaged over all audio samples in the benchmark (on the order of thousands of samples).
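
Conceptually, a single benchmarking run looks something like the sketch below. This is an illustrative outline only: `transcribe_fn`, `wer_fn` and the sample format are placeholders rather than the leaderboard's actual evaluation code (see the repository linked just below for that).

```python
import time

def run_benchmark(transcribe_fn, wer_fn, samples):
    # transcribe_fn and wer_fn are placeholders for the model inference and WER-scoring
    # functions; samples is an iterable of (audio, duration_seconds, reference) tuples.
    predictions, references = [], []
    audio_seconds, compute_seconds = 0.0, 0.0

    for audio, duration_seconds, reference in samples:
        start = time.perf_counter()
        predictions.append(transcribe_fn(audio))  # one timed inference call
        compute_seconds += time.perf_counter() - start
        audio_seconds += duration_seconds
        references.append(reference)

    # Both metrics come from the same run, so they are directly coupled.
    return wer_fn(references, predictions), audio_seconds / compute_seconds
```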

For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model).

### Word Error Rate (WER)

Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It counts the word-level errors 
(substitutions, insertions, and deletions) in the system's output relative to the reference (correct) transcript, expressed as a fraction of the number of reference words. **A lower WER value indicates higher accuracy**.

Take the following example:

| Reference:  | the | cat | sat     | on  | the | mat |
|-------------|-----|-----|---------|-----|-----|-----|
| Prediction: | the | cat | **sit** | on  | the |     |
| Label:      | βœ…   | βœ…   | S       | βœ…   | βœ…   | D   |

Here, we have:
* 1 substitution ("sit" instead of "sat")
* 0 insertions
* 1 deletion ("mat" is missing)

This gives 2 errors in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 6:

```
WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
```

The result is a WER of 0.33, or 33%. For a fair comparison, we calculate the **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
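
As an illustration only, the calculation above can be reproduced with a few lines of plain Python. This is a textbook word-level edit distance, not the leaderboard's evaluation script (which lives in the repository linked above):

```python
def normalise(text):
    # Strip punctuation and casing, mirroring the normalised WER described above.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def word_error_rate(reference, prediction):
    ref, hyp = normalise(reference), normalise(prediction)
    # Dynamic-programming edit distance over words: substitutions + insertions + deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on the"))  # 0.333...
```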

### Inverse Real Time Factor (RTFx)

Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a 
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
``` 

Therefore, an RTFx of 1 means a system processes speech as fast as it is spoken, while an RTFx of 2 means it takes half that time. 
Thus, **a higher RTFx value indicates lower latency**.
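
As a toy example (the numbers are illustrative only), the formula translates directly into code:

```python
def rtfx(audio_seconds, compute_seconds):
    # Inverse real time factor: seconds of audio transcribed per second of compute.
    return audio_seconds / compute_seconds

print(rtfx(3600.0, 120.0))  # 30.0: one hour of audio transcribed in two minutes
```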

## How to reproduce our results

The ASR Leaderboard is an ongoing effort to benchmark open-source and open-access speech recognition models where possible. 
Along with the Leaderboard, we're open-sourcing the codebase used for running these evaluations.
For more details, head over to our repo at: https://github.com/huggingface/open_asr_leaderboard 

P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! β™₯️

## Benchmark datasets

Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the 
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.

ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad 
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains, 
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how 
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.

The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.
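
As a sketch (the dataset names and values below are placeholders, not leaderboard results), the macro-average is simply the unweighted mean of the per-dataset WERs:

```python
def esb_score(per_dataset_wer):
    # Macro-average: every dataset contributes equally, regardless of its size.
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)

# Placeholder values for illustration only.
example = {"librispeech": 0.04, "voxpopuli": 0.09, "tedlium": 0.05, "ami": 0.16}
print(round(esb_score(example), 4))  # 0.085
```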

| Dataset                                                                                 | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)                          | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)                         | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium)                                | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech)                    | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech)                         | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22)                     | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami)                                | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |

For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
"""

LEADERBOARD_CSS = """
#leaderboard-table th .header-content {
    white-space: nowrap;
}
"""