from pathlib import Path
# Directories where model evaluation requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")
##########################
# Text definitions #
##########################
banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'
TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"
INTRODUCTION_TEXT = "📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️) and [RTF](https://openvoice-tech.net/index.php/Real-time-factor) (⬇️) - the lower the better. Models are ranked based on their Average WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
\nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨. \
\nThe leaderboard currently focuses on English speech recognition, and will be expanded to multilingual evaluation in later versions."
CITATION_TEXT = """@misc{open-asr-leaderboard,
title = {Open Automatic Speech Recognition Leaderboard},
author = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team and SpeechBrain Team},
year = 2023,
publisher = {Hugging Face},
howpublished = "\\url{https://huggingface.co/spaces/open-asr-leaderboard/leaderboard}"
}
"""
METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
## Metrics
🎯 Word Error Rate (WER) and Real-Time Factor (RTF) are popular metrics for evaluating speech recognition models:
they measure how accurate a model's predictions are and how quickly they are returned. We explain each of them below.
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It measures the proportion of words
that the system gets wrong (substituted, deleted or inserted) relative to the reference (correct) transcript, expressed as a percentage. **A lower WER value indicates higher accuracy**.
```
Example: If the reference transcript is "I really love cats" and the ASR system outputs "I don't love dogs",
the WER would be `50%` because 2 out of 4 words are incorrect.
```
For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints. You can find the evaluation code on our [Github repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
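For illustration, here is a minimal sketch of how the example above can be scored with the 🤗 `evaluate` library. This is not the leaderboard's actual evaluation code, and the `normalise` helper is a simplified stand-in for the full normaliser used in the benchmark:

```python
import string
import evaluate

# Load the WER metric from the Hugging Face `evaluate` library
wer_metric = evaluate.load("wer")

def normalise(text):
    # Simplified stand-in for the benchmark's text normalisation:
    # lowercase and strip punctuation so formatting differences are not penalised
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

references = ["I really love cats"]
predictions = ["I don't love dogs"]

wer = wer_metric.compute(
    predictions=[normalise(p) for p in predictions],
    references=[normalise(r) for r in references],
)
print(f"WER: {wer:.2%}")  # 2 of 4 reference words differ -> 50.00%
```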
### Real Time Factor (RTF)
Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is usually expressed as a multiple of real time: an RTF of 1 means the system processes
speech as fast as it is spoken, while an RTF of 2 means it takes twice as long. Thus, **a lower RTF value indicates lower latency**.
```
Example: If it takes an ASR system 10 seconds to transcribe 10 seconds of speech, the RTF is 1.
If it takes 20 seconds to transcribe the same 10 seconds of speech, the RTF is 2.
```
For the benchmark, we report RTF averaged over a 10 minute audio sample, with 5 warm-up batches followed by 3 timed batches.
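A rough sketch of how such a measurement can be taken is shown below; the `transcribe` callable and the warm-up/timed-run counts are placeholders rather than the benchmark harness itself:

```python
import time

def measure_rtf(transcribe, audio, audio_duration_s, n_warmup=5, n_timed=3):
    # `transcribe` is any callable taking an audio array and returning text (placeholder)
    for _ in range(n_warmup):
        transcribe(audio)  # warm-up runs are not timed
    timings = []
    for _ in range(n_timed):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    # RTF = processing time / audio duration, averaged over the timed runs
    return sum(timings) / len(timings) / audio_duration_s
```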
## How to reproduce our results
The ASR Leaderboard is an ongoing effort to benchmark open source/open access speech recognition models wherever possible.
Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations.
For more details head over to our repo at: https://github.com/huggingface/open_asr_leaderboard
P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
## Benchmark datasets
Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.
The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.
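As a quick illustration with made-up WER values (not real leaderboard numbers), the macro-average is simply the unweighted mean of the per-dataset WERs:

```python
# Hypothetical per-dataset WER scores (%) for a single model
wer_per_dataset = {
    "librispeech": 3.1,
    "common_voice_9": 9.8,
    "voxpopuli": 7.4,
    "tedlium": 4.5,
}

# ESB-style score: unweighted (macro) average across the datasets
esb_score = sum(wer_per_dataset.values()) / len(wer_per_dataset)
print(f"Macro-average WER: {esb_score:.2f}%")  # -> 6.20%
```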
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
| [Common Voice 9](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0) | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech)                           | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22)                       | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
"""
CUSTOM_MESSAGE = """## Using CommonVoice to approximate average WER for open domain transcription
This space is a fork of the original [hf-audio/open_asr_leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). It aims to demonstrate how the CommonVoice Test Set provides a relatively accurate approximation of the average WER/CER (Word Error Rate/Character Error Rate) at a significantly lower computational cost.
#### Why is this useful?
This opens the way to a standardized test set for most languages, enabling us to programmatically select a reasonably effective model for any language supported by CommonVoice.
For more context, [here](https://gist.github.com/wasertech/400ca3dd61f2d6f7f4f5495afbb32ef3) is the output of my ASR server when running without any specified model to load for various languages. It tries to score the most suitable model for any given language, but since metrics are mostly self-reported, and sometimes reported in different formats, it consistently picks an inadequate model.
Columns `Model`, `RTF`, and `Average WER` were sourced from [hf-audio/open_asr_leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) using the version from September 7, 2023.
Models are sorted by the consistency of their results across test sets, i.e. in increasing order of the absolute delta between their average WER and their CommonVoice WER, as sketched below.
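A minimal sketch of that ordering, assuming a pandas DataFrame with hypothetical `Average WER` and `CommonVoice WER` columns:

```python
import pandas as pd

# Hypothetical leaderboard extract; model names, values and column names are illustrative
df = pd.DataFrame({
    "Model": ["model-a", "model-b", "model-c"],
    "Average WER": [8.1, 10.4, 12.9],
    "CommonVoice WER": [9.0, 18.2, 13.5],
})

# Consistency = absolute gap between the average WER and the CommonVoice WER
df["Delta"] = (df["Average WER"] - df["CommonVoice WER"]).abs()
df = df.sort_values("Delta")  # most consistent models first
print(df)
```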
### Results
The CommonVoice test set yields a Word Error Rate (WER) within a 20-point margin of the average WER. While not perfect, this indicates that CommonVoice can be a useful tool for quickly identifying a suitable ASR model for a wide range of languages in a programmatic manner. However, it is not sufficient as the sole criterion for choosing the most appropriate architecture; further considerations may be needed depending on the specific requirements of your ASR application.
Moreover, selecting the model with the lowest WER on CommonVoice aligns with choosing the model with the lowest average WER, which makes this approach effective for ranking the best-performing models. However, as the average WER increases, the spread of results becomes more pronounced, which can make it harder to reliably identify the worst-performing models. The size of the CommonVoice test split for a given language is a crucial factor here. This highlights the need for a nuanced approach to ASR model selection that takes dataset characteristics into account alongside the raw metrics.
Additionally, it has come to our attention that Nvidia's models, trained using NeMo with custom splits of common datasets, including Common Voice, may have had an advantage due to their familiarity with parts of the Common Voice test set. This highlights the need for greater transparency in data usage: OpenAI, for instance, does not publish the data used to train its models, which could explain their strong performance in the results. Transparency in model training and dataset usage is crucial for fair comparisons in the ASR field and for ensuring that results align with real-world scenarios.
Custom splits and potential data leakage during training can indeed lead to misleading results, making it challenging to compare architectures accurately.
To address these concerns and ensure the reliability of metrics on the leaderboard:
1. **Transparency in Training Data**: Model submissions should come with detailed information about the training data used, including whether they have seen the specific test sets used for evaluation. This transparency enables the community to assess the validity of the results.
2. **Standardized Evaluation**: Promote the use of standardized evaluation datasets and testing procedures across models. This helps prevent data leakage and ensures fair comparisons.
3. **Verification and Validation**: Implement verification processes to check the integrity of submitted models. This could include cross-validation checks to identify any potential issues with custom splits or data leakage.
4. **Community Engagement**: Encourage active participation and feedback from the ASR community. Regular discussions and collaborations can help identify and address issues related to data integrity and model evaluations.
5. **Documentation**: Models added to the leaderboard should provide comprehensive documentation, including information on dataset usage, preprocessing steps, and any custom splits employed during training.
By focusing on these aspects, we can enhance trust in the metrics and evaluations within the ASR community and ensure that the models added to the leaderboard are reliable and accurately represent their performance. It's essential for the community to work together to maintain transparency and data integrity.
"""