benderrodriguez committed
Commit 6400e0f · 1 Parent(s): b69f610

Update leaderboard and documentation.

Files changed:
- benchmark.csv (+4 -4)
- src/about.py (+89 -27)
benchmark.csv
CHANGED
@@ -1,7 +1,7 @@
-model,ivrit-ai/eval-d1,ivrit-ai/saspeech,google/fleurs/he,mozilla-foundation/common_voice_17_0/he
+model,ivrit-ai/eval-d1,ivrit-ai/saspeech,google/fleurs/he,mozilla-foundation/common_voice_17_0/he,imvladikon/hebrew_speech_kan
 ivrit-ai/whisper-large-v2-d4,0.063,0.080,0.242,0.207
 ivrit-ai/whisper-v2-d3-e3,0.069,0.086,0.256,0.214
-openai/whisper-large-v2,0.083,0.100,0.276,0.234
+openai/whisper-large-v2,0.083,0.100,0.276,0.234,0.209
 openai/whisper-large-v3,0.101,0.096,0.262,0.232
-aws-transcribe-batch-20241205,0.069,0.087,0.230,0.141
-aws-transcribe-stream-20241205,0.082,0.092,0.287,0.199
+aws-transcribe-batch-20241205,0.069,0.087,0.230,0.141,0.092
+aws-transcribe-stream-20241205,0.082,0.092,0.287,0.199,0.132
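The documentation added to src/about.py below defines the leaderboard score as the average WER across these benchmark columns. As a minimal, illustrative sketch (not code from the ivrit-ai repositories), that arithmetic over benchmark.csv could look like the following, assuming rows that have no value yet for the newly added imvladikon/hebrew_speech_kan column are averaged only over the columns they do have:

```python
import csv

# Illustrative sketch: average each model's WER over the dataset columns it
# has scores for. Skipping empty/missing cells is an assumption here, not
# necessarily what the leaderboard app itself does.
with open("benchmark.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    scores = [float(v) for k, v in row.items() if k != "model" and v not in (None, "")]
    print(f"{row['model']}: mean WER = {sum(scores) / len(scores):.3f}")
```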
src/about.py
CHANGED
@@ -21,50 +21,112 @@ NUM_FEWSHOT = 0 # Change with your few shot
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-
 """
 
-#
-LLM_BENCHMARKS_TEXT =
 ## How it works
 
 ## Reproducibility
-To
 
-
 
-
-## Some good practices before submitting a model
 
-
-
-from transformers import AutoConfig, AutoModel, AutoTokenizer
-config = AutoConfig.from_pretrained("your model name", revision=revision)
-model = AutoModel.from_pretrained("your model name", revision=revision)
-tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
 ```
-If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
 
-
-Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!
 
-
-
 
-###
-
 
-
-
 
-
-
-Make sure
-
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"

 # Your leaderboard name
+TITLE = """<h1 align="center" id="space-title">Hebrew Speech Recognition Leaderboard</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
+Welcome to the Hebrew Speech Recognition Leaderboard! This is a community-driven effort to track and compare the performance
+of various speech recognition models on Hebrew language tasks.
+
+This leaderboard is maintained by [ivrit.ai](https://ivrit.ai), a project dedicated to advancing Hebrew language AI technologies.
+You can find our work on [GitHub](https://github.com/ivrit-ai) and [Hugging Face](https://huggingface.co/ivrit-ai).
+
+## Motivation
+Hebrew presents unique challenges for speech recognition due to its rich morphology, absence of written vowels, and diverse
+dialectal variations. This leaderboard aims to:
+- Provide standardized benchmarks for Hebrew ASR evaluation
+- Track progress in Hebrew speech recognition technology
+- Foster collaboration in the Hebrew NLP community
+- Make Hebrew speech technology more accessible
+
+## Benchmarks
+The following datasets are used in our evaluation:
+
+### [ivrit-ai/eval-d1](https://huggingface.co/datasets/ivrit-ai/eval-d1)
+- **Size**: 2 hours
+- **Domain**: Manual transcription of podcasts. Typical segment length is 5 minutes.
+- **Source**: Description of source
+
+### [ivrit-ai/saspeech](https://huggingface.co/datasets/ivrit-ai/saspeech)
+- **Size**: X hours
+- **Domain**: Description
+- **Source**: Description of source
+
+### [google/fleurs/he](https://huggingface.co/datasets/google/fleurs)
+- **Size**: X hours
+- **Domain**: Description
+- **Source**: Description of source
+
+### [mozilla-foundation/common_voice_17_0/he](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
+- **Size**: X hours
+- **Domain**: Description
+- **Source**: Description of source
+
+### [imvladikon/hebrew_speech_kan](https://huggingface.co/datasets/imvladikon/hebrew_speech_kan)
+- **Size**: X hours
+- **Domain**: Description
+- **Source**: Description of source
 """
 
+# Technical details about evaluation
+LLM_BENCHMARKS_TEXT = """
 ## How it works
+Models are evaluated using Word Error Rate (WER) on each benchmark dataset. The final score is an average of WER across all benchmarks,
+with lower scores indicating better performance.
+
+Specifically, evaluation is done using the [jiwer](https://github.com/jitsi/jiwer) library.
+Source code for the evaluation can be found [here](https://github.com/ivrit-ai/asr-training/blob/master/evaluate_model.py).
 
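For readers unfamiliar with jiwer, the core of the metric reduces to the minimal sketch below (illustrative only; the evaluation script linked above may apply its own text normalization before scoring):

```python
import jiwer

reference = "שלום עולם"    # ground-truth transcript
hypothesis = "שלום לעולם"  # model output

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))
```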
 ## Reproducibility
+To evaluate your model on these benchmarks, you can use our evaluation script as follows:
 
+```bash
+./evaluate_model.py --engine <engine> --model <model> --dataset <dataset:split:column> [--name <name>] [--workers <num_workers>]
+```
 
+For example, here's how to evaluate ivrit-ai/faster-whisper-v2-d4 on the google/fleurs/he dataset:
 
+```bash
+./evaluate_model.py --engine faster-whisper --model ivrit-ai/faster-whisper-v2-d4 --name he_il --dataset google/fleurs:test:transcription --workers 1
 ```
 
+"""
 
+EVALUATION_QUEUE_TEXT = """
+## Submitting a model for evaluation
 
+### 1) Provide an inference script
+To evaluate your model, we need either:
 
+a) a simple inference script that takes audio input and returns transcribed text:
+```python
+def transcribe(audio_path: str) -> str:
+    # Your model loading and inference code here
+    return transcribed_text
+```
 
+b) an augmented version of our evaluate_model.py script that includes your model's implementation.
+
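As an illustration of option (a), such a function could be built on the Hugging Face transformers ASR pipeline, roughly as in the sketch below; the checkpoint name is only an example, so substitute the model you actually want evaluated:

```python
from transformers import pipeline

# Example checkpoint; replace with the model you want evaluated.
asr = pipeline("automatic-speech-recognition", model="ivrit-ai/whisper-large-v2-d4")

def transcribe(audio_path: str) -> str:
    # The pipeline loads and resamples the audio file; long recordings may
    # need chunking (e.g. chunk_length_s=30), depending on the model.
    return asr(audio_path)["text"]
```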
+### 2) Make sure your model is publicly accessible
+Your model should be available on the Hugging Face Hub with:
+- Public visibility
+- Clear licensing information
+- Basic model card documentation
+
+### 3) Fill out your model card
+Please include in your model card:
+- Model architecture
+- Training data description
+- Licensing information
+- Any special preprocessing requirements
+- Expected input format (sampling rate, audio format, etc.)
+
+## In case of evaluation failure
+If your model evaluation fails, please:
+1. Check that your model can be loaded and run locally
+2. Verify your inference script works with our benchmark format
+3. Ensure all dependencies are clearly specified
+4. Contact us through GitHub issues if problems persist
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"