---
language:
- sw
- en
pipeline_tag: automatic-speech-recognition
---

# Swahili-English Speech-to-Text (STT) Model

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), optimized for Swahili and English speech recognition. It was trained on the Mozilla Common Voice 17.0 dataset and achieves a substantially lower word error rate (WER) than the base model.

## Model Performance

The model achieves the following results on the evaluation set:
- **Loss**: 0.3390
- **WER Ortho**: 21.3
- **WER**: 14.7

## Usage

### Installation

First, install the required dependencies:

```bash
pip install transformers torch librosa
```

### Basic Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the model and processor
processor = AutoProcessor.from_pretrained("Jacaranda-Health/ASR-STT")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Jacaranda-Health/ASR-STT")
model.generation_config.forced_decoder_ids = None

def transcribe(filepath):
    """
    Transcribe an audio file to text.

    Args:
        filepath (str): Path to the audio file.

    Returns:
        str: Transcribed text.
    """
    # Load audio and resample to Whisper's expected 16 kHz
    audio, sr = librosa.load(filepath, sr=16000)

    # Extract input features
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

    # Generate transcription
    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])

    # Decode token ids back to text
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return transcription

# Example usage
transcription = transcribe("path/to/your/audio.wav")
print(f"Transcription: {transcription}")
```
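
To sanity-check the reported figures on your own labelled audio, a minimal WER measurement can be done with the `jiwer` package (an assumption, installed via `pip install jiwer`; any WER implementation works). Note that "WER Ortho" and "WER" above differ in how much text normalization is applied, so normalize references and hypotheses consistently with whichever figure you compare against. A sketch:

```python
# Sketch: spot-check WER with jiwer, reusing transcribe() from above.
# The file name and reference transcript are placeholders for your own data.
import jiwer

references = ["panya wengi huishi kati ya wanadamu"]   # ground-truth text
hypotheses = [transcribe("clip_0.wav").lower()]        # crude normalization

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```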

### Batch Processing

```python
def transcribe_batch(audio_files):
    """
    Transcribe multiple audio files.

    Args:
        audio_files (list): List of audio file paths.

    Returns:
        list: One result dict per file, holding either a transcription or an error.
    """
    transcriptions = []

    for filepath in audio_files:
        try:
            transcription = transcribe(filepath)
            transcriptions.append({
                'file': filepath,
                'transcription': transcription
            })
        except Exception as e:
            transcriptions.append({
                'file': filepath,
                'error': str(e)
            })

    return transcriptions

# Example usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_batch(audio_files)
```
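
For larger workloads, the `transformers` ASR pipeline is a convenient alternative: it batches inputs and transcribes audio longer than Whisper's 30-second window by chunking. A sketch; the `chunk_length_s` and `batch_size` values here are illustrative, not tuned:

```python
from transformers import pipeline

# Build an ASR pipeline around the same checkpoint
asr = pipeline(
    "automatic-speech-recognition",
    model="Jacaranda-Health/ASR-STT",
    chunk_length_s=30,  # Whisper operates on 30-second windows
)

# Transcribe several files in one call
outputs = asr(["audio1.wav", "audio2.wav", "audio3.wav"], batch_size=4)
print([o["text"] for o in outputs])
```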

## Model Comparison

The fine-tuned model shows marked improvements over the base Whisper model, particularly on Swahili. The examples below show cases where the base model fails outright while the fine-tuned model transcribes correctly:

### Example 1: Complete Language Confusion
- **Ground Truth**: "Panya wengi huishi kati ya wanadamu."
- **Base Model**: "本来我以为是个铁网来的" *(Chinese characters!)*
- **Fine-tuned Model**: "Wanyawengi huishi kati ya wanadamu." ✓ <br><br>

- **Ground Truth**: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara."
- **Base Model**: "Nguni unia nzisho kwenye kisiwa kilichopo kariwu sana nabara"
- **Fine-tuned Model**: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara." ✓ <br><br>

- **Ground Truth**: "Nchi ya maajabu."
- **Base Model**: "Um dia mais, diabo!" *(Portuguese/Spanish)*
- **Fine-tuned Model**: "Nchi ya maajabu." ✓

### Example 2: Arabic Script Mix
- **Ground Truth**: "Alama yake ni µm."
- **Base Model**: "الله معاكي لأم" *(Arabic script)*
- **Fine-tuned Model**: "Alama yake ni µm." ✓

### Example 3: English Instead of Swahili
- **Ground Truth**: "Ni msimamizi wa mtandao na wa wanafunzi."
- **Base Model**: "You don't see no music on Tyndale? No, I don't see no music on Tyndale."
- **Fine-tuned Model**: "Ni msimamizi wa mtandao na wa wanafunzi." ✓

## Key Improvements

The fine-tuned model demonstrates superior performance in:

- **Swahili Grammar**: Better handling of Swahili sentence structure and grammar
- **Word Recognition**: More accurate recognition of Swahili vocabulary
- **Context Understanding**: Improved contextual understanding across different domains
- **Pronunciation Variants**: Better handling of different Swahili pronunciation patterns
- **Mixed Language**: Enhanced performance on code-switched Swahili-English content

## Training Visualizations

The following charts illustrate the model's training progress and performance improvements:

### Word Error Rate (WER) Progress
![WER Progress](./Assets/wer.png)

The WER curve shows steady improvement in transcription accuracy throughout training: from roughly 21.6% WER at step 500 down to a best of 14.7% WER by step 8000, indicating consistent learning and convergence.

### Learning Rate Schedule
![Learning Rate](./Assets/lr.png)

The learning rate follows a cosine annealing schedule: it warms up to 1e-05 over the first 50 steps, then decays for the remainder of the 8,000-step run. This keeps early training stable while still allowing the model to fine-tune effectively.
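
For reference, the same warmup-plus-cosine shape can be reproduced with the scheduler helper that ships with `transformers`. A sketch, reusing the `model` loaded in the usage example and the hyperparameters listed below:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Optimizer settings match the hyperparameters section below
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-5, betas=(0.9, 0.999), eps=1e-8
)

# Linear warmup for 50 steps, then cosine decay to step 8000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,      # lr_scheduler_warmup_steps
    num_training_steps=8000,  # training_steps
)
```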

## Training Details

### Training Procedure

The model was fine-tuned with the following setup:
- **Base Model**: OpenAI Whisper Medium
- **Dataset**: Mozilla Common Voice 17.0 (Swahili and English)
- **Training Steps**: 8,000
- **Learning Rate**: 1e-05 with a cosine scheduler
- **Batch Size**: 16 (train and eval)

### Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- training_steps: 8000
- mixed_precision_training: Native AMP
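
For readers who want to replicate this setup, the hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as sketched below; the original training script is not included in this card, and the `output_dir` value is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: dataset loading, collation, and the Trainer itself are omitted.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-sw-en",  # placeholder name
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_steps=50,
    max_steps=8000,
    fp16=True,  # Native AMP mixed precision
)
```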

### Training Results

| Training Loss | Epoch | Step | Validation Loss | WER Ortho | WER |
|:-------------:|:------:|:----:|:---------------:|:---------:|:-------:|
| 0.4135 | 0.6180 | 500 | 0.4069 | 29.9115 | 21.6319 |
| 0.2036 | 1.2361 | 1000 | 0.3584 | 25.8738 | 18.3552 |
| ... | ... | ... | ... | ... | ... |
| 0.0006 | 9.2707 | 7500 | 0.4297 | 21.3378 | 14.7059 |
| 0.0006 | 9.8888 | 8000 | 0.4300 | 21.3276 | 14.7093 |

## Supported Languages

- **Primary**: Swahili (sw)
- **Secondary**: English (en)
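
Whisper checkpoints detect the spoken language automatically by default; for strongly accented or code-switched audio it can help to pin the language at generation time. A sketch, reusing `model`, `processor`, and `librosa` from the usage example (the clip name is a placeholder):

```python
# Pin decoding to Swahili instead of relying on language auto-detection
audio, sr = librosa.load("swahili_clip.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

generated_ids = model.generate(
    inputs["input_features"],
    language="sw",      # or "en" for English audio
    task="transcribe",
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```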

## Out-of-Scope Use

This Speech-to-Text (ASR) model is intended for research, social good, and internal use only. For commercial use and distribution, organizations and individuals are encouraged to contact **Jacaranda Health**. To support ethical and responsible use, the guidelines below group activities and practices into three areas: prohibited actions, high-risk activities, and deceptive practices. Understanding and adhering to them helps keep the model's use safe and trustworthy.

### 1. Prohibited Actions
* **Illegal Activities:** Do not use the model to transcribe content that promotes violence, child exploitation, human trafficking, or other crimes.
* **Harassment and Discrimination:** No transcription activities that facilitate bullying, threats, or discriminatory practices.
* **Unauthorized Surveillance:** No monitoring or recording of individuals without proper consent.
* **Data Misuse:** Handle audio data and transcriptions with proper consent and privacy protections.
* **Rights Violations:** Respect third-party intellectual property and privacy rights in audio content.
* **Malicious Content Creation:** Do not transcribe content intended for harmful software or malicious purposes.

### 2. High-Risk Activities
* **Sensitive Industries:** Exercise extreme caution when using the model in military, nuclear, or intelligence domains.
* **Legal Proceedings:** Do not rely on transcriptions as sole evidence in legal or judicial processes without proper validation.
* **Critical Systems:** No deployment in safety-critical infrastructure or transport technologies without extensive testing.
* **Medical Diagnosis:** Do not use transcriptions for direct medical diagnosis or treatment decisions.
* **Emergency Services:** Not suitable as the primary tool for emergency response systems.

### 3. Deceptive Practices
* **Misinformation:** Do not use transcriptions to create or promote fraudulent or misleading audio content.
* **Deepfake Audio:** Do not use transcriptions to facilitate the creation of deceptive synthetic audio.
* **Impersonation:** No transcribing of content intended to impersonate individuals without authorization.
* **Misrepresentation:** No false claims about transcription accuracy or model capabilities.
* **Fake Content Generation:** No promotion of false audio-text pairs or fabricated conversations.

## Bias, Risks, and Limitations

Like any ASR system, this model carries inherent risks and limitations alongside its potential. Testing has focused predominantly on Swahili and English, leaving many linguistic variations and acoustic scenarios unexplored.

### Key Limitations

**Language and Dialect Variations**: Performance may vary significantly across Swahili dialects, regional accents, and code-switching patterns not represented in the training data.

**Audio Quality Sensitivity**: Performance degrades with poor audio quality, background noise, multiple speakers, or non-standard recording conditions.

**Domain Specificity**: The model may struggle with highly technical terminology, proper names, or domain-specific vocabulary outside its training scope.

**Contextual Understanding**: While improved over the base model, limited contextual interpretation may lead to incorrect transcriptions in ambiguous scenarios.

**Bias Considerations**: Like other AI models, this ASR system may reflect biases present in the training data, potentially affecting transcription quality for underrepresented speaker groups or topics.

### Responsible Deployment

Like other ASR systems, this model can occasionally produce transcriptions that are inaccurate, culturally insensitive, or otherwise problematic for certain audio inputs.

Before deploying this model in any production application, developers should conduct thorough safety testing and evaluation tailored to their specific use case, including testing across the diverse speaker demographics, audio conditions, and content types relevant to the intended application.

## Contact Us

For questions, feedback, or commercial inquiries, please reach out at **ai@jacarandahealth.org**.

## Framework Versions

- Transformers 4.51.3
- PyTorch 2.5.1+cu121
- Datasets 3.6.0
- Tokenizers 0.21.1

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{jacaranda_asr_stt_2025,
  title={Swahili-English Speech-to-Text Model},
  author={Jacaranda Health},
  year={2025},
  howpublished={\url{https://huggingface.co/Jacaranda-Health/ASR-STT}}
}
```