---
language:
- ar
license: apache-2.0
tags:
- automatic-speech-recognition
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
metrics:
- wer
- cer
model-index:
- name: Sinai Voice Arabic Speech Recognition Model
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_8_0
      name: Common Voice ar
      args: ar
    metrics:
    - type: wer
      value: 0.18855042016806722
      name: Test WER
    - type: cer
      value: 0.05138746531806014
      name: Test CER
---
# Sinai Voice Arabic Speech Recognition Model

The **Sinai Voice** model recognizes Modern Standard Arabic speech and transcribes it to text.
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Arabic subset of the `mozilla-foundation/common_voice_8_0` dataset.
It achieves the following results on the evaluation set:
- Loss: 0.22
- WER: 0.189
- CER: 0.051
#### Evaluation Commands
To evaluate on `mozilla-foundation/common_voice_8_0` with the `test` split, run:
```bash
python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test
```
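
If the event's `eval.py` script is not at hand, the sketch below runs an equivalent check directly. It assumes the `datasets` and `jiwer` packages are installed and that you have accepted the gated dataset's terms; it also skips the text normalization the official script applies, so scores may differ slightly:

```python
import torch
from datasets import Audio, load_dataset
from jiwer import cer, wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

# Gated dataset: log in first (e.g. `huggingface-cli login`)
ds = load_dataset("mozilla-foundation/common_voice_8_0", "ar", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in ds.select(range(100)):  # small subset for a quick sanity check
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predictions.append(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
    references.append(sample["sentence"])

print("WER:", wer(references, predictions))
print("CER:", cer(references, predictions))
```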
### Inference Without LM
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torchaudio
import torch

def speech_file_to_array_fn(voice_path, resampling_to=16000):
    # Load the audio file and resample it to the 16 kHz rate the model expects
    speech_array, sampling_rate = torchaudio.load(voice_path)
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
    return resampler(speech_array)[0].numpy(), resampling_to

# load the model
cp = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(cp)
model = Wav2Vec2ForCTC.from_pretrained(cp)

# recognize the text in a sample sound file
sound_path = './my_voice.mp3'
sample, sr = speech_file_to_array_fn(sound_path)

inputs = processor([sample], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
```
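
For quick experiments, the same greedy (no language model) decoding is also available through the high-level `pipeline` API. A minimal sketch, assuming `ffmpeg` is installed so compressed formats such as MP3 can be decoded:

```python
from transformers import pipeline

# Loads the model and runs greedy CTC decoding, as in the example above
asr = pipeline("automatic-speech-recognition", model="bakrianoo/sinai-voice-ar-stt")
print(asr("./my_voice.mp3")["text"])
```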
### Training hyperparameters
The following hyperparameters were used during training (an illustrative `TrainingArguments` sketch follows the list):
- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 10
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 8.32
- mixed_precision_training: Native AMP
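
For reference, a minimal sketch of how these values map onto the standard `TrainingArguments` API. Dataset preparation and the `Trainer` call are omitted and the `output_dir` is illustrative; the effective batch size of 128 suggests accumulation across more than one device, which is inferred from the numbers above rather than stated:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sinai-voice-ar-stt",  # illustrative path
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=10,
    gradient_accumulation_steps=2,      # 32 x 2 per device; total batch size 128 implies 2 devices
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=8.32,
    fp16=True,                          # native AMP mixed precision
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default optimizer
)
```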