---
language: 
- ar

license: apache-2.0
tags:
- automatic-speech-recognition
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
metrics:
- wer
- cer
model-index:
- name: Sinai Voice Arabic Speech Recognition Model
  results:
  - task: 
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_8_0
      name: Common Voice ar
      args: ar
    metrics:
      - type: wer
        value: 0.189
        name: Test WER
      - type: cer
        value: 0.051
        name: Test CER
      # Full-precision test scores: WER 0.18855042016806722, CER 0.05138746531806014
---

# Sinai Voice Arabic Speech Recognition Model

# The **Sinai Voice** model for recognizing Modern Standard Arabic speech and converting it to text

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Arabic (`ar`) configuration of the `mozilla-foundation/common_voice_8_0` dataset.

It achieves the following results on the evaluation set:
- Loss: 0.22
- WER: 0.189
- CER: 0.051
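
For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are usually defined: the edit distance between reference and hypothesis, divided by the reference length, counted in words or in characters. The function names are illustrative and this is not the code used by `eval.py`; evaluation scripts typically aggregate edits over the whole test set rather than scoring a single pair.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one reference/hypothesis pair."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one reference/hypothesis pair."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three; one substituted character out of eleven.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

Under this definition, a WER of 0.189 corresponds to roughly 19 word-level edits per 100 reference words.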

### Evaluation Commands
1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:

```bash
python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test
```


### Inference Without LM

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

def speech_file_to_array_fn(voice_path, resampling_to=16000):
    """Load an audio file and resample it to the rate the model expects."""
    speech_array, sampling_rate = torchaudio.load(voice_path)
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)

    # Keep the first channel only; return the rate of the resampled audio,
    # not the file's original rate
    return resampler(speech_array)[0].numpy(), resampling_to

# load the model
cp = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(cp)
model = Wav2Vec2ForCTC.from_pretrained(cp)

# recognize the text in a sample sound file
sound_path = './my_voice.mp3'

sample, sr = speech_file_to_array_fn(sound_path)
inputs = processor([sample], sampling_rate=sr, return_tensors="pt", padding=True)

# run the acoustic model without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy decoding: pick the most likely token for every frame
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
```
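
The `torch.argmax` step followed by `processor.batch_decode` in the snippet above amounts to greedy CTC decoding: consecutive repeated token ids are merged, then blank tokens are dropped. The sketch below illustrates that collapse rule on plain integer ids. The function name and the `blank_id=0` default are assumptions for illustration only; in a real Wav2Vec2 tokenizer the blank is the padding token, whose id depends on the vocabulary.

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding rule).

    blank_id=0 is a placeholder; the actual blank id comes from the tokenizer.
    """
    collapsed = []
    prev = None
    for token_id in ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != prev and token_id != blank_id:
            collapsed.append(token_id)
        prev = token_id
    return collapsed


# A blank between two identical ids keeps them as two separate tokens.
frame_ids = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(frame_ids))
```

Because each frame is decoded independently, this approach needs no language model, which is why the section is titled "Inference Without LM".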

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 10
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 8.32
- mixed_precision_training: Native AMP
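
Two of the values above can be read off with a little arithmetic. The sketch below shows how a total batch size is commonly derived and how a linear schedule with warmup behaves. The device count and `total_steps` are not stated in this card and are assumptions: two devices is simply the value that makes 32 × 2 × devices equal 128, and `total_steps=10000` is a placeholder.

```python
def effective_batch_size(per_device, accumulation_steps, num_devices):
    """Samples contributing to one optimizer step."""
    return per_device * accumulation_steps * num_devices


def linear_schedule_lr(step, total_steps, base_lr=2e-4, warmup_steps=1000):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0.0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# num_devices=2 is an assumption consistent with the listed total of 128.
print(effective_batch_size(per_device=32, accumulation_steps=2, num_devices=2))

# total_steps=10000 is a hypothetical value chosen for illustration.
for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_schedule_lr(step, total_steps=10000))
```

The learning rate peaks at 0.0002 once the 1000 warmup steps finish and decreases linearly afterwards.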