---
language:
  - ar
license: apache-2.0
tags:
  - automatic-speech-recognition
  - robust-speech-event
datasets:
  - mozilla-foundation/common_voice_8_0
metrics:
  - wer
  - cer
model-index:
  - name: Sinai Voice Arabic Speech Recognition Model
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: mozilla-foundation/common_voice_8_0
          name: Common Voice ar
          args: ar
        metrics:
          - type: wer
            value: 0.18
            name: Test WER
          - type: cer
            value: 0.051
            name: Test CER
---

# Sinai Voice Arabic Speech Recognition Model

Sinai Voice: a model for recognizing Modern Standard Arabic speech and transcribing it to text.

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Common Voice 8.0 (mozilla-foundation/common_voice_8_0) dataset.

It achieves the following results on the evaluation set:

- Loss: 0.22
- WER: 0.189
- CER: 0.051

## Evaluation Commands

1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:

```bash
python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test
```
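
If you would rather compute the metrics directly, the sketch below is one way to do it with the `datasets` and `jiwer` packages (an assumption; `eval.py` is the authoritative script). The `transcribe` helper is hypothetical and stands in for the inference code shown in the next section.

```python
# Minimal metric-computation sketch; assumes `pip install datasets jiwer`
# and Hugging Face credentials for the gated Common Voice 8 dataset.
from datasets import load_dataset
import jiwer

ds = load_dataset("mozilla-foundation/common_voice_8_0", "ar", split="test")

references = [example["sentence"] for example in ds]
predictions = [transcribe(example["path"]) for example in ds]  # hypothetical helper

print("WER:", jiwer.wer(references, predictions))
print("CER:", jiwer.cer(references, predictions))
```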

## Inference Without LM

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torchaudio
import torch

def speech_file_to_array_fn(voice_path, resampling_to=16000):
    # load the audio file and resample it to the rate the model expects
    speech_array, sampling_rate = torchaudio.load(voice_path)
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)

    # keep only the first channel and return the new sampling rate
    return resampler(speech_array)[0].numpy(), resampling_to

# load the model and its processor
cp = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(cp)
model = Wav2Vec2ForCTC.from_pretrained(cp)

# recognize the text in a sample sound file
sound_path = './my_voice.mp3'

sample, sr = speech_file_to_array_fn(sound_path)
inputs = processor([sample], sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy CTC decoding: pick the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
```

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 10
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 8.32
- mixed_precision_training: Native AMP
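
As a rough sketch, the list above maps onto `transformers.TrainingArguments` as follows; `output_dir` and the device count implied by the total batch size are assumptions, not values stated in this card.

```python
from transformers import TrainingArguments

# Sketch only: fields mirror the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./sinai-voice-ar-stt",  # hypothetical path
    learning_rate=2e-4,
    per_device_train_batch_size=32,     # 32 x 2 accumulation x 2 devices = 128 total (device count assumed)
    per_device_eval_batch_size=10,
    gradient_accumulation_steps=2,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=8.32,
    fp16=True,                          # Native AMP mixed precision
)
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the transformers defaults.
```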