---
language:
  - ur
license: apache-2.0
tags:
  - automatic-speech-recognition
  - robust-speech-event
datasets:
  - common_voice
metrics:
  - wer
  - cer
model-index:
  - name: wav2vec2-large-xlsr-53-urdu
    results:
      - task:
          type: automatic-speech-recognition
          name: Urdu Speech Recognition
        dataset:
          type: common_voice
          name: Urdu
          args: ur
        metrics:
          - type: wer
            value: 66.2
            name: Test WER
            args:
              - learning_rate: 0.0003
              - train_batch_size: 16
              - eval_batch_size: 8
              - seed: 42
              - gradient_accumulation_steps: 2
              - total_train_batch_size: 32
              - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
              - lr_scheduler_type: linear
              - lr_scheduler_warmup_steps: 200
              - num_epochs: 50
              - mixed_precision_training: Native AMP
          - type: cer
            value: 31.7
            name: Test CER
            args:
              - learning_rate: 0.0003
              - train_batch_size: 16
              - eval_batch_size: 8
              - seed: 42
              - gradient_accumulation_steps: 2
              - total_train_batch_size: 32
              - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
              - lr_scheduler_type: linear
              - lr_scheduler_warmup_steps: 200
              - num_epochs: 50
              - mixed_precision_training: Native AMP
---

# wav2vec2-large-xlsr-53-urdu

This model is a fine-tuned version of m3hrdadfi/wav2vec2-large-xlsr-persian-v3 on the common_voice dataset. It achieves the following results on the evaluation set:

- Loss: 1.5727
- WER: 0.6620
- CER: 0.3166

The training and validation data amount to only 0.58 hours of audio. It was hard to train any model from scratch on so few samples, so I took the Persian checkpoint and fine-tuned the XLSR model instead.
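
One way to try the model (a usage sketch, not code from this repository) is through the `transformers` ASR pipeline; the repository id and the audio path below are assumptions/placeholders.

```python
# Minimal inference sketch; the model id and audio path are assumed placeholders.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="kingabzpro/wav2vec2-60-urdu",  # assumed repository id
)

# wav2vec2 expects 16 kHz audio; the pipeline decodes and resamples
# file input to the feature extractor's rate (ffmpeg required).
print(asr("sample_urdu.wav")["text"])  # placeholder audio path
```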

## Training procedure

The model was fine-tuned from m3hrdadfi/wav2vec2-large-xlsr-persian-v3 because of the small number of Urdu samples; Persian and Urdu are quite similar languages.

### Training hyperparameters

The following hyperparameters were used during training (an illustrative `TrainingArguments` sketch follows the list):

- learning_rate: 0.0003
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 200
- num_epochs: 50
- mixed_precision_training: Native AMP
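
Purely as an illustration, and not the original training script, the values above map roughly onto `transformers.TrainingArguments` as sketched below; the output directory and anything not listed in the card are assumptions.

```python
# Hypothetical mapping of the listed hyperparameters to TrainingArguments;
# output_dir and any option not named in the card are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-53-urdu",  # assumed output path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size: 16 * 2 = 32
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=200,
    num_train_epochs=50,
    fp16=True,                       # native AMP mixed-precision training
)
```

The Adam settings listed above (betas=(0.9,0.999), epsilon=1e-08) are the trainer defaults, so they are not set explicitly in the sketch.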

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    | CER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|
| 2.9707        | 8.33  | 100  | 1.2689          | 0.8463 | 0.4373 |
| 0.746         | 16.67 | 200  | 1.2370          | 0.7214 | 0.3486 |
| 0.3719        | 25.0  | 300  | 1.3885          | 0.6908 | 0.3381 |
| 0.2411        | 33.33 | 400  | 1.4780          | 0.6690 | 0.3186 |
| 0.1841        | 41.67 | 500  | 1.5557          | 0.6629 | 0.3241 |
| 0.165         | 50.0  | 600  | 1.5727          | 0.6620 | 0.3166 |
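
For reference, WER and CER figures like those above can be computed from decoded predictions and reference transcripts with the `jiwer` package; the Urdu strings below are placeholders, not actual model output.

```python
# Illustrative WER/CER computation with jiwer; the strings are placeholders only.
import jiwer

ref = "یہ ایک مثال ہے"  # placeholder reference transcript
hyp = "یہ ایک مثال"     # placeholder model prediction

print(f"WER: {jiwer.wer(ref, hyp):.4f}")
print(f"CER: {jiwer.cer(ref, hyp):.4f}")
```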

### Framework versions

- Transformers 4.15.0
- Pytorch 1.10.0+cu111
- Datasets 1.17.0
- Tokenizers 0.10.3