metadata

language:
  - bn
license: apache-2.0
tags:
  - automatic-speech-recognition
  - bn
  - common_voice_9_0
  - openslr_SLR53
datasets:
  - common_voice_bn
  - openSLR53
  - multilingual_librispeech
metrics:
  - wer
  - cer
language_bcp47:
  - bn-BD
base_model: arijitx/wav2vec2-xls-r-300m-bengali
model-index:
  - name: shahruk10/wav2vec2-xls-r-300m-bengali-commonvoice
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Common Voice (Bengali)
          type: common_voice_9_0
          args: common_voice_bn
        metrics:
          - type: wer
            value: 0.01793038418929547
            name: Validation WER with 5-gram LM
          - type: cer
            value: 0.08078964599673999
            name: Validation CER with 5-gram LM

Wav2Vec2-XLS-R-300M-Bengali-CommonVoice

This model is a fine-tuned version of arijitx/wav2vec2-xls-r-300m-bengali on the Common Voice 9.0 Bengali dataset. In total, the model was trained on ~300 hours of Bengali (Bangladesh accent) 16 kHz audio data.
The training and validation partitions used were provided by the organizers of the BUET CSE Fest 2022 DL Sprint Competition on Kaggle.
The model placed first on both the public and private leaderboards.
A 5-gram language model generated from the training split was used with the model.
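
To illustrate what a 5-gram language model is built from, the sketch below counts word 5-grams in a list of transcripts. It is a minimal illustration only: the toolkit actually used to build the language model is not stated here, the function name `count_ngrams` and the `<s>` / `</s>` boundary tokens are assumptions, and a usable language model would also need smoothing and backoff on top of these raw counts.

```python
from collections import Counter


def count_ngrams(sentences, n=5):
    """Count word n-grams over whitespace-tokenized sentences.

    Hypothetical helper for illustration; each sentence is padded with
    n-1 start tokens and one end token so that words near the sentence
    boundaries also appear in full-length n-grams.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        # Slide a window of width n across the padded token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Example with placeholder text (not taken from the training split).
counts = count_ngrams(["a b c d e f"], n=5)
print(counts[("a", "b", "c", "d", "e")])
```

During decoding, counts like these (after smoothing) let the decoder prefer word sequences that were frequent in the training transcripts.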

Metrics

The model was evaluated using Word Error Rate (WER) and Character Error Rate (CER) on the validation set. At the time, the test set labels had not been released by the organizers of the Kaggle competition, which provided the data splits for training.

Model	Split	CER	WER
With 5-gram LM	Validation	0.08079	0.017930
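
For reference, WER and CER are both edit-distance ratios: the number of substitutions, insertions and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a minimal pure-Python version under stated assumptions: the helper names `edit_distance` and `error_rate` are hypothetical, words are split on whitespace, characters are Unicode code points (so Bengali combining marks count individually, and spaces count as characters), and no text normalization is applied. The evaluation script used for the numbers above may differ on any of these points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit="word") or CER (unit="char")."""
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return total_edits / total_len


refs = ["the cat sat"]
hyps = ["the cat sit"]
print(error_rate(refs, hyps, unit="word"))  # 1 substitution over 3 words
print(error_rate(refs, hyps, unit="char"))  # 1 substitution over 11 characters
```

Edits are summed over the whole corpus before dividing, so long utterances weigh more than short ones; averaging per-utterance rates would give a different number.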

Training

The training notebook for this model can be found on Kaggle here.
The inference notebook for this model can be found on Kaggle here.
The model was first trained for 15 epochs on the training split (with on-the-fly augmentation). Dropouts were enabled and a cosine decay learning rate schedule starting from 3e-5 was used.
The best iteration from the first run was further fine-tuned for 5 epochs at a constant learning rate of 1e-7 with dropouts disabled.
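
The two learning rate regimes described above can be sketched as plain functions of the training step. This is an illustration under assumptions, not the code from the training notebook: the names `stage1_lr` and `stage2_lr` are hypothetical, the cosine decay is assumed to have no warmup and to end at zero, and the schedule is assumed to be updated per step rather than per epoch.

```python
import math


def stage1_lr(step, total_steps, peak_lr=3e-5, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps.

    final_lr=0.0 and the absence of warmup are assumptions.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def stage2_lr(step, lr=1e-7):
    """Constant learning rate for the second fine-tuning run."""
    return lr


# total_steps is a placeholder; the real value depends on dataset size,
# batch size and the 15-epoch budget.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, stage1_lr(step, total_steps))
```

The second run's rate of 1e-7 is more than two orders of magnitude below the first run's starting rate, so it only makes small adjustments to the best checkpoint from the first run.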