xls-r-uzbek-cv8 / README.md
metadata
language:
  - uz
license: apache-2.0
tags:
  - automatic-speech-recognition
  - generated_from_trainer
  - hf-asr-leaderboard
  - mozilla-foundation/common_voice_8_0
  - robust-speech-event
datasets:
  - mozilla-foundation/common_voice_8_0
base_model: facebook/wav2vec2-xls-r-300m
model-index:
  - name: XLS-R-300M Uzbek CV8
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 8
          type: mozilla-foundation/common_voice_8_0
          args: uz
        metrics:
          - type: wer
            value: 15.065
            name: Test WER (with LM)
          - type: cer
            value: 3.077
            name: Test CER (with LM)
          - type: wer
            value: 32.88
            name: Test WER (no LM)
          - type: cer
            value: 6.53
            name: Test CER (no LM)

XLS-R-300M Uzbek CV8

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Uzbek (uz) subset of the mozilla-foundation/common_voice_8_0 dataset. It achieves the following results on the validation set:

  • Loss: 0.3063
  • Wer: 0.3852
  • Cer: 0.0777
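
For readers unfamiliar with the metrics, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly defined: the edit distance between reference and hypothesis, divided by the reference length. It is not the evaluation script used for this model; the function names and the sample sentence are illustrative assumptions. The values in this section are fractions (0.3852 means 38.52%), while the metadata block reports percentages.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one wrong character in one of three words.
reference = "salom dunyo qalaysan"
hypothesis = "salom dunya qalaysan"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis contains many more words than the reference. Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances rather than averaging per-utterance ratios.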

Model description

For a description of the model architecture, see facebook/wav2vec2-xls-r-300m.

The model vocabulary consists of the Modern Latin alphabet for Uzbek, with punctuation removed. Note that the characters <‘> and <’> do not count as punctuation, as <‘> modifies <o> and <g>, and <’> indicates the glottal stop or a long vowel.
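
A rough sketch of the punctuation rule described above, assuming <‘> is U+2018 and <’> is U+2019. The actual preprocessing code is not part of this card; the function name, the choice to replace punctuation with a space, and the sample phrases are assumptions for illustration only.

```python
import unicodedata

# Assumption: the two letter-like marks are U+2018 (‘) and U+2019 (’).
# Unicode classifies both as punctuation, so they must be exempted explicitly.
KEEP = {"\u2018", "\u2019"}


def strip_punctuation(text):
    """Replace punctuation with spaces, keeping ‘ and ’, then collapse whitespace."""
    chars = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        chars.append(" " if is_punct and ch not in KEEP else ch)
    return " ".join("".join(chars).split())


print(strip_punctuation("O\u2018zbekiston, go\u2018zal!"))  # comma and "!" removed
print(strip_punctuation("ma\u2019no."))                     # trailing period removed
```

Note that an ASCII apostrophe (') used in place of <‘> or <’> would be treated as punctuation by this sketch and removed.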

The decoder uses a KenLM language model built on Common Voice text.
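
A minimal inference sketch using the Transformers `pipeline` API. The repository id `lucio/xls-r-uzbek-cv8` is an assumption inferred from the page header, and the audio path is hypothetical. Whether language-model decoding is applied depends on the installed Transformers version and on `pyctcdecode` and `kenlm` being available, so treat this as a starting point rather than the evaluation setup behind the reported scores.

```python
# Assumption: the model is published under this repository id.
MODEL_ID = "lucio/xls-r-uzbek-cv8"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # Decoding a file path requires ffmpeg to be available on the system.
    return asr(audio_path)["text"]


# Usage (hypothetical file name; downloads the model on first call):
# print(transcribe("sample_uz.wav"))
```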

Intended uses & limitations

This model is expected to be of some utility for low-fidelity use cases such as:

  • Draft video captions
  • Indexing of recorded broadcasts

The model is not reliable enough to use as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of any of the contributors to the Common Voice dataset nor any other speakers.

Training and evaluation data

50% of the official Common Voice train split was used as training data, and 50% of the official dev split was used as validation data. The full test set was used for the final evaluation of the model without the LM, while the model with the LM was evaluated on only 500 examples from the test set.
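
One way to express these subsets with the `datasets` split-slicing syntax is sketched below. The card does not say which half of each split or which 500 test examples were selected, so taking the first portion of each is an assumption, as are the dictionary and function names. The dataset is gated on the Hugging Face Hub, so authentication is needed, and the keyword for passing a token differs between `datasets` versions.

```python
DATASET_ID = "mozilla-foundation/common_voice_8_0"
LANGUAGE = "uz"

# Assumption: the first 50% / first 500 rows; the card does not specify the selection.
SPLITS = {
    "train": "train[:50%]",
    "validation": "validation[:50%]",  # the official dev split
    "test_no_lm": "test",              # full test set
    "test_with_lm": "test[:500]",      # 500 examples
}


def load_subset(name, **load_kwargs):
    """Load one of the subsets above; extra kwargs are forwarded to load_dataset."""
    # Imported lazily so the split definitions can be inspected without datasets installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, LANGUAGE, split=SPLITS[name], **load_kwargs)


# Usage (requires accepting the dataset terms and being logged in):
# train = load_subset("train")
```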

The KenLM language model was compiled from the target sentences of the train and other dataset splits.
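
KenLM estimates an n-gram model from a plain-text corpus with one sentence per line. The sketch below shows only that corpus-assembly step; the function name and sample sentences are hypothetical, and the n-gram order and any further text normalization used for this model are not stated in this card.

```python
def build_lm_corpus(*sentence_sources):
    """Join sentences from several iterables into one-sentence-per-line text."""
    lines = []
    for source in sentence_sources:
        for sentence in source:
            cleaned = " ".join(sentence.split())  # collapse whitespace
            if cleaned:                           # skip empty sentences
                lines.append(cleaned)
    return "".join(line + "\n" for line in lines)


# Hypothetical stand-ins for the target sentences of the train and other splits.
train_sentences = ["salom dunyo", "  yaxshi   kun "]
other_sentences = ["", "xayr"]
corpus = build_lm_corpus(train_sentences, other_sentences)
print(corpus, end="")

# The resulting text could then be written to a file and passed to KenLM's
# lmplz tool (for example: lmplz -o <order> < corpus.txt > lm.arpa).
```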

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 32
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 100.0
  • mixed_precision_training: Native AMP
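
The values above map onto Hugging Face `TrainingArguments` roughly as sketched here. The mapping of `train_batch_size` to the per-device batch size, the single-device setup, and the output directory are assumptions; the original training script is not reproduced in this card.

```python
TRAINING_KWARGS = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 32,  # assumption: train_batch_size is per device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "num_train_epochs": 100.0,
    "fp16": True,                       # Native AMP mixed precision
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples consumed per optimizer step."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)


def build_training_arguments(output_dir="./xls-r-uzbek-cv8"):
    """Construct TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


# 32 per device x 4 accumulation steps x 1 device matches total_train_batch_size.
print(effective_batch_size(TRAINING_KWARGS))
```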

Training results

| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|---|---|---|---|---|---|
| 3.1401 | 3.25 | 500 | 3.1146 | 1.0 | 1.0 |
| 2.7484 | 6.49 | 1000 | 2.2842 | 1.0065 | 0.7069 |
| 1.0899 | 9.74 | 1500 | 0.5414 | 0.6125 | 0.1351 |
| 0.9465 | 12.99 | 2000 | 0.4566 | 0.5635 | 0.1223 |
| 0.8771 | 16.23 | 2500 | 0.4212 | 0.5366 | 0.1161 |
| 0.8346 | 19.48 | 3000 | 0.3994 | 0.5144 | 0.1102 |
| 0.8127 | 22.73 | 3500 | 0.3819 | 0.4944 | 0.1051 |
| 0.7833 | 25.97 | 4000 | 0.3705 | 0.4798 | 0.1011 |
| 0.7603 | 29.22 | 4500 | 0.3661 | 0.4704 | 0.0992 |
| 0.7424 | 32.47 | 5000 | 0.3529 | 0.4577 | 0.0957 |
| 0.7251 | 35.71 | 5500 | 0.3410 | 0.4473 | 0.0928 |
| 0.7106 | 38.96 | 6000 | 0.3401 | 0.4428 | 0.0919 |
| 0.7027 | 42.21 | 6500 | 0.3355 | 0.4353 | 0.0905 |
| 0.6927 | 45.45 | 7000 | 0.3308 | 0.4296 | 0.0885 |
| 0.6828 | 48.7 | 7500 | 0.3246 | 0.4204 | 0.0863 |
| 0.6706 | 51.95 | 8000 | 0.3250 | 0.4233 | 0.0868 |
| 0.6629 | 55.19 | 8500 | 0.3264 | 0.4159 | 0.0849 |
| 0.6556 | 58.44 | 9000 | 0.3213 | 0.4100 | 0.0835 |
| 0.6484 | 61.69 | 9500 | 0.3182 | 0.4124 | 0.0837 |
| 0.6407 | 64.93 | 10000 | 0.3171 | 0.4050 | 0.0825 |
| 0.6375 | 68.18 | 10500 | 0.3150 | 0.4039 | 0.0822 |
| 0.6363 | 71.43 | 11000 | 0.3129 | 0.3991 | 0.0810 |
| 0.6307 | 74.67 | 11500 | 0.3114 | 0.3986 | 0.0807 |
| 0.6232 | 77.92 | 12000 | 0.3103 | 0.3895 | 0.0790 |
| 0.6216 | 81.17 | 12500 | 0.3086 | 0.3891 | 0.0790 |
| 0.6174 | 84.41 | 13000 | 0.3082 | 0.3881 | 0.0785 |
| 0.6196 | 87.66 | 13500 | 0.3059 | 0.3875 | 0.0782 |
| 0.6174 | 90.91 | 14000 | 0.3084 | 0.3862 | 0.0780 |
| 0.6169 | 94.16 | 14500 | 0.3070 | 0.3860 | 0.0779 |
| 0.6166 | 97.4 | 15000 | 0.3066 | 0.3855 | 0.0778 |

Framework versions

  • Transformers 4.16.2
  • Pytorch 1.10.2+cu102
  • Datasets 1.18.3
  • Tokenizers 0.11.0