metadata
language: zh-HK
license: apache-2.0
tags:
  - automatic-speech-recognition
  - generated_from_trainer
  - hf-asr-leaderboard
  - robust-speech-event
datasets:
  - common_voice
model-index:
  - name: Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice
          type: common_voice
          args: zh-HK
        metrics:
          - name: Test CER
            type: cer
            value: 24.09
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 7
          type: mozilla-foundation/common_voice_7_0
          args: zh-HK
        metrics:
          - name: Test CER
            type: cer
            value: 23.1
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 8
          type: mozilla-foundation/common_voice_8_0
          args: zh-HK
        metrics:
          - name: Test CER
            type: cer
            value: 23.02
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Robust Speech Event - Dev Data
          type: speech-recognition-community-v2/dev_data
          args: zh-HK
        metrics:
          - name: Test CER
            type: cer
            value: 56.86
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Robust Speech Event - Test Data
          type: speech-recognition-community-v2/eval_data
          args: zh-HK
        metrics:
          - name: Test CER
            type: cer
            value: 55.76

Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM

Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM is an automatic speech recognition model based on the XLS-R architecture. It is a fine-tuned version of Wav2Vec2-XLS-R-300M on the zh-HK subset of the Common Voice dataset. A 5-gram language model, trained on multiple PyCantonese corpora, was subsequently added to this model.

This model was trained using HuggingFace's PyTorch framework and is part of the Robust Speech Challenge Event organized by HuggingFace. All training was done on a Tesla V100, sponsored by OVH.

All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.

As for the N-gram language model training, we followed the blog post tutorial provided by HuggingFace.
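Once trained, an n-gram model of this kind is typically applied at decoding time: each candidate transcription from the acoustic model is rescored by adding a weighted language model score and a length bonus. The sketch below illustrates that idea on toy data only. The function names, the candidate scores, and the `alpha` and `beta` values are all hypothetical, and this is not the decoder code used for this model.

```python
def fused_score(acoustic_logp: float, lm_logp: float, num_tokens: int,
                alpha: float = 0.5, beta: float = 1.5) -> float:
    """Combine acoustic and language model log-scores (shallow fusion).

    alpha weights the language model; beta rewards longer outputs.
    Both defaults are illustrative assumptions, not tuned values.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_tokens


def rescore(candidates: list, alpha: float = 0.5, beta: float = 1.5) -> str:
    """Return the text of the candidate with the highest fused score."""
    best = max(
        candidates,
        key=lambda c: fused_score(
            c["acoustic_logp"], c["lm_logp"], len(c["text"]), alpha, beta
        ),
    )
    return best["text"]


# Hypothetical candidates: the acoustic model slightly prefers the second
# spelling, while the language model prefers the first.
candidates = [
    {"text": "我哋去食飯", "acoustic_logp": -4.2, "lm_logp": -6.0},
    {"text": "我地去食飯", "acoustic_logp": -4.0, "lm_logp": -9.5},
]

without_lm = rescore(candidates, alpha=0.0, beta=0.0)
with_lm = rescore(candidates, alpha=0.5, beta=1.5)
```

With `alpha=0.0` the choice depends on the acoustic score alone, so the two calls can pick different candidates. This is how a language model can change the final transcription without retraining the acoustic model.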

Model

| Model | #params | Arch. | Training/Validation data (text) |
|---|---|---|---|
| wav2vec2-xls-r-300m-zh-HK-lm-v2 | 300M | XLS-R | Common Voice zh-HK Dataset |

Evaluation Results

The model achieves the following results on evaluation without a language model:

| Dataset | CER |
|---|---|
| Common Voice | 31.73% |
| Common Voice 7 | 23.11% |
| Common Voice 8 | 23.02% |
| Robust Speech Event - Dev Data | 56.60% |

With the addition of the language model, it achieves the following results:

| Dataset | CER |
|---|---|
| Common Voice | 24.09% |
| Common Voice 7 | 23.10% |
| Common Voice 8 | 23.02% |
| Robust Speech Event - Dev Data | 56.86% |
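For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between the reference and the prediction, divided by the number of reference characters. The following is a minimal pure-Python sketch of that definition. The function names and the example strings are illustrative, and this is not the evaluation script used to produce the numbers above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references: list, hypotheses: list) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


one_missing = cer(["你好世界"], ["你好世"])  # one deletion over four characters
over_one = cer(["好"], ["你好嗎"])           # two insertions over one character
```

Multiplying the result by 100 gives a percentage as reported in this card. Because insertions count as errors, the ratio can exceed 1.0 when the prediction is much longer than the reference, which is one way values above 1.0 can show up in a training log.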

Training procedure

The language model was not part of the training process. The results below are taken from the training run of the original automatic speech recognition model.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2000
  • num_epochs: 100.0
  • mixed_precision_training: Native AMP
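Two of the values above follow from the others: the total batch size is the per-device batch size times the gradient accumulation steps, and the learning rate at any step is determined by the linear schedule with warmup. The sketch below works through both. It assumes a single GPU, and `TOTAL_STEPS` is an estimate inferred from the training log (500 steps correspond to about 1.34 epochs, so roughly 373 steps per epoch over 100 epochs), not a value stated in this card.

```python
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
PEAK_LR = 1e-4
WARMUP_STEPS = 2000
TOTAL_STEPS = 37_300  # assumption: about 373 steps per epoch x 100 epochs


def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device * accum_steps * num_devices


def linear_lr(step: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS)
schedule = {step: linear_lr(step) for step in (0, 1000, 2000, 20000, TOTAL_STEPS)}
```

Under these assumptions the learning rate reaches its peak of 0.0001 at step 2000 and decays to zero at the final step.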

Training results

| Training Loss | Epoch | Step | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 69.8341 | 1.34 | 500 | 80.0722 | 1.0 | 1.0 |
| 6.6418 | 2.68 | 1000 | 6.6346 | 1.0 | 1.0 |
| 6.2419 | 4.02 | 1500 | 6.2909 | 1.0 | 1.0 |
| 6.0813 | 5.36 | 2000 | 6.1150 | 1.0 | 1.0 |
| 5.9677 | 6.7 | 2500 | 6.0301 | 1.1386 | 1.0028 |
| 5.9296 | 8.04 | 3000 | 5.8975 | 1.2113 | 1.0058 |
| 5.6434 | 9.38 | 3500 | 5.5404 | 2.1624 | 1.0171 |
| 5.1974 | 10.72 | 4000 | 4.5440 | 2.1702 | 0.9366 |
| 4.3601 | 12.06 | 4500 | 3.3839 | 2.2464 | 0.8998 |
| 3.9321 | 13.4 | 5000 | 2.8785 | 2.3097 | 0.8400 |
| 3.6462 | 14.74 | 5500 | 2.5108 | 1.9623 | 0.6663 |
| 3.5156 | 16.09 | 6000 | 2.2790 | 1.6479 | 0.5706 |
| 3.32 | 17.43 | 6500 | 2.1450 | 1.8337 | 0.6244 |
| 3.1918 | 18.77 | 7000 | 1.8536 | 1.9394 | 0.6017 |
| 3.1139 | 20.11 | 7500 | 1.7205 | 1.9112 | 0.5638 |
| 2.8995 | 21.45 | 8000 | 1.5478 | 1.0624 | 0.3250 |
| 2.7572 | 22.79 | 8500 | 1.4068 | 1.1412 | 0.3367 |
| 2.6881 | 24.13 | 9000 | 1.3312 | 2.0100 | 0.5683 |
| 2.5993 | 25.47 | 9500 | 1.2553 | 2.0039 | 0.6450 |
| 2.5304 | 26.81 | 10000 | 1.2422 | 2.0394 | 0.5789 |
| 2.4352 | 28.15 | 10500 | 1.1582 | 1.9970 | 0.5507 |
| 2.3795 | 29.49 | 11000 | 1.1160 | 1.8255 | 0.4844 |
| 2.3287 | 30.83 | 11500 | 1.0775 | 1.4123 | 0.3780 |
| 2.2622 | 32.17 | 12000 | 1.0704 | 1.7445 | 0.4894 |
| 2.2225 | 33.51 | 12500 | 1.0272 | 1.7237 | 0.5058 |
| 2.1843 | 34.85 | 13000 | 0.9756 | 1.8042 | 0.5028 |
| 2.1 | 36.19 | 13500 | 0.9527 | 1.8909 | 0.6055 |
| 2.0741 | 37.53 | 14000 | 0.9418 | 1.9026 | 0.5880 |
| 2.0179 | 38.87 | 14500 | 0.9363 | 1.7977 | 0.5246 |
| 2.0615 | 40.21 | 15000 | 0.9635 | 1.8112 | 0.5599 |
| 1.9448 | 41.55 | 15500 | 0.9249 | 1.7250 | 0.4914 |
| 1.8966 | 42.89 | 16000 | 0.9023 | 1.5829 | 0.4319 |
| 1.8662 | 44.24 | 16500 | 0.9002 | 1.4833 | 0.4230 |
| 1.8136 | 45.58 | 17000 | 0.9076 | 1.1828 | 0.2987 |
| 1.7908 | 46.92 | 17500 | 0.8774 | 1.5773 | 0.4258 |
| 1.7354 | 48.26 | 18000 | 0.8727 | 1.5037 | 0.4024 |
| 1.6739 | 49.6 | 18500 | 0.8636 | 1.1239 | 0.2789 |
| 1.6457 | 50.94 | 19000 | 0.8516 | 1.2269 | 0.3104 |
| 1.5847 | 52.28 | 19500 | 0.8399 | 1.3309 | 0.3360 |
| 1.5971 | 53.62 | 20000 | 0.8441 | 1.3153 | 0.3335 |
| 1.602 | 54.96 | 20500 | 0.8590 | 1.2932 | 0.3433 |
| 1.5063 | 56.3 | 21000 | 0.8334 | 1.1312 | 0.2875 |
| 1.4631 | 57.64 | 21500 | 0.8474 | 1.1698 | 0.2999 |
| 1.4997 | 58.98 | 22000 | 0.8638 | 1.4279 | 0.3854 |
| 1.4301 | 60.32 | 22500 | 0.8550 | 1.2737 | 0.3300 |
| 1.3798 | 61.66 | 23000 | 0.8266 | 1.1802 | 0.2934 |
| 1.3454 | 63.0 | 23500 | 0.8235 | 1.3816 | 0.3711 |
| 1.3678 | 64.34 | 24000 | 0.8550 | 1.6427 | 0.5035 |
| 1.3761 | 65.68 | 24500 | 0.8510 | 1.6709 | 0.4907 |
| 1.2668 | 67.02 | 25000 | 0.8515 | 1.5842 | 0.4505 |
| 1.2835 | 68.36 | 25500 | 0.8283 | 1.5353 | 0.4221 |
| 1.2961 | 69.7 | 26000 | 0.8339 | 1.5743 | 0.4369 |
| 1.2656 | 71.05 | 26500 | 0.8331 | 1.5331 | 0.4217 |
| 1.2556 | 72.39 | 27000 | 0.8242 | 1.4708 | 0.4109 |
| 1.2043 | 73.73 | 27500 | 0.8245 | 1.4469 | 0.4031 |
| 1.2722 | 75.07 | 28000 | 0.8202 | 1.4924 | 0.4096 |
| 1.202 | 76.41 | 28500 | 0.8290 | 1.3807 | 0.3719 |
| 1.1679 | 77.75 | 29000 | 0.8195 | 1.4097 | 0.3749 |
| 1.1967 | 79.09 | 29500 | 0.8059 | 1.2074 | 0.3077 |
| 1.1241 | 80.43 | 30000 | 0.8137 | 1.2451 | 0.3270 |
| 1.1414 | 81.77 | 30500 | 0.8117 | 1.2031 | 0.3121 |
| 1.132 | 83.11 | 31000 | 0.8234 | 1.4266 | 0.3901 |
| 1.0982 | 84.45 | 31500 | 0.8064 | 1.3712 | 0.3607 |
| 1.0797 | 85.79 | 32000 | 0.8167 | 1.3356 | 0.3562 |
| 1.0119 | 87.13 | 32500 | 0.8215 | 1.2754 | 0.3268 |
| 1.0216 | 88.47 | 33000 | 0.8163 | 1.2512 | 0.3184 |
| 1.0375 | 89.81 | 33500 | 0.8137 | 1.2685 | 0.3290 |
| 0.9794 | 91.15 | 34000 | 0.8220 | 1.2724 | 0.3255 |
| 1.0207 | 92.49 | 34500 | 0.8165 | 1.2906 | 0.3361 |
| 1.0169 | 93.83 | 35000 | 0.8153 | 1.2819 | 0.3305 |
| 1.0127 | 95.17 | 35500 | 0.8187 | 1.2832 | 0.3252 |
| 0.9978 | 96.51 | 36000 | 0.8111 | 1.2612 | 0.3210 |
| 0.9923 | 97.85 | 36500 | 0.8076 | 1.2278 | 0.3122 |
| 1.0451 | 99.2 | 37000 | 0.8086 | 1.2451 | 0.3156 |

Disclaimer

Consider the biases of the pre-training datasets, which may carry over into this model's results.

Authors

Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM was trained and evaluated by Wilson Wongso. All computation and development were done on OVH Cloud.

Framework versions

  • Transformers 4.17.0.dev0
  • PyTorch 1.10.2+cu102
  • Datasets 1.18.4.dev0
  • Tokenizers 0.11.0