language: ko
license: apache-2.0
tags:
  - automatic-speech-recognition
  - generated_from_trainer
  - hf-asr-leaderboard
  - robust-speech-event
datasets:
  - kresnik/zeroth_korean
base_model: Wav2Vec2-XLS-R-300M
model-index:
  - name: Wav2Vec2 XLS-R 300M Korean
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Zeroth Korean
          type: kresnik/zeroth_korean
          args: clean
        metrics:
          - type: wer
            value: 29.54
            name: Test WER
          - type: cer
            value: 9.53
            name: Test CER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Robust Speech Event - Dev Data
          type: speech-recognition-community-v2/dev_data
          args: ko
        metrics:
          - type: wer
            value: 76.26
            name: Test WER
          - type: cer
            value: 38.67
            name: Test CER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Robust Speech Event - Test Data
          type: speech-recognition-community-v2/eval_data
          args: ko
        metrics:
          - type: wer
            value: 73.18
            name: Test WER

Wav2Vec2 XLS-R 300M Korean

Wav2Vec2 XLS-R 300M Korean is an automatic speech recognition model based on the XLS-R architecture. This model is a fine-tuned version of Wav2Vec2-XLS-R-300M on the Zeroth Korean dataset.
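Fine-tuned Wav2Vec2 models in Transformers are usually trained with a CTC head (`Wav2Vec2ForCTC`). This card does not state that explicitly, so treat it as an assumption. Under that assumption, the model predicts one token per audio frame, and the transcript is recovered by collapsing repeated predictions and dropping the CTC blank. The sketch below shows greedy CTC decoding in plain Python. The toy vocabulary is hypothetical. Using `<pad>` as the blank and `|` as the word delimiter follows the defaults of Transformers' `Wav2Vec2CTCTokenizer`, not anything documented for this specific model.

```python
from itertools import groupby

# Assumed special tokens, mirroring Wav2Vec2CTCTokenizer defaults (not confirmed for this model).
BLANK_TOKEN = "<pad>"   # CTC blank: Wav2Vec2ForCTC uses the pad token id as the blank
WORD_DELIMITER = "|"    # marks word boundaries in a character-level vocabulary


def ctc_greedy_decode(frame_ids, id_to_token):
    """Turn per-frame argmax token ids into text with greedy CTC decoding."""
    # 1. Merge consecutive duplicate predictions, e.g. [2, 2, 2] -> [2].
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; a blank between two identical ids keeps both characters.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != BLANK_TOKEN]
    # 3. Map the word delimiter back to spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


if __name__ == "__main__":
    # Hypothetical toy vocabulary, not the model's actual vocab.json.
    vocab = {0: "<pad>", 1: "|", 2: "안", 3: "녕", 4: "하", 5: "세", 6: "요"}
    frames = [2, 2, 0, 3, 3, 0, 4, 5, 0, 6, 6]
    print(ctc_greedy_decode(frames, vocab))  # 안녕하세요
```

Greedy decoding takes the single most likely token per frame. Beam search with a language model is a common alternative, but nothing in this card indicates one was used.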

This model was trained with PyTorch using Hugging Face's Transformers library and is part of the Robust Speech Challenge Event organized by Hugging Face. All training was done on a single Tesla V100 GPU sponsored by OVH.

All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.

Model

Model                       #params  Arch.  Training/Validation data (text)
wav2vec2-xls-r-300m-korean  300M     XLS-R  Zeroth Korean Dataset

Evaluation Results

The model achieves the following results on the evaluation sets:

Dataset                          Loss    WER     CER
Zeroth Korean                    0.2089  29.54%  9.53%
Robust Speech Event - Dev Data   N/A     76.26%  38.67%
Robust Speech Event - Test Data  N/A     73.18%  N/A
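The card does not say exactly how WER and CER were computed. Both are conventionally defined as the Levenshtein edit distance between hypothesis and reference (substitutions + deletions + insertions) divided by the reference length. WER counts over words and CER over characters. Corpus-level scores sum the edits and reference lengths over all utterances. In space-delimited Korean text, one wrong syllable makes the whole space-separated word an error, which is one reason WER sits well above CER here. The following is a minimal pure-Python sketch of that definition. The tokenization it uses (whitespace split for words, and counting spaces as characters for CER) is an assumption and may differ from the scoring used for the numbers above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1]


def corpus_error_rate(references, hypotheses, unit):
    """Corpus-level WER (unit='word') or CER (unit='char'): total edits / total reference length."""
    split = str.split if unit == "word" else list  # 'char' keeps spaces as characters
    edits = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return edits / total


if __name__ == "__main__":
    refs = ["나는 학교에 간다"]
    hyps = ["나는 학교 간다"]
    print(f"WER = {corpus_error_rate(refs, hyps, 'word'):.2%}")  # WER = 33.33%
    print(f"CER = {corpus_error_rate(refs, hyps, 'char'):.2%}")  # CER = 11.11%
```

A single dropped syllable (에) costs one of three words but only one of nine characters, which illustrates the WER/CER gap.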

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 7.5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2000
  • num_epochs: 50.0
  • mixed_precision_training: Native AMP
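Two of the values above are related by simple arithmetic. The total train batch size is the per-device batch size times the gradient accumulation steps, times the number of devices (one V100 here), so 8 × 4 = 32. The `linear` scheduler with 2000 warmup steps ramps the learning rate from 0 to 7.5e-05 and then decays it linearly to 0 at the last step. The sketch below mirrors the formula of `get_linear_schedule_with_warmup` in Transformers. The total of 34,750 optimizer steps is an estimate, not a logged value. It is inferred from the training results table, where step 34,500 falls at epoch 49.64 (about 695 steps per epoch), multiplied by 50 epochs.

```python
def total_train_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 7.5e-05
WARMUP_STEPS = 2000
STEPS_PER_EPOCH = round(34500 / 49.64)  # estimate from the training log: ~695
TOTAL_STEPS = STEPS_PER_EPOCH * 50      # hypothetical total for 50 epochs: 34,750

if __name__ == "__main__":
    print(total_train_batch_size(8, 4))  # 32
    for step in (0, 1000, 2000, 18000, TOTAL_STEPS):
        print(step, f"{linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS):.3e}")
```

The learning rate peaks at step 2000 and reaches 0 only at the final step, so late checkpoints are trained with a very small learning rate.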

Training results

WER and CER in this table are fractions (0 to 1) rather than percentages. Values above 1.0, as at step 3500, are possible because insertions also count as errors.

Training Loss  Epoch  Step   Validation Loss  WER     CER
19.7138        0.72   500    19.6427          1.0     1.0
4.8039         1.44   1000   4.7842           1.0     1.0
4.5619         2.16   1500   4.5608           0.9992  0.9598
4.254          2.88   2000   4.2729           0.9955  0.9063
4.1905         3.6    2500   4.2257           0.9903  0.8758
4.0683         4.32   3000   3.9294           0.9937  0.7911
3.486          5.04   3500   2.7045           1.0012  0.5934
2.946          5.75   4000   1.9691           0.9425  0.4634
2.634          6.47   4500   1.5212           0.8807  0.3850
2.4066         7.19   5000   1.2551           0.8177  0.3601
2.2651         7.91   5500   1.0423           0.7650  0.3039
2.1828         8.63   6000   0.9599           0.7273  0.3106
2.1023         9.35   6500   0.9482           0.7161  0.3063
2.0536         10.07  7000   0.8242           0.6767  0.2860
1.9803         10.79  7500   0.7643           0.6563  0.2637
1.9468         11.51  8000   0.7319           0.6441  0.2505
1.9178         12.23  8500   0.6937           0.6320  0.2489
1.8515         12.95  9000   0.6443           0.6053  0.2196
1.8083         13.67  9500   0.6286           0.6122  0.2148
1.819          14.39  10000  0.6015           0.5986  0.2074
1.7684         15.11  10500  0.5682           0.5741  0.1982
1.7195         15.83  11000  0.5385           0.5592  0.2007
1.7044         16.55  11500  0.5362           0.5524  0.2097
1.6879         17.27  12000  0.5119           0.5489  0.2083
1.656          17.98  12500  0.4990           0.5362  0.1968
1.6122         18.7   13000  0.4561           0.5092  0.1900
1.5919         19.42  13500  0.4778           0.5225  0.1975
1.5896         20.14  14000  0.4563           0.5098  0.1859
1.5589         20.86  14500  0.4362           0.4940  0.1725
1.5353         21.58  15000  0.4140           0.4826  0.1580
1.5441         22.3   15500  0.4031           0.4742  0.1550
1.5116         23.02  16000  0.3916           0.4748  0.1545
1.4731         23.74  16500  0.3841           0.4810  0.1542
1.4647         24.46  17000  0.3752           0.4524  0.1475
1.4328         25.18  17500  0.3587           0.4476  0.1461
1.4129         25.9   18000  0.3429           0.4242  0.1366
1.4062         26.62  18500  0.3450           0.4251  0.1355
1.3928         27.34  19000  0.3297           0.4145  0.1322
1.3906         28.06  19500  0.3210           0.4185  0.1336
1.358          28.78  20000  0.3131           0.3970  0.1275
1.3445         29.5   20500  0.3069           0.3920  0.1276
1.3159         30.22  21000  0.3035           0.3961  0.1255
1.3044         30.93  21500  0.2952           0.3854  0.1242
1.3034         31.65  22000  0.2966           0.3772  0.1227
1.2963         32.37  22500  0.2844           0.3706  0.1208
1.2765         33.09  23000  0.2841           0.3567  0.1173
1.2438         33.81  23500  0.2734           0.3552  0.1137
1.2487         34.53  24000  0.2703           0.3502  0.1118
1.2249         35.25  24500  0.2650           0.3484  0.1142
1.2229         35.97  25000  0.2584           0.3374  0.1097
1.2374         36.69  25500  0.2568           0.3337  0.1095
1.2153         37.41  26000  0.2494           0.3327  0.1071
1.1925         38.13  26500  0.2518           0.3366  0.1077
1.1908         38.85  27000  0.2437           0.3272  0.1057
1.1858         39.57  27500  0.2396           0.3265  0.1044
1.1808         40.29  28000  0.2373           0.3156  0.1028
1.1842         41.01  28500  0.2356           0.3152  0.1026
1.1668         41.73  29000  0.2319           0.3188  0.1025
1.1448         42.45  29500  0.2293           0.3099  0.0995
1.1327         43.17  30000  0.2265           0.3047  0.0979
1.1307         43.88  30500  0.2222           0.3078  0.0989
1.1419         44.6   31000  0.2215           0.3038  0.0981
1.1231         45.32  31500  0.2193           0.3013  0.0972
1.139          46.04  32000  0.2162           0.3007  0.0968
1.1114         46.76  32500  0.2122           0.2982  0.0960
1.111          47.48  33000  0.2125           0.2946  0.0948
1.0982         48.2   33500  0.2099           0.2957  0.0953
1.109          48.92  34000  0.2092           0.2955  0.0955
1.0905         49.64  34500  0.2088           0.2954  0.0953
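The log above shows that the lowest WER and the lowest validation loss fall on different steps. WER bottoms out at 0.2946 at step 33000, while validation loss keeps improving to 0.2088 at the final step 34500. A small sketch of scanning logged rows for the best checkpoint by a chosen metric follows. It hard-codes only the last five rows of the table, and `LogRow` and `best_row` are hypothetical names, not part of any library.

```python
from typing import NamedTuple


class LogRow(NamedTuple):
    train_loss: float
    epoch: float
    step: int
    val_loss: float
    wer: float  # fraction, as in the table
    cer: float  # fraction, as in the table


# Last five rows of the training results table above.
ROWS = [
    LogRow(1.1114, 46.76, 32500, 0.2122, 0.2982, 0.0960),
    LogRow(1.1110, 47.48, 33000, 0.2125, 0.2946, 0.0948),
    LogRow(1.0982, 48.20, 33500, 0.2099, 0.2957, 0.0953),
    LogRow(1.1090, 48.92, 34000, 0.2092, 0.2955, 0.0955),
    LogRow(1.0905, 49.64, 34500, 0.2088, 0.2954, 0.0953),
]


def best_row(rows, metric):
    """Return the logged row with the lowest value of `metric` (e.g. 'wer' or 'val_loss')."""
    return min(rows, key=lambda row: getattr(row, metric))


if __name__ == "__main__":
    print(best_row(ROWS, "wer").step)       # 33000
    print(best_row(ROWS, "val_loss").step)  # 34500
```

Which metric to select on is a design choice. The differences between these late checkpoints are small (under 0.001 WER).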

Disclaimer

Be aware that biases in the pre-training datasets may carry over into the results of this model.

Authors

Wav2Vec2 XLS-R 300M Korean was trained and evaluated by Wilson Wongso. All computation and development were done on OVH Cloud.

Framework versions

  • Transformers 4.17.0.dev0
  • Pytorch 1.10.2+cu102
  • Datasets 1.18.2.dev0
  • Tokenizers 0.10.3