Model description

This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on my collection of public Japanese voice datasets for research: Common Voice 7.0, JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo), JSSS (Japanese speech corpus for summarization and simplification), and CSS10 (a collection of single speaker speech datasets). The preprocessed dataset is available at VUMICHIEN/COMMON_VOICE_LARGE_JSUT_JSSS_CSS10.

Total training data:

~60 hours

Benchmark WER result (%):

| | Common Voice 7.0 | Common Voice 8.0 |
|---|---|---|
| without LM | 10.96 | 10.91 |
| with 4-gram LM | 7.98 | 7.88 |

Benchmark CER result (%):

| | Common Voice 7.0 | Common Voice 8.0 |
|---|---|---|
| without LM | 4.28 | 4.22 |
| with 4-gram LM | 3.42 | 3.35 |
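
For readers unfamiliar with the two metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, counted over words for WER and over characters for CER. The sketch below is a minimal pure-Python illustration of that definition, not the code in eval.py. The function names and the example sentence are hypothetical, and the word segmentation is written by hand here (the evaluation command in this card installs mecab-python3, presumably to segment Japanese text into words).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example, segmented by hand (not taken from any dataset).
ref_words = ["今日", "は", "いい", "天気", "です"]
hyp_words = ["今日", "は", "いい", "天気", "でした"]

wer = error_rate(ref_words, hyp_words)                    # word level
cer = error_rate("".join(ref_words), "".join(hyp_words))  # character level
print(f"WER = {wer:.4f}, CER = {cer:.4f}")
```

Multiplying these ratios by 100 gives percentages of the kind reported in the benchmark tables.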

Evaluation

Please use the eval.py file to run the evaluation:

pip install mecab-python3 unidic-lite pykakasi
python eval.py --model_id vumichien/wav2vec2-xls-r-1b-japanese --dataset mozilla-foundation/common_voice_7_0 --config ja --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
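
The --chunk_length_s and --stride_length_s flags control how long audio is split before decoding. Assuming eval.py forwards them to the Transformers speech recognition pipeline, each chunk is 5 s long and overlaps its neighbours by 1 s on each side, and the predictions in the overlapping margins are discarded when the chunks are stitched together. The sketch below illustrates that windowing in seconds. It is a simplified approximation for intuition only (the real pipeline works on sample indices), and the function name chunk_windows is hypothetical.

```python
def chunk_windows(total_s, chunk_s=5.0, stride_s=1.0):
    """Return (start, end, keep_from, keep_to) tuples, all in seconds.

    start/end delimit the audio fed to the model; keep_from/keep_to delimit
    the part of the prediction that is kept after the margins are dropped.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than 2 * stride_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        is_first = start == 0.0
        is_last = start + chunk_s >= total_s
        # The first chunk has no left margin, the last has no right margin.
        keep_from = start if is_first else start + stride_s
        keep_to = end if is_last else end - stride_s
        windows.append((start, end, keep_from, keep_to))
        if is_last:
            break
        start += step
    return windows


# A 12-second clip with the values used in the command above.
for window in chunk_windows(12.0):
    print(window)
```

With these settings the window advances by 3 s per chunk, and the kept regions tile the clip without gaps or overlap.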

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 100.0
  • mixed_precision_training: Native AMP
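
Two of the values above follow from the others, and a short sketch may make the relationship concrete. The total batch size is the per-device batch size times the gradient accumulation steps (16 × 4 = 64, which assumes a single device), and a linear scheduler with warmup ramps the learning rate from 0 to 5e-05 over the first 1000 steps, then decays it linearly to 0. The total step count is not stated in this card; the 44,500 used below is an estimate derived from the training results table (step 43500 at epoch 97.75, roughly 445 steps per epoch), and the function name linear_lr is hypothetical.

```python
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 4
# Assumes one device; matches total_train_batch_size above.
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_lr(step, base_lr=5e-5, warmup_steps=1000, total_steps=44500):
    """Learning rate at a given optimizer step for linear warmup and decay.

    total_steps is an estimate (about 445 steps per epoch x 100 epochs).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


print("effective batch size:", effective_batch)
for step in (0, 500, 1000, 22750, 44500):
    print(step, linear_lr(step))
```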

Training results

| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|---|---|---|---|---|---|
| 2.2896 | 3.37 | 1500 | 0.4748 | 0.4013 | 0.1767 |
| 1.1608 | 6.74 | 3000 | 0.3350 | 0.3159 | 0.1456 |
| 1.1042 | 10.11 | 4500 | 0.3119 | 0.2971 | 0.1400 |
| 1.0494 | 13.48 | 6000 | 0.2974 | 0.2867 | 0.1353 |
| 1.0061 | 16.85 | 7500 | 0.2802 | 0.2746 | 0.1300 |
| 0.9629 | 20.22 | 9000 | 0.2844 | 0.2776 | 0.1326 |
| 0.9267 | 23.59 | 10500 | 0.2577 | 0.2603 | 0.1255 |
| 0.8984 | 26.96 | 12000 | 0.2508 | 0.2531 | 0.1226 |
| 0.8729 | 30.34 | 13500 | 0.2629 | 0.2606 | 0.1254 |
| 0.8546 | 33.71 | 15000 | 0.2402 | 0.2447 | 0.1193 |
| 0.8304 | 37.08 | 16500 | 0.2532 | 0.2472 | 0.1209 |
| 0.8075 | 40.45 | 18000 | 0.2439 | 0.2469 | 0.1198 |
| 0.7827 | 43.82 | 19500 | 0.2387 | 0.2372 | 0.1167 |
| 0.7627 | 47.19 | 21000 | 0.2344 | 0.2331 | 0.1147 |
| 0.7402 | 50.56 | 22500 | 0.2314 | 0.2299 | 0.1135 |
| 0.718 | 53.93 | 24000 | 0.2257 | 0.2267 | 0.1114 |
| 0.7016 | 57.3 | 25500 | 0.2204 | 0.2184 | 0.1089 |
| 0.6804 | 60.67 | 27000 | 0.2227 | 0.2181 | 0.1085 |
| 0.6625 | 64.04 | 28500 | 0.2138 | 0.2112 | 0.1058 |
| 0.6465 | 67.42 | 30000 | 0.2141 | 0.2081 | 0.1044 |
| 0.6238 | 70.79 | 31500 | 0.2172 | 0.2082 | 0.1050 |
| 0.6062 | 74.16 | 33000 | 0.2174 | 0.2058 | 0.1043 |
| 0.588 | 77.53 | 34500 | 0.2156 | 0.2034 | 0.1027 |
| 0.5722 | 80.9 | 36000 | 0.2162 | 0.2032 | 0.1029 |
| 0.5585 | 84.27 | 37500 | 0.2156 | 0.2022 | 0.1021 |
| 0.5456 | 87.64 | 39000 | 0.2126 | 0.1993 | 0.1009 |
| 0.5325 | 91.01 | 40500 | 0.2121 | 0.1966 | 0.1003 |
| 0.5229 | 94.38 | 42000 | 0.2104 | 0.1941 | 0.0991 |
| 0.5134 | 97.75 | 43500 | 0.2108 | 0.1948 | 0.0992 |

Framework versions

  • Transformers 4.16.0.dev0
  • Pytorch 1.10.1+cu102
  • Datasets 1.17.1.dev0
  • Tokenizers 0.11.0

Model size

963M params (Safetensors, tensor type F32)