---
language:
- bn
license: apache-2.0
tags:
- automatic-speech-recognition
- bn
- hf-asr-leaderboard
- openslr_SLR53
- robust-speech-event
datasets:
- openslr
- SLR53
- AI4Bharat/IndicCorp
metrics:
- wer
- cer
model-index:
- name: arijitx/wav2vec2-xls-r-300m-bengali
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: openslr
      name: Open SLR
      args: SLR53
    metrics:
    - type: wer
      value: 0.21726385291857586
      name: Test WER
    - type: cer
      value: 0.04725010353701041
      name: Test CER
    - type: wer
      value: 0.15322879016421437
      name: Test WER with lm
    - type: cer
      value: 0.03413696666806267
      name: Test CER with lm
---
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the OPENSLR_SLR53 - bengali dataset.
It achieves the following results on the evaluation set. 

Without language model:
- WER: 0.21726385291857586
- CER: 0.04725010353701041

With a 5-gram language model trained on 30M sentences randomly chosen from the [AI4Bharat IndicCorp](https://indicnlp.ai4bharat.org/corpora/) dataset:
- WER: 0.15322879016421437
- CER: 0.03413696666806267
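
For quick checks, transcription without the LM (the first set of scores) can look like the following. This is a minimal sketch using torchaudio for I/O; `sample.wav` is a placeholder for any mono speech recording and is resampled to the 16 kHz the model expects:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "arijitx/wav2vec2-xls-r-300m-bengali"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# "sample.wav" is a placeholder path; the model expects 16 kHz mono speech.
speech, sr = torchaudio.load("sample.wav")
speech = torchaudio.functional.resample(speech.squeeze(0), sr, 16_000)

inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding, i.e. the "without language model" numbers above.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```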


Note: the evaluation set contains 10,935 examples (the last 5% of the data) that were not part of training; training used the first 95%. Training was stopped after 180k steps. Output predictions are available under the files section.
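
The LM-boosted scores come from beam-search decoding with the 5-gram model. If the repository ships the LM files alongside the processor (as robust-speech-event checkpoints typically do), `Wav2Vec2ProcessorWithLM` can apply it; a sketch, assuming `pyctcdecode` and `kenlm` are installed:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

MODEL_ID = "arijitx/wav2vec2-xls-r-300m-bengali"
# Assumes the repo includes the 5-gram LM; requires pyctcdecode and kenlm.
processor = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

speech, sr = torchaudio.load("sample.wav")  # placeholder path
speech = torchaudio.functional.resample(speech.squeeze(0), sr, 16_000)

inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Unlike greedy decoding, the LM decoder consumes raw logits, not argmax ids.
print(processor.batch_decode(logits.numpy()).text[0])
```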

### Training hyperparameters

The following hyperparameters were used during training:

- dataset_name="openslr"  	
- model_name_or_path="facebook/wav2vec2-xls-r-300m"  	
- dataset_config_name="SLR53"  	
- output_dir="./wav2vec2-xls-r-300m-bengali"  	
- overwrite_output_dir  	
- num_train_epochs="50"  	
- per_device_train_batch_size="32"  	
- per_device_eval_batch_size="32"  	
- gradient_accumulation_steps="1"  	
- learning_rate="7.5e-5"  	
- warmup_steps="2000"  	
- length_column_name="input_length"  	
- evaluation_strategy="steps"  	
- text_column_name="sentence"  	
- chars_to_ignore , ? . ! \- \; \: \" “ % ‘ ” � — ’ … – (see the normalization sketch after this list)  	
- save_steps="2000"  	
- eval_steps="3000"  	
- logging_steps="100"  	
- layerdrop="0.0"  	
- activation_dropout="0.1"  	
- save_total_limit="3"  	
- freeze_feature_encoder  	
- feat_proj_dropout="0.0"  	
- mask_time_prob="0.75"  	
- mask_time_length="10"  	
- mask_feature_prob="0.25"  	
- mask_feature_length="64"      
- preprocessing_num_workers="32"  
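
For illustration, the `chars_to_ignore` flag corresponds to a transcript normalization step roughly like the one below. This is a sketch; the actual training script may differ in details:

```python
import re

# Punctuation stripped from transcripts, per the chars_to_ignore flag above.
chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�—’…–]'

def normalize_text(sentence: str) -> str:
    # .lower() mirrors the reference script; it is a no-op for Bengali script.
    return re.sub(chars_to_ignore_regex, "", sentence).lower() + " "

print(normalize_text("ভালো আছি, ধন্যবাদ!"))  # -> "ভালো আছি ধন্যবাদ "
```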

### Framework versions

- Transformers 4.16.0.dev0
- Pytorch 1.10.1+cu102
- Datasets 1.17.1.dev0
- Tokenizers 0.11.0

### Notes
- Training and evaluation code was adapted from https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event.
- Bengali speech data was not available in the Common Voice or LibriSpeech multilingual datasets, so OpenSLR SLR53 was used.
- A minimum audio duration of 0.5 s was used to filter the training data, which excluded roughly 10-20 samples (see the filter sketch below).
- OpenSLR SLR53 transcripts were *not* part of the training data for the LM used in evaluation.
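
A sketch of the 0.5 s duration filter mentioned above, assuming the `datasets` loader for `openslr`/`SLR53` with a decoded `audio` column:

```python
from datasets import load_dataset

ds = load_dataset("openslr", "SLR53", split="train")

def long_enough(example):
    # Keep only clips of at least 0.5 s, as described in the notes above.
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= 0.5

ds = ds.filter(long_enough)
```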