arijitx committed on
Commit
fb235fd
1 Parent(s): ccfa007

Update README.md

  1. README.md +90 -19
README.md CHANGED
@@ -1,29 +1,100 @@
1
  ---
2
- language: Bengali
3
- datasets:
4
- - OpenSLR
5
- metrics:
6
- - wer
7
- tags:
8
  - bn
9
- - audio
 
10
  - automatic-speech-recognition
11
- - speech
12
  - robust-speech-event
13
- license: cc-by-sa-4.0
14
  model-index:
15
- - name: XLSR Wav2Vec2 Bengali by Arijit
16
  results:
17
- - task:
18
- name: Speech Recognition
19
  type: automatic-speech-recognition
 
20
  dataset:
21
- name: OpenSLR
22
- type: OpenSLR
23
- args: ben
24
  metrics:
25
- - name: Test WER
26
- type: wer
27
- value: 32.45
28
  ---
29
- # wav2vec2-xls-r-300m-bangla
1
  ---
2
+ language:
3
  - bn
4
+ license: apache-2.0
5
+ tags:
6
  - automatic-speech-recognition
7
+ - openslr_SLR53
8
  - robust-speech-event
9
+ - bn
10
+ datasets:
11
+ - openslr
12
+ - SLR53
13
+ - AI4Bharat/IndicCorp
14
+ metrics:
15
+ - wer
16
+ - cer
17
  model-index:
18
+ - name: arijitx/wav2vec2-xls-r-300m-bengali
19
  results:
20
+ - task:
 
21
  type: automatic-speech-recognition
22
+ name: Speech Recognition
23
  dataset:
24
+ type: openslr
25
+ name: Open SLR
26
+ args: SLR53
27
  metrics:
28
+ - type: wer # Required. Example: wer
29
+ value: 0.21726385291857586 # Required. Example: 20.90
30
+ name: Test WER # Optional. Example: Test WER
31
+ - type: cer
32
+ value: 0.04725010353701041
33
+ name: Test CER
34
+ - type: wer # Required. Example: wer
35
+ value: 0.15322879016421437 # Required. Example: 20.90
36
+ name: Test WER with lm # Optional. Example: Test WER
37
+ - type: cer
38
+ value: 0.03413696666806267
39
+ name: Test CER with lm
40
  ---
41
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the OpenSLR SLR53 Bengali dataset.
42
+ It achieves the following results on the evaluation set.
43
+
44
+ Without a language model:
45
+ - WER: 0.21726385291857586
46
+ - CER: 0.04725010353701041
47
+
48
+ With a 5-gram language model trained on 30M sentences randomly sampled from the [AI4Bharat IndicCorp](https://indicnlp.ai4bharat.org/corpora/) dataset:
49
+ - WER: 0.15322879016421437
50
+ - CER: 0.03413696666806267
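The WER and CER figures above are edit-distance ratios. The sketch below is a minimal pure-Python illustration of how such scores can be computed for a single utterance and how the relative gain from the language model follows from the reported numbers. The helper names (`edit_distance`, `wer`, `cer`, `relative_reduction`) are hypothetical and not taken from the evaluation script; corpus-level scores are normally aggregated over all utterances rather than averaged per sentence, and references are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Fraction of the original error removed."""
    return (before - after) / before


# Values copied from the results reported above.
wer_gain = relative_reduction(0.21726385291857586, 0.15322879016421437)
cer_gain = relative_reduction(0.04725010353701041, 0.03413696666806267)
print(f"relative WER reduction with LM: {wer_gain:.1%}")
print(f"relative CER reduction with LM: {cer_gain:.1%}")
```

Both scores are reported here as fractions (0.217 rather than 21.7), so they should be multiplied by 100 before comparing against cards that report percentages.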
51
+
52
+
53
+
54
+ Note: The evaluation set contains 10935 examples, the last 5% of the data, and none of them were used for training; training used the first 95%. Training was stopped after 180k steps. Output predictions are available in the files section of this repository.
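A positional split like the one described in the note can be sketched as follows. This is only an illustration under assumptions: the function name `split_first_last` is hypothetical, the rounding rule (integer division) is a guess, and the actual training script may have produced the split differently.

```python
def split_first_last(examples, eval_percent=5):
    """Keep the first (100 - eval_percent)% for training, the rest for evaluation."""
    n_eval = len(examples) * eval_percent // 100  # assumed rounding: floor
    cut = len(examples) - n_eval
    return examples[:cut], examples[cut:]


# Toy data standing in for the dataset rows.
examples = list(range(1000))
train, evaluation = split_first_last(examples)
print(len(train), len(evaluation))
```

Because the split is by position rather than random, it is reproducible without a seed, but it assumes the dataset order carries no bias such as speaker grouping.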
55
+
56
+ ### Training hyperparameters
57
+
58
+ The following hyperparameters were used during training:
59
+
60
+ - dataset_name="openslr"
61
+ - model_name_or_path="facebook/wav2vec2-xls-r-300m"
62
+ - dataset_config_name="SLR53"
63
+ - output_dir="./wav2vec2-xls-r-300m-bengali"
64
+ - overwrite_output_dir
65
+ - num_train_epochs="50"
66
+ - per_device_train_batch_size="32"
67
+ - per_device_eval_batch_size="32"
68
+ - gradient_accumulation_steps="1"
69
+ - learning_rate="7.5e-5"
70
+ - warmup_steps="2000"
71
+ - length_column_name="input_length"
72
+ - evaluation_strategy="steps"
73
+ - text_column_name="sentence"
74
+ - chars_to_ignore , ? . ! \- \; \: \" “ % ‘ ” � — ’ … –
75
+ - save_steps="2000"
76
+ - eval_steps="3000"
77
+ - logging_steps="100"
78
+ - layerdrop="0.0"
79
+ - activation_dropout="0.1"
80
+ - save_total_limit="3"
81
+ - freeze_feature_encoder
82
+ - feat_proj_dropout="0.0"
83
+ - mask_time_prob="0.75"
84
+ - mask_time_length="10"
85
+ - mask_feature_prob="0.25"
86
+ - mask_feature_length="64"
87
+ - preprocessing_num_workers="32"
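The `chars_to_ignore` setting lists punctuation that is stripped from transcripts before training. A minimal sketch of that kind of cleaning is shown below; the names `CHARS_TO_IGNORE` and `clean_transcript` are hypothetical, and the real preprocessing in the training script may differ, for example in lowercasing or whitespace handling.

```python
import re

# Characters copied from the chars_to_ignore hyperparameter above.
CHARS_TO_IGNORE = [",", "?", ".", "!", "-", ";", ":", '"', "“", "%",
                   "‘", "”", "�", "—", "’", "…", "–"]

# re.escape keeps characters such as "-" literal inside the character class.
CHARS_TO_IGNORE_RE = re.compile(
    "[" + "".join(re.escape(c) for c in CHARS_TO_IGNORE) + "]"
)


def clean_transcript(text):
    """Remove the ignored characters and trim surrounding whitespace."""
    return CHARS_TO_IGNORE_RE.sub("", text).strip()


print(clean_transcript("আমি, ভাত খাই!"))
```

Note that the Bengali danda (`।`) is not in the list, so under this sketch it would remain in the cleaned text.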
88
+
89
+ ### Framework versions
90
+
91
+ - Transformers 4.16.0.dev0
92
+ - Pytorch 1.10.1+cu102
93
+ - Datasets 1.17.1.dev0
94
+ - Tokenizers 0.11.0
95
+
96
+ ### Notes
97
+ - Training and evaluation code was adapted from https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event
98
+ - Bengali speech data was not available in the Common Voice or Multilingual LibriSpeech datasets, so OpenSLR53 was used.
99
+ - A minimum audio duration of 0.5 s was used to filter the training data, which excluded roughly 10-20 samples.
100
+ - OpenSLR53 transcripts were *not* used to train the language model used for evaluation.
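The minimum-duration filter mentioned in the notes can be illustrated with a short sketch. It assumes 16 kHz audio and that each example stores its length in samples under `input_length` (the `length_column_name` listed above); the function name `long_enough`, the toy clips, and the inclusive comparison are assumptions rather than details confirmed by the training script.

```python
SAMPLING_RATE = 16_000   # assumption: audio resampled to 16 kHz
MIN_DURATION_S = 0.5     # threshold stated in the notes


def long_enough(num_samples, sampling_rate=SAMPLING_RATE,
                min_duration_s=MIN_DURATION_S):
    """True if a clip lasts at least min_duration_s seconds."""
    return num_samples / sampling_rate >= min_duration_s


# Hypothetical clips: 0.25 s, exactly 0.5 s, and 2 s long.
clips = [
    {"id": "a", "input_length": 4_000},
    {"id": "b", "input_length": 8_000},
    {"id": "c", "input_length": 32_000},
]
kept = [clip for clip in clips if long_enough(clip["input_length"])]
print([clip["id"] for clip in kept])
```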