pere committed on
Commit
62b28aa
1 Parent(s): 45ac184

Update README.md

  1. README.md +118 -117
README.md CHANGED
@@ -1,117 +1,118 @@
- ---
- license: apache-2.0
- tags:
- - automatic-speech-recognition
- - NbAiLab/NPSC
- - no
- - nb
- - nb-NO
- datasets:
- - NbAiLab/NPSC
- language:
- - nb-NO
- model-index:
- - name: nb-wav2vec2-1b-bokmaal
- results:
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: NPSC
- type: NbAiLab/NPSC
- args: 16K_mp3_bokmaal
- metrics:
- - name: Test (Bokmål) WER
- type: wer
- value: 0.0633
- - name: Test (Bokmål) CER
- type: cer
- value: 0.0248
- ---
-
- # Norwegian Wav2Vec2 Model - 1B Bokmål
- This model is finetuned on top of feature extractor [XLS-R](https://huggingface.co/facebook/wav2vec2-xls-r-1b) from Facebook/Meta. The finetuned model achieves the following results on the test set with a 5-gram KenLM. The numbers in parentheses are the results without the language model:
- - **WER: 0.0633** (0.0738)
- - **CER: 0.0248** (0.0263)
-
- ## Model description
- This is one of several Wav2Vec-models our team created during the 🤗 hosted [Robust Speech Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614?s=09). This is the complete list of our models and their final scores:
-
- | Model | Final WER | |
- |:--------------|:------------|:------------:|
- | NbAiLab/nb-wav2vec2-1b-bokmaal (this model) | 6.33 | |
- | [NbAiLab/nb-wav2vec2-300m-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-bokmaal) | 7.03 | |
- | [NbAiLab/nb-wav2vec2-300m-nynorsk](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-nynorsk) | 12.22 | |
-
- ## Dataset
- In parallel with the event, the team also converted the [Norwegian Parliamentary Speech Corpus (NPSC)](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/) to the [NbAiLab/NPSC](https://huggingface.co/datasets/NbAiLab/NPSC) in 🤗 Dataset format and used that as the main source for training.
-
- ## Code
- We have released all the code developed during the event so that the Norwegian NLP community can build upon it when developing even better Norwegian ASR models. The finetuning of these models is not very computationally demanding. After following the instructions here, you should be able to train your own automatic speech recognition system in less than a day with an average GPU.
-
- ## Team
- The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
-
- ## Training procedure
- To reproduce these results, we strongly recommend that you follow the [instructions from 🤗](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#talks) to train a simple Swedish model.
-
- When you have verified that you are able to do this, create a fresh new repo. You can then start by copying the files ```run.sh``` and ```run_speech_recognition_ctc.py``` from our repo. Running these will create all the other necessary files, and should let you reproduce our results. With some tweaks to the hyperparameters, you might even be able to build an even better ASR. Good luck!
-
- ### Language Model
- As the scores indicate, adding even a simple 5-gram language will improve the results. 🤗 has provided another [very nice blog](https://huggingface.co/blog/wav2vec2-with-ngram) explaining how to add a 5-gram language model to improve the ASR model. You can build this from your own corpus, for instance by extracting some suitable text from the [Norwegian Colossal Corpus](https://huggingface.co/datasets/NbAiLab/NCC). You can also skip some of the steps in the guide, and copy the [5-gram model from this repo](https://huggingface.co/NbAiLab/XLSR-300M-bokmaal/tree/main/language_model).
-
-
- ### Parameters
- The final model was run using these parameters:
- ```
- --dataset_name="NbAiLab/NPSC"
- --model_name_or_path="facebook/wav2vec2-xls-r-1b"
- --dataset_config_name="16K_mp3_bokmaal"
- --output_dir="./"
- --overwrite_output_dir
- --num_train_epochs="40"
- --per_device_train_batch_size="12"
- --per_device_eval_batch_size="12"
- --gradient_accumulation_steps="2"
- --learning_rate="2e-5"
- --warmup_steps="2000"
- --length_column_name="input_length"
- --evaluation_strategy="steps"
- --text_column_name="text"
- --save_steps="500"
- --eval_steps="500"
- --logging_steps="100"
- --layerdrop="0.041"
- --attention_dropout="0.094"
- --activation_dropout="0.055"
- --hidden_dropout="0.047"
- --save_total_limit="3"
- --freeze_feature_encoder
- --feat_proj_dropout="0.04"
- --mask_time_prob="0.082"
- --mask_time_length="10"
- --mask_feature_prob="0.25"
- --mask_feature_length="64"
- --gradient_checkpointing
- --min_duration_in_seconds="0.5"
- --max_duration_in_seconds="30.0"
- --ctc_zero_infinity=True
- --use_auth_token
- --seed="42"
- --fp16
- --group_by_length
- --do_train --do_eval
- --push_to_hub
- --preprocessing_num_workers="16"
- ```
-
- Using these settings, the training might take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters.
-
- | Parameter| Comment |
- |:-------------|:-----|
- | per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system |
- |gradient_accumulation_steps |Can be adjusted even further up to increase batch size and speed up training without running into memory issues |
- | learning_rate|Can be increased, maybe as high as 1e-4. Speeds up training but might add instability |
- | epochs| Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs|
-
-
+ ---
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - NbAiLab/NPSC
+ - no
+ - nb
+ - nb-NO
+ datasets:
+ - NbAiLab/NPSC
+ language:
+ - nb
+ - no
+ model-index:
+ - name: nb-wav2vec2-1b-bokmaal
+ results:
+ - task:
+ name: Automatic Speech Recognition
+ type: automatic-speech-recognition
+ dataset:
+ name: NPSC
+ type: NbAiLab/NPSC
+ args: 16K_mp3_bokmaal
+ metrics:
+ - name: Test (Bokmål) WER
+ type: wer
+ value: 0.0633
+ - name: Test (Bokmål) CER
+ type: cer
+ value: 0.0248
+ ---
+
+ # Norwegian Wav2Vec2 Model - 1B Bokmål
+ This model is finetuned on top of feature extractor [XLS-R](https://huggingface.co/facebook/wav2vec2-xls-r-1b) from Facebook/Meta. The finetuned model achieves the following results on the test set with a 5-gram KenLM. The numbers in parentheses are the results without the language model:
+ - **WER: 0.0633** (0.0738)
+ - **CER: 0.0248** (0.0263)
+
+ ## Model description
+ This is one of several Wav2Vec-models our team created during the 🤗 hosted [Robust Speech Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614?s=09). This is the complete list of our models and their final scores:
+
+ | Model | Final WER | |
+ |:--------------|:------------|:------------:|
+ | NbAiLab/nb-wav2vec2-1b-bokmaal (this model) | 6.33 | |
+ | [NbAiLab/nb-wav2vec2-300m-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-bokmaal) | 7.03 | |
+ | [NbAiLab/nb-wav2vec2-300m-nynorsk](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-nynorsk) | 12.22 | |
+
+ ## Dataset
+ In parallel with the event, the team also converted the [Norwegian Parliamentary Speech Corpus (NPSC)](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/) to the [NbAiLab/NPSC](https://huggingface.co/datasets/NbAiLab/NPSC) in 🤗 Dataset format and used that as the main source for training.
+
+ ## Code
+ We have released all the code developed during the event so that the Norwegian NLP community can build upon it when developing even better Norwegian ASR models. The finetuning of these models is not very computationally demanding. After following the instructions here, you should be able to train your own automatic speech recognition system in less than a day with an average GPU.
+
+ ## Team
+ The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
+
+ ## Training procedure
+ To reproduce these results, we strongly recommend that you follow the [instructions from 🤗](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#talks) to train a simple Swedish model.
+
+ When you have verified that you are able to do this, create a fresh new repo. You can then start by copying the files ```run.sh``` and ```run_speech_recognition_ctc.py``` from our repo. Running these will create all the other necessary files, and should let you reproduce our results. With some tweaks to the hyperparameters, you might even be able to build an even better ASR. Good luck!
+
+ ### Language Model
+ As the scores indicate, adding even a simple 5-gram language will improve the results. 🤗 has provided another [very nice blog](https://huggingface.co/blog/wav2vec2-with-ngram) explaining how to add a 5-gram language model to improve the ASR model. You can build this from your own corpus, for instance by extracting some suitable text from the [Norwegian Colossal Corpus](https://huggingface.co/datasets/NbAiLab/NCC). You can also skip some of the steps in the guide, and copy the [5-gram model from this repo](https://huggingface.co/NbAiLab/XLSR-300M-bokmaal/tree/main/language_model).
+
+
+ ### Parameters
+ The final model was run using these parameters:
+ ```
+ --dataset_name="NbAiLab/NPSC"
+ --model_name_or_path="facebook/wav2vec2-xls-r-1b"
+ --dataset_config_name="16K_mp3_bokmaal"
+ --output_dir="./"
+ --overwrite_output_dir
+ --num_train_epochs="40"
+ --per_device_train_batch_size="12"
+ --per_device_eval_batch_size="12"
+ --gradient_accumulation_steps="2"
+ --learning_rate="2e-5"
+ --warmup_steps="2000"
+ --length_column_name="input_length"
+ --evaluation_strategy="steps"
+ --text_column_name="text"
+ --save_steps="500"
+ --eval_steps="500"
+ --logging_steps="100"
+ --layerdrop="0.041"
+ --attention_dropout="0.094"
+ --activation_dropout="0.055"
+ --hidden_dropout="0.047"
+ --save_total_limit="3"
+ --freeze_feature_encoder
+ --feat_proj_dropout="0.04"
+ --mask_time_prob="0.082"
+ --mask_time_length="10"
+ --mask_feature_prob="0.25"
+ --mask_feature_length="64"
+ --gradient_checkpointing
+ --min_duration_in_seconds="0.5"
+ --max_duration_in_seconds="30.0"
+ --ctc_zero_infinity=True
+ --use_auth_token
+ --seed="42"
+ --fp16
+ --group_by_length
+ --do_train --do_eval
+ --push_to_hub
+ --preprocessing_num_workers="16"
+ ```
+
+ Using these settings, the training might take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters.
+
+ | Parameter| Comment |
+ |:-------------|:-----|
+ | per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system |
+ |gradient_accumulation_steps |Can be adjusted even further up to increase batch size and speed up training without running into memory issues |
+ | learning_rate|Can be increased, maybe as high as 1e-4. Speeds up training but might add instability |
+ | epochs| Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs|
+
+
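
The tuning table in the README ties batch size to `per_device_train_batch_size` and `gradient_accumulation_steps`, and the reported scores compare WER with and without the 5-gram language model. The following is a minimal sketch of that arithmetic, not part of the commit. The helper names `parse_flags`, `effective_batch_size`, and `relative_reduction` are hypothetical. The `num_devices` argument is also an assumption, since the README does not say how many GPUs were used.

```python
import shlex

# A subset of the flags listed under "Parameters" in the README.
FLAGS = '''
--num_train_epochs="40"
--per_device_train_batch_size="12"
--gradient_accumulation_steps="2"
--learning_rate="2e-5"
--fp16
'''


def parse_flags(text):
    """Turn `--key="value"` and bare `--switch` tokens into a dict.

    Bare switches (no `=`) map to True; all other values stay strings.
    """
    parsed = {}
    for token in shlex.split(text):
        key, sep, value = token.lstrip("-").partition("=")
        parsed[key] = value if sep else True
    return parsed


def effective_batch_size(params, num_devices=1):
    """Samples that contribute to one optimizer step.

    num_devices is a hypothetical argument; the README does not state it.
    """
    return (
        int(params["per_device_train_batch_size"])
        * int(params["gradient_accumulation_steps"])
        * num_devices
    )


def relative_reduction(before, after):
    """Fractional improvement when an error rate drops from before to after."""
    return (before - after) / before


params = parse_flags(FLAGS)
print("effective batch size:", effective_batch_size(params))
# WER without the language model (0.0738) versus with it (0.0633).
print("relative WER reduction:", round(relative_reduction(0.0738, 0.0633), 3))
```

With the listed values, one optimizer step on a single device sees 12 × 2 = 24 samples, so raising either flag as the table suggests scales that product directly.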