poonehmousavi committed
Commit bb52bb9
1 Parent(s): eac28d8

Upload 4 files

Files changed (4):
  1. README.md +40 -22
  2. config.json +68 -75
  3. hyperparams.yaml +51 -79
  4. preprocessor_config.json +7 -8
README.md CHANGED
@@ -1,44 +1,61 @@
  ---
- language: "en"
- thumbnail:
  pipeline_tag: automatic-speech-recognition
  tags:
  - CTC
- - Attention
  - pytorch
  - speechbrain
  - Transformer
- license: "apache-2.0"
  datasets:
- - commonvoice
  metrics:
  - wer
  - cer
  ---

  <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
  <br/><br/>

- # wav2vec 2.0 with CTC/Attention trained on CommonVoice Italian (No LM)

  This repository provides all the necessary tools to perform automatic speech
- recognition from an end-to-end system pretrained on CommonVoice (Italian Language) within
  SpeechBrain. For a better experience, we encourage you to learn more about
- [SpeechBrain](https://speechbrain.github.io).

  The performance of the model is the following:

  | Release | Test CER | Test WER | GPUs |
- |:--------------:|:--------------:|:--------------:| :--------:|
- | 03-06-21 | 2.38 | 8.38 | 1xV100 32GB |

  ## Pipeline description

  This ASR system is composed of 2 different but linked blocks:
- - Tokenizer (unigram) that transforms words into subword units and trained with
- the train transcriptions (train.tsv) of CommonVoice (EN).
- - Acoustic model (wav2vec2.0 + CTC/Attention). A pretrained wav2vec 2.0 model ([facebook/wav2vec2-large-it-voxpopuli](https://huggingface.co/facebook/wav2vec2-large-it-voxpopuli)) is combined with two DNN layers and finetuned on CommonVoice En.
- The obtained final acoustic representation is given to the CTC and attention decoders.

  The system is trained with recordings sampled at 16kHz (single channel).
  The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *transcribe_file* if needed.
@@ -54,13 +71,13 @@ pip install speechbrain transformers
  Please notice that we encourage you to read our tutorials and learn more about
  [SpeechBrain](https://speechbrain.github.io).

- ### Transcribing your own audio files (in Italian)

  ```python
- from speechbrain.pretrained import EncoderDecoderASR

- asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-14-it", savedir="pretrained_models/asr-wav2vec2-commonvoice-14-it")
- asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-14-it/example-it.wav")

  ```
  ### Inference on GPU
@@ -85,15 +102,16 @@ pip install -e .

  3. Run Training:
  ```bash
- cd recipes/CommonVoice/ASR/seq2seq
- python train_with_wav2vec.py hparams/train_it_with_wav2vec.yaml --data_folder=your_data_folder
  ```

- You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1tjz6IZmVRkuRE97E7h1cXFoGTer7pT73?usp=sharing).

  ### Limitations
  The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

  # **About SpeechBrain**
  - Website: https://speechbrain.github.io/
  - Code: https://github.com/speechbrain/speechbrain/
@@ -113,4 +131,4 @@ Please, cite SpeechBrain if you use it for your research or business.
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
  }
- ```

  ---
+ language:
+ - en
+ thumbnail: null
  pipeline_tag: automatic-speech-recognition
  tags:
  - CTC
  - pytorch
  - speechbrain
  - Transformer
+ license: apache-2.0
  datasets:
+ - commonvoice.14.0
  metrics:
  - wer
  - cer
+ model-index:
+ - name: asr-wav2vec2-commonvoice-14-en
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: CommonVoice Corpus 14.0 (English)
+       type: mozilla-foundation/common_voice_14.0
+       config: en
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: '16.68'
  ---

  <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
  <br/><br/>

+ # wav2vec 2.0 with CTC trained on CommonVoice English (No LM)

  This repository provides all the necessary tools to perform automatic speech
+ recognition from an end-to-end system pretrained on CommonVoice (English Language) within
  SpeechBrain. For a better experience, we encourage you to learn more about
+ [SpeechBrain](https://speechbrain.github.io).

  The performance of the model is the following:

  | Release | Test CER | Test WER | GPUs |
+ |:-------------:|:--------------:|:--------------:| :--------:|
+ | 15-08-23 | 7.92 | 16.86 | 1xV100 32GB |

  ## Pipeline description

  This ASR system is composed of 2 different but linked blocks:
+ - Tokenizer (unigram) that transforms words into subword units and is trained with
+ the train transcriptions (train.tsv) of CommonVoice (en).
+ - Acoustic model (wav2vec2.0 + CTC). A pretrained wav2vec 2.0 model ([wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)) is combined with two DNN layers and finetuned on CommonVoice (en).
+ The obtained final acoustic representation is given to the CTC decoder.

  The system is trained with recordings sampled at 16kHz (single channel).
  The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *transcribe_file* if needed.
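
*transcribe_file* takes care of this normalization for you; if you prefer to preprocess manually, here is a minimal sketch of the equivalent steps (assuming `torchaudio` is installed; the file name is hypothetical):

```python
import torchaudio

# Load an arbitrary recording (any sample rate, any channel count).
signal, sr = torchaudio.load("my_recording.wav")  # hypothetical file

# Downmix to mono, then resample to the 16 kHz the model was trained on.
signal = signal.mean(dim=0, keepdim=True)
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
signal = resampler(signal)
```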
 
  Please notice that we encourage you to read our tutorials and learn more about
  [SpeechBrain](https://speechbrain.github.io).

+ ### Transcribing your own audio files (in English)

  ```python
+ from speechbrain.pretrained import EncoderASR

+ asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-14-en", savedir="pretrained_models/asr-wav2vec2-commonvoice-14-en")
+ asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-14-en/example-en.wav")

  ```
  ### Inference on GPU
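
To run this model on the GPU, pass `run_opts` when loading it; a minimal sketch, assuming a CUDA device is available:

```python
from speechbrain.pretrained import EncoderASR

# run_opts moves the model (and subsequent inference) to the GPU.
asr_model = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-14-en",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-14-en",
    run_opts={"device": "cuda"},
)
```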
 
  3. Run Training:
  ```bash
+ cd recipes/CommonVoice/ASR/CTC/
+ python train_with_wav2vec.py hparams/train_en_with_wav2vec.yaml --data_folder=your_data_folder
  ```

+ You can find our training results (models, logs, etc.) [here](https://www.dropbox.com/sh/ch10cnbhf1faz3w/AACdHFG65LC6582H0Tet_glTa?dl=0).

  ### Limitations
  The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

+
  # **About SpeechBrain**
  - Website: https://speechbrain.github.io/
  - Code: https://github.com/speechbrain/speechbrain/

  primaryClass={eess.AS},
  note={arXiv:2106.04624}
  }
+ ```
config.json CHANGED
@@ -1,76 +1,69 @@
  {
-   "speechbrain_interface": "EncoderDecoderASR",
-   "activation_dropout": 0.0,
-   "apply_spec_augment": true,
-   "architectures": [
-     "Wav2Vec2Model"
-   ],
-   "attention_dropout": 0.1,
-   "bos_token_id": 1,
-   "conv_bias": true,
-   "conv_dim": [
-     512,
-     512,
-     512,
-     512,
-     512,
-     512,
-     512
-   ],
-   "conv_kernel": [
-     10,
-     3,
-     3,
-     3,
-     3,
-     2,
-     2
-   ],
-   "conv_stride": [
-     5,
-     2,
-     2,
-     2,
-     2,
-     2,
-     2
-   ],
-   "ctc_loss_reduction": "sum",
-   "ctc_zero_infinity": false,
-   "do_stable_layer_norm": true,
-   "eos_token_id": 2,
-   "feat_extract_activation": "gelu",
-   "feat_extract_dropout": 0.0,
-   "feat_extract_norm": "layer",
-   "feat_proj_dropout": 0.1,
-   "final_dropout": 0.0,
-   "gradient_checkpointing": false,
-   "hidden_act": "gelu",
-   "hidden_dropout": 0.1,
-   "hidden_size": 1024,
-   "initializer_range": 0.02,
-   "intermediate_size": 4096,
-   "layer_norm_eps": 1e-05,
-   "layerdrop": 0.1,
-   "mask_channel_length": 10,
-   "mask_channel_min_space": 1,
-   "mask_channel_other": 0.0,
-   "mask_channel_prob": 0.0,
-   "mask_channel_selection": "static",
-   "mask_feature_length": 10,
-   "mask_feature_prob": 0.0,
-   "mask_time_length": 10,
-   "mask_time_min_space": 1,
-   "mask_time_other": 0.0,
-   "mask_time_prob": 0.075,
-   "mask_time_selection": "static",
-   "model_type": "wav2vec2",
-   "num_attention_heads": 16,
-   "num_conv_pos_embedding_groups": 16,
-   "num_conv_pos_embeddings": 128,
-   "num_feat_extract_layers": 7,
-   "num_hidden_layers": 24,
-   "pad_token_id": 0,
-   "transformers_version": "4.6.0.dev0",
-   "vocab_size": 32
- }

  {
+   "speechbrain_interface": "EncoderASR",
+   "activation_dropout": 0.1,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2Model"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "conv_bias": true,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "do_stable_layer_norm": true,
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_dropout": 0.0,
+   "feat_extract_norm": "layer",
+   "feat_proj_dropout": 0.1,
+   "final_dropout": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.1,
+   "mask_feature_length": 10,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_prob": 0.05,
+   "model_type": "wav2vec2",
+   "num_attention_heads": 16,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "transformers_version": "4.21.1",
+   "vocab_size": 32
+ }
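
The new config is a standard Hugging Face wav2vec 2.0 configuration (the `speechbrain_interface` key is SpeechBrain-specific; `transformers` tolerates extra keys). As a sanity check, it can be read back with `transformers`; a minimal sketch, assuming the library is installed:

```python
from transformers import Wav2Vec2Config

# Load the architecture hyperparameters shipped with this repo.
config = Wav2Vec2Config.from_pretrained("speechbrain/asr-wav2vec2-commonvoice-14-en")
print(config.hidden_size, config.num_hidden_layers)  # 1024 24
```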
hyperparams.yaml CHANGED
@@ -1,22 +1,24 @@
  # ################################
- # Model: wav2vec2 + DNN + CTC/Attention
  # Augmentation: SpecAugment
- # Authors: Titouan Parcollet 2021
  # ################################

- sample_rate: 16000
- wav2vec2_hub: facebook/wav2vec2-large-it-voxpopuli
-
  # BPE parameters
  token_type: unigram # ["unigram", "bpe", "char"]
  character_coverage: 1.0

  # Model parameters
- activation: !name:torch.nn.LeakyReLU
- dnn_layers: 2
  dnn_neurons: 1024
- emb_size: 128
- dec_neurons: 1024

  # Outputs
  output_neurons: 1000 # BPE size, index(blank/eos/bos) = 0
@@ -26,93 +28,63 @@ output_neurons: 1000 # BPE size, index(blank/eos/bos) = 0
  blank_index: 0
  bos_index: 1
  eos_index: 2
- min_decode_ratio: 0.0
- max_decode_ratio: 1.0
- beam_size: 10
- eos_threshold: 1.5
- using_max_attn_shift: True
- max_attn_shift: 140
- ctc_weight_decode: 0.0
- temperature: 1.50
-
- enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
-     input_shape: [null, null, 1024]
-     activation: !ref <activation>
-     dnn_blocks: !ref <dnn_layers>
-     dnn_neurons: !ref <dnn_neurons>

  wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
-     source: !ref <wav2vec2_hub>
-     output_norm: True
-     freeze: True
-     save_path: model_checkpoints
-
- emb: !new:speechbrain.nnet.embedding.Embedding
-     num_embeddings: !ref <output_neurons>
-     embedding_dim: !ref <emb_size>
-
- dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
-     enc_dim: !ref <dnn_neurons>
-     input_size: !ref <emb_size>
-     rnn_type: gru
-     attn_type: location
-     hidden_size: 1024
-     attn_dim: 1024
-     num_layers: 1
-     scaling: 1.0
-     channels: 10
-     kernel_size: 100
-     re_init: True
-     dropout: 0.15

  ctc_lin: !new:speechbrain.nnet.linear.Linear
-     input_size: !ref <dnn_neurons>
-     n_neurons: !ref <output_neurons>
-
- seq_lin: !new:speechbrain.nnet.linear.Linear
-     input_size: !ref <dec_neurons>
-     n_neurons: !ref <output_neurons>

  log_softmax: !new:speechbrain.nnet.activations.Softmax
-     apply_log: True

  ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
-     blank_index: !ref <blank_index>
-
- seq_cost: !name:speechbrain.nnet.losses.nll_loss
-     label_smoothing: 0.1

  asr_model: !new:torch.nn.ModuleList
-     - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

  tokenizer: !new:sentencepiece.SentencePieceProcessor

  encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
      wav2vec2: !ref <wav2vec2>
      enc: !ref <enc>
-
- decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
-     embedding: !ref <emb>
-     decoder: !ref <dec>
-     linear: !ref <seq_lin>
-     ctc_linear: !ref <ctc_lin>
-     bos_index: !ref <bos_index>
-     eos_index: !ref <eos_index>
-     blank_index: !ref <blank_index>
-     min_decode_ratio: !ref <min_decode_ratio>
-     max_decode_ratio: !ref <max_decode_ratio>
-     beam_size: !ref <beam_size>
-     eos_threshold: !ref <eos_threshold>
-     using_max_attn_shift: !ref <using_max_attn_shift>
-     max_attn_shift: !ref <max_attn_shift>
-     temperature: !ref <temperature>

  modules:
-     encoder: !ref <encoder>
-     decoder: !ref <decoder>

  pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
-     loadables:
-         wav2vec2: !ref <wav2vec2>
-         asr: !ref <asr_model>
-         tokenizer: !ref <tokenizer>

  # ################################
+ # Model: wav2vec2 + DNN + CTC
  # Augmentation: SpecAugment
+ # Authors:
+ #  Sung-Lin Yeh 2021
+ #  Pooneh Mousavi 2023
  # ################################

  # BPE parameters
  token_type: unigram # ["unigram", "bpe", "char"]
  character_coverage: 1.0

  # Model parameters
+ # activation: !name:torch.nn.LeakyReLU
  dnn_neurons: 1024
+ wav2vec_output_dim: 1024
+ dropout: 0.15
+
+ sample_rate: 16000
+
+ wav2vec2_hub: facebook/wav2vec2-large-lv60

  # Outputs
  output_neurons: 1000 # BPE size, index(blank/eos/bos) = 0

  blank_index: 0
  bos_index: 1
  eos_index: 2
+
+ enc: !new:speechbrain.nnet.containers.Sequential
+     input_shape: [null, null, !ref <wav2vec_output_dim>]
+     linear1: !name:speechbrain.nnet.linear.Linear
+         n_neurons: !ref <dnn_neurons>
+         bias: True
+     bn1: !name:speechbrain.nnet.normalization.BatchNorm1d
+     activation: !new:torch.nn.LeakyReLU
+     drop: !new:torch.nn.Dropout
+         p: !ref <dropout>
+     linear2: !name:speechbrain.nnet.linear.Linear
+         n_neurons: !ref <dnn_neurons>
+         bias: True
+     bn2: !name:speechbrain.nnet.normalization.BatchNorm1d
+     activation2: !new:torch.nn.LeakyReLU
+     drop2: !new:torch.nn.Dropout
+         p: !ref <dropout>
+     linear3: !name:speechbrain.nnet.linear.Linear
+         n_neurons: !ref <dnn_neurons>
+         bias: True
+     bn3: !name:speechbrain.nnet.normalization.BatchNorm1d
+     activation3: !new:torch.nn.LeakyReLU

  wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
+     source: !ref <wav2vec2_hub>
+     output_norm: True
+     freeze: True
+     save_path: wav2vec2_checkpoint

  ctc_lin: !new:speechbrain.nnet.linear.Linear
+     input_size: !ref <dnn_neurons>
+     n_neurons: !ref <output_neurons>

  log_softmax: !new:speechbrain.nnet.activations.Softmax
+     apply_log: True

  ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
+     blank_index: !ref <blank_index>

  asr_model: !new:torch.nn.ModuleList
+     - [!ref <enc>, !ref <ctc_lin>]

  tokenizer: !new:sentencepiece.SentencePieceProcessor

  encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
      wav2vec2: !ref <wav2vec2>
      enc: !ref <enc>
+     ctc_lin: !ref <ctc_lin>

  modules:
+     encoder: !ref <encoder>
+
+ decoding_function: !name:speechbrain.decoders.ctc_greedy_decode
+     blank_id: !ref <blank_index>

  pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         wav2vec2: !ref <wav2vec2>
+         asr: !ref <asr_model>
+         tokenizer: !ref <tokenizer>
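
The seq2seq beam searcher from the old hyperparams is gone; `decoding_function` now points at plain greedy CTC decoding. A minimal sketch of what that function does with the encoder output (the tensor shapes are illustrative assumptions):

```python
import torch
from speechbrain.decoders import ctc_greedy_decode

# Stand-in for the encoder output: [batch, time, output_neurons] log-probabilities.
log_probs = torch.randn(1, 50, 1000).log_softmax(dim=-1)
wav_lens = torch.tensor([1.0])  # relative lengths; 1.0 means the full sequence

# Greedy CTC: per-frame argmax, merge repeated tokens, drop blanks (index 0).
hyps = ctc_greedy_decode(log_probs, wav_lens, blank_id=0)
print(hyps)  # one list of token ids per batch element
```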
preprocessor_config.json CHANGED
@@ -1,9 +1,8 @@
  {
-   "do_normalize": true,
-   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
-   "feature_size": 1,
-   "padding_side": "right",
-   "padding_value": 0,
-   "return_attention_mask": true,
-   "sampling_rate": 16000
- }

  {
+   "do_normalize": true,
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
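
These settings mirror the wav2vec 2.0 defaults: raw 16 kHz mono input, zero padding on the right, and an attention mask returned for batched inference. A minimal sketch of reading them back with `transformers` (an assumption; SpeechBrain consumes these values through its wav2vec2 lobe):

```python
from transformers import Wav2Vec2FeatureExtractor

# Load the feature-extractor settings shipped with this repo.
fe = Wav2Vec2FeatureExtractor.from_pretrained("speechbrain/asr-wav2vec2-commonvoice-14-en")
print(fe.sampling_rate, fe.feature_size, fe.return_attention_mask)  # 16000 1 True
```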