dwgnr committed on
Commit
0aed1a2
1 Parent(s): d8e97b0

add wav2vec2 model

Files changed (8)
  1. .gitattributes +3 -0
  2. README.md +111 -0
  3. asr.ckpt +3 -0
  4. config.json +76 -0
  5. hyperparams.yaml +88 -0
  6. preprocessor_config.json +8 -0
  7. tokenizer.ckpt +3 -0
  8. wav2vec2.ckpt +3 -0
.gitattributes CHANGED
@@ -30,3 +30,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
30
  *.zip filter=lfs diff=lfs merge=lfs -text
31
  *.zst filter=lfs diff=lfs merge=lfs -text
32
  *tfevents* filter=lfs diff=lfs merge=lfs -text
33
+ asr.ckpt filter=lfs diff=lfs merge=lfs -text
34
+ tokenizer.ckpt filter=lfs diff=lfs merge=lfs -text
35
+ wav2vec2.ckpt filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,114 @@
1
  ---
2
+ language:
3
+ - en
4
+ thumbnail: null
5
+ pipeline_tag: automatic-speech-recognition
6
+ tags:
7
+ - CTC
8
+ - pytorch
9
+ - speechbrain
10
  license: apache-2.0
11
+ datasets:
12
+ - switchboard
13
+ metrics:
14
+ - wer
15
+ - ser
16
+
17
  ---
18
+
19
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
20
+ <br/><br/>
21
+
22
+ # wav2vec 2.0 with CTC/Attention trained on Switchboard (No LM)
23
+
24
+ This repository provides all the necessary tools to perform automatic speech
25
+ recognition from an end-to-end system pretrained on the Switchboard corpus within
26
+ SpeechBrain. For a better experience, we encourage you to learn more about
27
+ [SpeechBrain](https://speechbrain.github.io).
28
+
29
+ The performance of the model is the following:
30
+
31
+ | Release | Swbd SER | Callhome SER | Eval2000 SER | Swbd WER | Callhome WER | Eval2000 WER | GPUs |
32
+ |:--------:|:--------:|:------------:|:------------:|:--------:|:------------:|:------------:|:-----------:|
33
+ | 17-09-22 | 48.60 | 55.76 | 52.96 | 8.76 | 14.67 | 11.78 | 4xA100 40GB |
34
+
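As a reading aid for the table above, here is a minimal sketch of how the two metrics are commonly defined. It assumes SER means sentence error rate (the share of utterances with at least one word error), as in SpeechBrain's WER summaries. The function names and the toy transcripts are hypothetical, and this is not the scoring script that produced the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_and_ser(refs, hyps):
    """Return (WER, SER) in percent over paired reference/hypothesis strings."""
    total_errors = total_words = wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors = edit_distance(ref_words, hyp_words)
        total_errors += errors
        total_words += len(ref_words)
        wrong_sentences += errors > 0
    return 100.0 * total_errors / total_words, 100.0 * wrong_sentences / len(refs)


# Toy data: one substitution in the second utterance.
refs = ["yeah i think so", "okay bye"]
hyps = ["yeah i think so", "okay by"]
wer, ser = wer_and_ser(refs, hyps)
```

This also shows why SER can be much higher than WER: a single wrong word marks the whole utterance as incorrect.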
35
+ ## Pipeline description
36
+
37
+ This ASR system is composed of 2 different but linked blocks:
38
+ - Tokenizer (unigram) that transforms words into subword units. It is trained on the Switchboard training transcripts and the Fisher corpus.
39
+ - Acoustic model (wav2vec 2.0 + CTC). A pretrained wav2vec 2.0 model ([facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)) is combined with two DNN layers and fine-tuned on Switchboard.
40
+ The final acoustic representation is then passed to a greedy CTC decoder.
41
+
42
+ The system is trained with recordings sampled at 16kHz (single channel).
43
+ If needed, the code automatically normalizes your audio (resampling and mono channel selection) when you call *transcribe_file*.
44
+
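The greedy CTC decoding step mentioned in the pipeline description can be sketched in a few lines. This is an illustrative re-implementation in plain Python, not SpeechBrain's `ctc_greedy_decode`; the function name, the toy score matrix and the four-symbol vocabulary are assumptions made for the example. The blank index of 0 matches `blank_index` in `hyperparams.yaml`.

```python
from itertools import groupby


def greedy_ctc_decode(frame_scores, blank_id=0):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    # 1) Arg-max over the vocabulary for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2) Collapse consecutive duplicates (CTC emits a token over many frames).
    collapsed = [token for token, _ in groupby(best_path)]
    # 3) Remove the blank symbol, which only separates repeated tokens.
    return [token for token in collapsed if token != blank_id]


def one_hot(index, size=4):
    """Toy per-frame scores with a clear winner at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Best path: blank, 2, 2, blank, 2, 3, 3, blank
frames = [one_hot(i) for i in (0, 2, 2, 0, 2, 3, 3, 0)]
decoded = greedy_ctc_decode(frames)  # [2, 2, 3]
```

The blank between the two runs of token 2 is what lets the decoder emit the same token twice in a row. The resulting ids would then be mapped back to text by the tokenizer.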
45
+ ## Install SpeechBrain
46
+
47
+ First of all, please install transformers and SpeechBrain with the following command:
48
+
49
+ ```
50
+ pip install speechbrain transformers
51
+ ```
52
+
53
+ Please note that we encourage you to read our tutorials and learn more about
54
+ [SpeechBrain](https://speechbrain.github.io).
55
+
56
+ ### Transcribing your own audio files
57
+
58
+ ```python
59
+ from speechbrain.pretrained import EncoderASR
60
+
61
+ asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-switchboard", savedir="pretrained_models/asr-wav2vec2-switchboard")
62
+ asr_model.transcribe_file('path/to/audiofile')
63
+
64
+ ```
65
+ ### Inference on GPU
66
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
67
+
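A hedged sketch of the GPU option described above: it passes `run_opts` to `from_hparams` and falls back to the CPU when CUDA is unavailable. The helper names `choose_run_opts` and `load_asr_model` are hypothetical, and the fallback logic is an addition for illustration rather than part of the model card. The imports are deferred so the helper can be read and tested without SpeechBrain installed.

```python
def choose_run_opts(cuda_available):
    """Build the run_opts dict expected by from_hparams."""
    return {"device": "cuda" if cuda_available else "cpu"}


def load_asr_model():
    """Load the pretrained model on the GPU when one is available."""
    # Deferred imports: both packages are only needed when actually loading.
    import torch
    from speechbrain.pretrained import EncoderASR

    return EncoderASR.from_hparams(
        source="speechbrain/asr-wav2vec2-switchboard",
        savedir="pretrained_models/asr-wav2vec2-switchboard",
        run_opts=choose_run_opts(torch.cuda.is_available()),
    )
```

Calling `load_asr_model().transcribe_file('path/to/audiofile')` would then run inference on whichever device was selected.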
68
+ ### Training
69
+ The model was trained with SpeechBrain (Commit hash: '70904d0').
70
+ To train it from scratch, follow these steps:
71
+
72
+ 1. Clone SpeechBrain:
73
+ ```bash
74
+ git clone https://github.com/speechbrain/speechbrain/
75
+ ```
76
+ 2. Install it:
77
+ ```bash
78
+ cd speechbrain
79
+ pip install -r requirements.txt
80
+ pip install -e .
81
+ ```
82
+
83
+ 3. Run Training:
84
+ ```bash
85
+ cd recipes/Switchboard/ASR/CTC/
86
+ python train_with_wav2vec.py hparams/train_with_wav2vec.yaml --data_folder=your_data_folder
87
+ ```
88
+
89
+ ### Limitations
90
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
91
+
92
+ #### Referencing SpeechBrain
93
+
94
+ ```
95
+ @misc{SB2021,
96
+ author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
97
+ title = {SpeechBrain},
98
+ year = {2021},
99
+ publisher = {GitHub},
100
+ journal = {GitHub repository},
101
+ howpublished = {\url{https://github.com/speechbrain/speechbrain}},
102
+ }
103
+ ```
104
+
105
+ #### About SpeechBrain
106
+ SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.
107
+
108
+ Website: https://speechbrain.github.io/
109
+
110
+ GitHub: https://github.com/speechbrain/speechbrain
111
+
112
+
113
+
114
+
asr.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:91d3f9494dd14d52e67de937f7aeec89f1fb6f3376827d4f66a3ae45dfc03166
3
+ size 16751264
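The three `.ckpt` files in this commit are stored as Git LFS pointers rather than as binary content. As a small illustration, the sketch below parses such a pointer into its fields; the function name `parse_lfs_pointer` and the returned dictionary keys are hypothetical, and only the three keys visible in this diff are handled.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into version, hash algorithm, digest, size."""
    # Each line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:91d3f9494dd14d52e67de937f7aeec89f1fb6f3376827d4f66a3ae45dfc03166
size 16751264
"""
info = parse_lfs_pointer(pointer)
```

After downloading the real file, its SHA-256 digest and byte size can be compared against these values to confirm the checkpoint is intact.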
config.json ADDED
@@ -0,0 +1,76 @@
1
+ {
2
+ "speechbrain_interface": "EncoderASR",
3
+ "activation_dropout": 0.0,
4
+ "apply_spec_augment": true,
5
+ "architectures": [
6
+ "Wav2Vec2Model"
7
+ ],
8
+ "attention_dropout": 0.1,
9
+ "bos_token_id": 1,
10
+ "conv_bias": true,
11
+ "conv_dim": [
12
+ 512,
13
+ 512,
14
+ 512,
15
+ 512,
16
+ 512,
17
+ 512,
18
+ 512
19
+ ],
20
+ "conv_kernel": [
21
+ 10,
22
+ 3,
23
+ 3,
24
+ 3,
25
+ 3,
26
+ 2,
27
+ 2
28
+ ],
29
+ "conv_stride": [
30
+ 5,
31
+ 2,
32
+ 2,
33
+ 2,
34
+ 2,
35
+ 2,
36
+ 2
37
+ ],
38
+ "ctc_loss_reduction": "sum",
39
+ "ctc_zero_infinity": false,
40
+ "do_stable_layer_norm": true,
41
+ "eos_token_id": 2,
42
+ "feat_extract_activation": "gelu",
43
+ "feat_extract_dropout": 0.0,
44
+ "feat_extract_norm": "layer",
45
+ "feat_proj_dropout": 0.1,
46
+ "final_dropout": 0.0,
47
+ "gradient_checkpointing": false,
48
+ "hidden_act": "gelu",
49
+ "hidden_dropout": 0.1,
50
+ "hidden_size": 1024,
51
+ "initializer_range": 0.02,
52
+ "intermediate_size": 4096,
53
+ "layer_norm_eps": 1e-05,
54
+ "layerdrop": 0.1,
55
+ "mask_channel_length": 10,
56
+ "mask_channel_min_space": 1,
57
+ "mask_channel_other": 0.0,
58
+ "mask_channel_prob": 0.0,
59
+ "mask_channel_selection": "static",
60
+ "mask_feature_length": 10,
61
+ "mask_feature_prob": 0.0,
62
+ "mask_time_length": 10,
63
+ "mask_time_min_space": 1,
64
+ "mask_time_other": 0.0,
65
+ "mask_time_prob": 0.075,
66
+ "mask_time_selection": "static",
67
+ "model_type": "wav2vec2",
68
+ "num_attention_heads": 16,
69
+ "num_conv_pos_embedding_groups": 16,
70
+ "num_conv_pos_embeddings": 128,
71
+ "num_feat_extract_layers": 7,
72
+ "num_hidden_layers": 24,
73
+ "pad_token_id": 0,
74
+ "transformers_version": "4.5.1",
75
+ "vocab_size": 32
76
+ }
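The `conv_kernel` and `conv_stride` lists above determine how many acoustic frames the feature encoder produces per second of audio. The sketch below works this out, assuming unpadded 1-D convolutions and the 16 kHz sampling rate stated elsewhere in this commit; the function names are hypothetical.

```python
conv_kernel = [10, 3, 3, 3, 3, 2, 2]  # from config.json
conv_stride = [5, 2, 2, 2, 2, 2, 2]   # from config.json
sample_rate = 16000                   # from hyperparams.yaml


def num_output_frames(num_samples):
    """Frames left after the seven conv layers (no padding assumed)."""
    length = num_samples
    for kernel, stride in zip(conv_kernel, conv_stride):
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_hop():
    """Samples seen by one output frame, and samples between frames."""
    field, hop = 1, 1
    for kernel, stride in zip(conv_kernel, conv_stride):
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


field, hop = receptive_field_and_hop()          # (400, 320)
frames_per_second = num_output_frames(sample_rate)  # 49
```

Under these assumptions each frame covers 400 samples (25 ms) and frames are 320 samples (20 ms) apart, so one second of audio yields 49 frames for the CTC layer to label.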
hyperparams.yaml ADDED
@@ -0,0 +1,88 @@
1
+ # ################################
2
+ # Model: wav2vec2 + DNN + CTC
3
+ # Augmentation: SpecAugment
4
+ # Authors: Titouan Parcollet 2021, Dominik Wagner 2022
5
+ # ################################
6
+
7
+ wav2vec2_hub: facebook/wav2vec2-large-lv60
8
+ sample_rate: 16000
9
+
10
+ # BPE parameters
11
+ token_type: unigram # ["unigram", "bpe", "char"]
12
+ character_coverage: 1.0
13
+
14
+ # Model parameters
15
+ wav2vec_output_dim: 1024
16
+ dnn_neurons: 1024
17
+ freeze_wav2vec: False
18
+ dropout: 0.15
19
+
20
+ # Outputs
21
+ output_neurons: 1000 # BPE size, index(blank/eos/bos) = 0
22
+
23
+ # Decoding parameters
24
+ # Be sure that the bos and eos index match with the BPEs ones
25
+ blank_index: 0
26
+ bos_index: 1
27
+ eos_index: 2
28
+
29
+ enc: !new:speechbrain.nnet.containers.Sequential
30
+ input_shape: [null, null, !ref <wav2vec_output_dim>]
31
+ linear1: !name:speechbrain.nnet.linear.Linear
32
+ n_neurons: !ref <dnn_neurons>
33
+ bias: True
34
+ bn1: !name:speechbrain.nnet.normalization.BatchNorm1d
35
+ activation: !new:torch.nn.LeakyReLU
36
+ drop: !new:torch.nn.Dropout
37
+ p: !ref <dropout>
38
+ linear2: !name:speechbrain.nnet.linear.Linear
39
+ n_neurons: !ref <dnn_neurons>
40
+ bias: True
41
+ bn2: !name:speechbrain.nnet.normalization.BatchNorm1d
42
+ activation2: !new:torch.nn.LeakyReLU
43
+ drop2: !new:torch.nn.Dropout
44
+ p: !ref <dropout>
45
+ linear3: !name:speechbrain.nnet.linear.Linear
46
+ n_neurons: !ref <dnn_neurons>
47
+ bias: True
48
+ bn3: !name:speechbrain.nnet.normalization.BatchNorm1d
49
+ activation3: !new:torch.nn.LeakyReLU
50
+
51
+ wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
52
+ source: !ref <wav2vec2_hub>
53
+ output_norm: True
54
+ freeze: !ref <freeze_wav2vec>
55
+ save_path: wav2vec2_checkpoint
56
+
57
+ ctc_lin: !new:speechbrain.nnet.linear.Linear
58
+ input_size: !ref <dnn_neurons>
59
+ n_neurons: !ref <output_neurons>
60
+
61
+ log_softmax: !new:speechbrain.nnet.activations.Softmax
62
+ apply_log: True
63
+
64
+ ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
65
+ blank_index: !ref <blank_index>
66
+
67
+ asr_model: !new:torch.nn.ModuleList
68
+ - [!ref <enc>, !ref <ctc_lin>]
69
+
70
+ tokenizer: !new:sentencepiece.SentencePieceProcessor
71
+
72
+ encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
73
+ wav2vec2: !ref <wav2vec2>
74
+ enc: !ref <enc>
75
+ ctc_lin: !ref <ctc_lin>
76
+
77
+ modules:
78
+ encoder: !ref <encoder>
79
+
80
+ decoding_function: !name:speechbrain.decoders.ctc_greedy_decode
81
+ blank_id: !ref <blank_index>
82
+
83
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
84
+ loadables:
85
+ wav2vec2: !ref <wav2vec2>
86
+ asr: !ref <asr_model>
87
+ tokenizer: !ref <tokenizer>
88
+
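As a rough cross-check of the configuration above, the trainable parameters of the `asr_model` entry (the `enc` block plus `ctc_lin`) can be counted by hand. The sketch assumes that each batch-norm layer carries a learnable scale and shift and that `ctc_lin` uses a bias term; neither is stated explicitly in the YAML, so treat both as assumptions. The helper names are hypothetical.

```python
wav2vec_output_dim = 1024  # from hyperparams.yaml
dnn_neurons = 1024         # from hyperparams.yaml
output_neurons = 1000      # from hyperparams.yaml


def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def batchnorm_params(n_features):
    """Learnable scale and shift (assumes affine batch norm)."""
    return 2 * n_features


# enc: three Linear + BatchNorm1d pairs (activations and dropout have no weights).
enc_params = 0
n_in = wav2vec_output_dim
for _ in range(3):
    enc_params += linear_params(n_in, dnn_neurons) + batchnorm_params(dnn_neurons)
    n_in = dnn_neurons

# ctc_lin: projection onto the subword vocabulary (bias assumed).
ctc_params = linear_params(dnn_neurons, output_neurons)

total_params = enc_params + ctc_params
approx_bytes_fp32 = 4 * total_params
```

Under these assumptions the total is 4,179,944 parameters, or about 16.7 MB in 32-bit floats. That is the same order of magnitude as the 16,751,264-byte `asr.ckpt` pointer in this commit; a saved checkpoint would additionally hold batch-norm running statistics and serialization overhead.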
preprocessor_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "do_normalize": true,
3
+ "feature_size": 1,
4
+ "padding_side": "right",
5
+ "padding_value": 0.0,
6
+ "return_attention_mask": true,
7
+ "sampling_rate": 16000
8
+ }
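To make the fields above concrete, here is a plain-Python illustration of what such a preprocessor configuration typically implies: per-utterance zero-mean, unit-variance normalization, right padding with `padding_value`, and an attention mask marking real samples. It is a sketch under stated assumptions, not the `transformers` feature extractor itself; the function name and the small epsilon are hypothetical choices.

```python
import math


def normalize_and_pad(waveforms, padding_value=0.0, eps=1e-7):
    """Normalize each waveform, then right-pad the batch to equal length."""
    max_len = max(len(w) for w in waveforms)
    batch, masks = [], []
    for w in waveforms:
        mean = sum(w) / len(w)
        var = sum((x - mean) ** 2 for x in w) / len(w)
        scale = math.sqrt(var + eps)  # eps guards against silent inputs
        normed = [(x - mean) / scale for x in w]
        pad = max_len - len(w)
        batch.append(normed + [padding_value] * pad)  # padding_side: right
        masks.append([1] * len(w) + [0] * pad)        # 1 = real sample
    return batch, masks


batch, masks = normalize_and_pad([[1.0, -1.0, 1.0, -1.0], [0.5, 1.5]])
```

The mask lets downstream layers ignore the padded positions of the shorter utterance.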
tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17667ee5f627ce940fb671258a9340e12875fa9b02476061112df250bee538f4
3
+ size 253486
wav2vec2.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:06f7c5b9cb72a46f606315c49232b5bb4a7d055196c843a95445c74402fa1794
3
+ size 1261923125