Mirco committed
Commit
3b23d0f
1 Parent(s): d5536b7

model upload

Files changed (7)
  1. .gitattributes +3 -0
  2. README.md +115 -0
  3. example_mandarin.wav +0 -0
  4. hyperparams.yaml +113 -0
  5. model.ckpt +3 -0
  6. tokenizer.ckpt +3 -0
  7. wav2vec2.ckpt +3 -0
.gitattributes CHANGED
@@ -14,3 +14,6 @@
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
+ model.ckpt filter=lfs diff=lfs merge=lfs -text
+ tokenizer.ckpt filter=lfs diff=lfs merge=lfs -text
+ wav2vec2.ckpt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ language: "zh"
+ thumbnail:
+ tags:
+ - ASR
+ - CTC
+ - Attention
+ - Transformers
+ - pytorch
+ license: "apache-2.0"
+ datasets:
+ - aishell
+ metrics:
+ - wer
+ - cer
+ ---
+
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
+ <br/><br/>
+
+ # Transformer for AISHELL (Mandarin Chinese)
+
+ This repository provides all the necessary tools to perform automatic speech
+ recognition with an end-to-end system pretrained on AISHELL (Mandarin Chinese)
+ within SpeechBrain. For a better experience, we encourage you to learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ The model achieves the following performance:
+
+ | Release | Dev CER | Test CER | GPUs | Full Results |
+ |:-------------:|:--------------:|:--------------:|:--------:|:--------:|
+ | 05-03-21 | 5.60 | 6.04 | 2xV100 32GB | [Google Drive](https://drive.google.com/drive/folders/1zlTBib0XEwWeyhaXDXnkqtPsIBI18Uzs?usp=sharing) |
+
+ ## Pipeline description
+
+ This ASR system is composed of two blocks working together:
+ - A unigram tokenizer that transforms words into subword units; it is trained on
+ the training transcriptions of AISHELL-1.
+ - An acoustic model made of a transformer encoder and a joint decoder, so that the
+ decoding combines the transformer and CTC probabilities (see the sketch after this list).
+
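+ As a rough illustration of this joint decoding, each beam-search candidate is
+ scored by a weighted sum of the attention-decoder and CTC log-probabilities.
+ The snippet below is a minimal, hypothetical sketch (the tensors are made-up
+ placeholders; the 0.40 weight mirrors `ctc_weight_decode` in `hyperparams.yaml`):
+
+ ```python
+ import torch
+
+ # Made-up log-probabilities for three beam candidates (placeholders).
+ att_logp = torch.tensor([-1.2, -2.3, -0.7])  # attention-decoder branch
+ ctc_logp = torch.tensor([-1.5, -2.0, -1.1])  # CTC branch
+
+ # Weighted combination; 0.40 mirrors ctc_weight_decode in hyperparams.yaml.
+ ctc_weight = 0.40
+ joint = (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
+ print(joint.argmax().item())  # index of the best-scoring candidate
+ ```
+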
+ To train this system from scratch, [see our SpeechBrain recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1).
+
+ ## Install SpeechBrain
+
+ First of all, please install SpeechBrain with the following command:
+
+ ```bash
+ pip install speechbrain
+ ```
+
+ Note that we encourage you to read our tutorials and learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ ### Transcribing your own audio files (in Mandarin Chinese)
+
+ ```python
+ from speechbrain.pretrained import EncoderDecoderASR
+
+ asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell", savedir="pretrained_models/asr-transformer-aishell")
+ asr_model.transcribe_file("speechbrain/asr-transformer-aishell/example_mandarin.wav")
+ ```
+
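+ If you want to transcribe several files at once, `transcribe_batch` can be used
+ instead. The sketch below reuses the `asr_model` from the snippet above and
+ assumes `your_audio1.wav` and `your_audio2.wav` are placeholder paths to local
+ recordings:
+
+ ```python
+ import torch
+
+ # Load two (hypothetical) files with the model's own audio loader,
+ # which resamples them to the expected sampling rate.
+ wav1 = asr_model.load_audio("your_audio1.wav")
+ wav2 = asr_model.load_audio("your_audio2.wav")
+
+ # Pad into a single batch and pass relative lengths, as transcribe_batch expects.
+ batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
+ lengths = torch.tensor([wav1.shape[0], wav2.shape[0]], dtype=torch.float)
+ words, tokens = asr_model.transcribe_batch(batch, lengths / lengths.max())
+ print(words)
+ ```
+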
+ ### Inference on GPU
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
+
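+ For example:
+
+ ```python
+ # Same loading call as above, but placing the model on a CUDA device.
+ asr_model = EncoderDecoderASR.from_hparams(
+     source="speechbrain/asr-transformer-aishell",
+     savedir="pretrained_models/asr-transformer-aishell",
+     run_opts={"device": "cuda"},
+ )
+ ```
+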
+ ### Training
+ The model was trained with SpeechBrain (commit hash: `986a2175`).
+ To train it from scratch, follow these steps:
+ 1. Clone SpeechBrain:
+ ```bash
+ git clone https://github.com/speechbrain/speechbrain/
+ ```
+ 2. Install it:
+ ```bash
+ cd speechbrain
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+ 3. Run training:
+ ```bash
+ cd recipes/AISHELL-1/ASR/transformer/
+ python train.py hparams/train_ASR_transformer.yaml --data_folder=your_data_folder
+ ```
+
+ You can find our training results (models, logs, etc.) [here](https://drive.google.com/drive/folders/1QU18YoauzLOXueogspT0CgR5bqJ6zFfu?usp=sharing).
+
+ ### Limitations
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
+
+ # **About SpeechBrain**
+ - Website: https://speechbrain.github.io/
+ - Code: https://github.com/speechbrain/speechbrain/
+ - HuggingFace: https://huggingface.co/speechbrain/
+
+ # **Citing SpeechBrain**
+ Please cite SpeechBrain if you use it for your research or business.
+
+ ```bibtex
+ @misc{speechbrain,
+   title={SpeechBrain: A General-Purpose Speech Toolkit},
+   author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
+   year={2021},
+   eprint={2106.04624},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS}
+ }
+ ```
example_mandarin.wav ADDED
Binary file (69 kB)
hyperparams.yaml ADDED
@@ -0,0 +1,113 @@
+ # ############################################################################
+ # Model: E2E ASR with Transformer
+ # Encoder: Transformer Encoder
+ # Decoder: Transformer Decoder + (CTC/ATT joint) beamsearch
+ # Tokens: BPE with unigram
+ # Losses: CTC + KLdiv (label smoothing loss)
+ # Training: AISHELL-1
+ # Authors: Jianyuan Zhong, Titouan Parcollet
+ # ############################################################################
+
+ # Feature parameters
+ sample_rate: 16000
+ n_fft: 400
+ n_mels: 80
+ wav2vec2_hub: facebook/wav2vec2-large-it-voxpopuli
+
+ # Assumed local folder where downloaded checkpoints are cached.
+ save_folder: pretrained_models/asr-transformer-aishell
+
+ ####################### Model parameters ###########################
+ # Transformer
+ d_model: 256
+ nhead: 4
+ num_encoder_layers: 2
+ num_decoder_layers: 6
+ d_ffn: 2048
+ transformer_dropout: 0.1
+ activation: !name:torch.nn.GELU
+ output_neurons: 5000
+ vocab_size: 5000
+
+ # Outputs
+ blank_index: 0
+ label_smoothing: 0.1
+ pad_index: 0
+ bos_index: 1
+ eos_index: 2
+ unk_index: 0
+
+ # Decoding parameters
+ min_decode_ratio: 0.0
+ max_decode_ratio: 1.0
+ valid_search_interval: 10
+ valid_beam_size: 10
+ test_beam_size: 10
+ ctc_weight_decode: 0.40
+
+ ############################## models ################################
+
+ wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
+     source: !ref <wav2vec2_hub>
+     output_norm: True
+     freeze: True
+     pretrain: False # Pretraining is managed by the SpeechBrain pre-trainer.
+     save_path: !ref <save_folder>/wav2vec2_checkpoint
+
+ Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
+     input_size: 1024
+     tgt_vocab: !ref <output_neurons>
+     d_model: !ref <d_model>
+     nhead: !ref <nhead>
+     num_encoder_layers: !ref <num_encoder_layers>
+     num_decoder_layers: !ref <num_decoder_layers>
+     d_ffn: !ref <d_ffn>
+     dropout: !ref <transformer_dropout>
+     activation: !ref <activation>
+     normalize_before: True
+
+ ctc_lin: !new:speechbrain.nnet.linear.Linear
+     input_size: !ref <d_model>
+     n_neurons: !ref <output_neurons>
+
+ seq_lin: !new:speechbrain.nnet.linear.Linear
+     input_size: !ref <d_model>
+     n_neurons: !ref <output_neurons>
+
+ tokenizer: !new:sentencepiece.SentencePieceProcessor
+
+ asr_model: !new:torch.nn.ModuleList
+     - [!ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
+
+ # Here, we extract the encoder from the Transformer model.
+ Tencoder: !new:speechbrain.lobes.models.transformer.TransformerASR.EncoderWrapper
+     transformer: !ref <Transformer>
+
+ # We compose the inference (encoder) pipeline.
+ encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
+     input_shape: [null, null, !ref <n_mels>]
+     wav2vec2: !ref <wav2vec2>
+     transformer_encoder: !ref <Tencoder>
+
+ decoder: !new:speechbrain.decoders.S2STransformerBeamSearch
+     modules: [!ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
+     bos_index: !ref <bos_index>
+     eos_index: !ref <eos_index>
+     blank_index: !ref <blank_index>
+     min_decode_ratio: !ref <min_decode_ratio>
+     max_decode_ratio: !ref <max_decode_ratio>
+     beam_size: !ref <test_beam_size>
+     ctc_weight: !ref <ctc_weight_decode>
+     using_eos_threshold: False
+     length_normalization: True
+
+ modules:
+     encoder: !ref <encoder>
+     decoder: !ref <decoder>
+
+ log_softmax: !new:torch.nn.LogSoftmax
+     dim: -1
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         wav2vec2: !ref <wav2vec2>
+         model: !ref <asr_model>
+         tokenizer: !ref <tokenizer>
model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3edcc685cc45c2775ff90eaad6631e8db3f7de2154e479818a86d3559a6b7bee
+ size 67484671
tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f0478ccc6dac61ce0e6149a84e531ff7d300b133d5717cc9d531b00837ac444
+ size 300111
wav2vec2.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98dc840993f8ddd3151611909be73a740d2be8bdf3621437e8ffd738e2a3a6b8
+ size 1261930757