mirco committed on
Commit
e19217e
1 Parent(s): 49f6cc6

upload model

.gitattributes CHANGED
@@ -14,3 +14,5 @@
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
+ classifier.ckpt filter=lfs diff=lfs merge=lfs -text
+ embedding_model.ckpt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,137 @@
+ ---
+ language: "en"
+ thumbnail:
+ tags:
+ - speechbrain
+ - embeddings
+ - Sound
+ - Keywords
+ - Keyword Spotting
+ - pytorch
+ - ECAPA-TDNN
+ - TDNN
+ - Command Recognition
+ license: "apache-2.0"
+ datasets:
+ - Urbansound8k
+ metrics:
+ - Accuracy
+
+ ---
+
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
+ <br/><br/>
+
+ # Sound Recognition with ECAPA embeddings on UrbanSound8k
+
+ This repository provides all the necessary tools to perform sound recognition with SpeechBrain using a model pretrained on UrbanSound8k.
+ You can download the dataset [here](https://urbansounddataset.weebly.com/urbansound8k.html).
+ The provided system can recognize the following 10 sound classes:
+ ```
+ dog_bark, children_playing, air_conditioner, street_music, gun_shot, siren, engine_idling, jackhammer, drilling, car_horn
+ ```
+
+ For a better experience, we encourage you to learn more about
+ [SpeechBrain](https://speechbrain.github.io). The model's performance on the test set is:
+
+ | Release | Accuracy 1-fold (%) |
+ |:-------------:|:--------------:|
+ | 04-06-21 | 75.5 |
+
+
+ ## Pipeline description
+ This system is composed of an ECAPA model coupled with statistical pooling. A classifier, trained with Categorical Cross-Entropy Loss, is applied on top of that.
+
+ ## Install SpeechBrain
+
+ First of all, please install SpeechBrain with the following command:
+
+ ```
+ pip install speechbrain
+ ```
+
+ Please note that we encourage you to read our tutorials and learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ ### Perform Sound Recognition
+
+ ```python
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+ classifier = EncoderClassifier.from_hparams(source="speechbrain/urbansound8k_ecapa", savedir="pretrained_models/urbansound8k_ecapa")
+ out_prob, score, index, text_lab = classifier.classify_file('speechbrain/urbansound8k_ecapa/dog_bark.wav')
+ print(text_lab)
+ ```
+
+ ### Inference on GPU
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
+
+ ### Training
+ The model was trained with SpeechBrain (8cab8b0c).
+ To train it from scratch, follow these steps:
+ 1. Clone SpeechBrain:
+ ```bash
+ git clone https://github.com/speechbrain/speechbrain/
+ ```
+ 2. Install it:
+ ```
+ cd speechbrain
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ 3. Run Training:
+ ```
+ cd recipes/UrbanSound8k/SoundClassification
+ python train.py hparams/train_ecapa_tdnn.yaml --data_folder=your_data_folder
+ ```
+
+ You can find our training results (models, logs, etc.) [here](https://drive.google.com/drive/folders/1sItfg_WNuGX6h2dCs8JTGq2v2QoNTaUg?usp=sharing).
+
+ ### Limitations
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
+
+ #### Referencing ECAPA
+ ```bibtex
+ @inproceedings{DBLP:conf/interspeech/DesplanquesTD20,
+   author = {Brecht Desplanques and
+             Jenthe Thienpondt and
+             Kris Demuynck},
+   editor = {Helen Meng and
+             Bo Xu and
+             Thomas Fang Zheng},
+   title = {{ECAPA-TDNN:} Emphasized Channel Attention, Propagation and Aggregation
+            in {TDNN} Based Speaker Verification},
+   booktitle = {Interspeech 2020},
+   pages = {3830--3834},
+   publisher = {{ISCA}},
+   year = {2020},
+ }
+ ```
+
+ #### Referencing UrbanSound
+ ```bibtex
+ @inproceedings{Salamon:UrbanSound:ACMMM:14,
+   Author = {Salamon, J. and Jacoby, C. and Bello, J. P.},
+   Booktitle = {22nd {ACM} International Conference on Multimedia (ACM-MM'14)},
+   Month = {Nov.},
+   Pages = {1041--1044},
+   Title = {A Dataset and Taxonomy for Urban Sound Research},
+   Year = {2014}}
+ ```
+
+ # **Citing SpeechBrain**
+ Please cite SpeechBrain if you use it for your research or business.
+
+
+ ```bibtex
+ @misc{speechbrain,
+   title={{SpeechBrain}: A General-Purpose Speech Toolkit},
+   author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
+   year={2021},
+   eprint={2106.04624},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS},
+   note={arXiv:2106.04624}
+ }
+ ```
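The README's pipeline description mentions statistical pooling on top of the ECAPA encoder. As a rough illustration only (this is not the SpeechBrain implementation, which works on tensors and applies attention weights), the sketch below pools a variable-length sequence of frame vectors into one fixed-size vector by concatenating the per-dimension mean and standard deviation. The function name `stats_pool` and the toy input are assumptions made for this example.

```python
import math
from typing import List, Sequence


def stats_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the per-dimension mean and population standard deviation."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    # Mean of every feature dimension across time.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation of every feature dimension across time.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical input: three frames with two features each.
pooled = stats_pool([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
print(len(pooled), pooled)
```

Whatever the number of frames, the output has twice the feature dimension, which is what lets a fixed-size classifier sit on top of clips of different lengths.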
classifier.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7293c782d5c314c11ed43ce64100e7f13aa7a1e3d83327488ad98560f77f9b3e
+ size 35371
embedding_model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b88a7bd3146689fc2837f9911fb88a40fd30d4c71ba65c461c7532a49f21080f
+ size 83310835
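Both checkpoint entries above are Git LFS pointer files rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of reading such a pointer, assuming exactly the three-line layout shown in this diff (the helper name `parse_lfs_pointer` is hypothetical):

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, object]:
    """Split each 'key value' line, then separate the hash algorithm from the digest."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer content copied from the embedding_model.ckpt hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b88a7bd3146689fc2837f9911fb88a40fd30d4c71ba65c461c7532a49f21080f
size 83310835
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

Read this way, the embedding model is roughly 83 MB while the classifier head is only about 35 KB.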
hyperparams.yaml ADDED
@@ -0,0 +1,52 @@
+ # ############################################################################
+ # Model: ECAPA-TDNN for Language Identification
+ # ############################################################################
+
+ # Pretrain folder (HuggingFace)
+ pretrained_path: speechbrain/lang-id-commonlanguage_ecapa
+
+ # Feature parameters
+ n_mels: 80
+
+ # Output parameters
+ out_n_neurons: 45 # Possible languages in the dataset
+
+
+ # Model params
+ compute_features: !new:speechbrain.lobes.features.Fbank
+     n_mels: !ref <n_mels>
+
+ mean_var_norm: !new:speechbrain.processing.features.InputNormalization
+     norm_type: sentence
+     std_norm: False
+
+ embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
+     input_size: !ref <n_mels>
+     channels: [1024, 1024, 1024, 1024, 3072]
+     kernel_sizes: [5, 3, 3, 1, 1]
+     dilations: [1, 2, 3, 4, 1]
+     attention_channels: 128
+     lin_neurons: 192
+
+ classifier: !new:speechbrain.lobes.models.ECAPA_TDNN.Classifier
+     input_size: 192
+     out_neurons: !ref <out_n_neurons>
+
+ modules:
+     compute_features: !ref <compute_features>
+     mean_var_norm: !ref <mean_var_norm>
+     embedding_model: !ref <embedding_model>
+     classifier: !ref <classifier>
+
+ label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder
+
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         embedding_model: !ref <embedding_model>
+         classifier: !ref <classifier>
+         label_encoder: !ref <label_encoder>
+     paths:
+         embedding_model: !ref <pretrained_path>/embedding_model.ckpt
+         classifier: !ref <pretrained_path>/classifier.ckpt
+         label_encoder: !ref <pretrained_path>/label_encoder.txt
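The `paths` entries above rely on `!ref <pretrained_path>/...` placeholders, which the YAML loader expands using other keys in the same file. The sketch below mimics only the string-interpolation part with the standard library, to show which locations the pretrainer would be pointed at. It is a simplification: the real loader (HyperPyYAML) also preserves types and instantiates objects, and the helper name `resolve_refs` is hypothetical.

```python
import re
from typing import Dict


def resolve_refs(template: str, values: Dict[str, object]) -> str:
    """Replace every <key> placeholder with str(values[key])."""
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]), template)


# Scalar keys copied from the hyperparams.yaml hunk above.
values = {
    "pretrained_path": "speechbrain/lang-id-commonlanguage_ecapa",
    "n_mels": 80,
}

# The `paths` block of the pretrainer, with the !ref tags dropped.
paths = {
    "embedding_model": "<pretrained_path>/embedding_model.ckpt",
    "classifier": "<pretrained_path>/classifier.ckpt",
    "label_encoder": "<pretrained_path>/label_encoder.txt",
}

resolved = {name: resolve_refs(t, values) for name, t in paths.items()}
for name, path in resolved.items():
    print(f"{name}: {path}")
```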
language_encoder.txt ADDED
@@ -0,0 +1,47 @@
+ 'Arabic' => 0
+ 'Portuguese' => 1
+ 'Romansh_Sursilvan' => 2
+ 'Japanese' => 3
+ 'Ukranian' => 4
+ 'German' => 5
+ 'Chinese_China' => 6
+ 'Welsh' => 7
+ 'English' => 8
+ 'Kabyle' => 9
+ 'Kyrgyz' => 10
+ 'Georgian' => 11
+ 'Persian' => 12
+ 'French' => 13
+ 'Interlingua' => 14
+ 'Swedish' => 15
+ 'Spanish' => 16
+ 'Dhivehi' => 17
+ 'Kinyarwanda' => 18
+ 'Tatar' => 19
+ 'Hakha_Chin' => 20
+ 'Tamil' => 21
+ 'Greek' => 22
+ 'Latvian' => 23
+ 'Russian' => 24
+ 'Breton' => 25
+ 'Catalan' => 26
+ 'Maltese' => 27
+ 'Slovenian' => 28
+ 'Indonesian' => 29
+ 'Dutch' => 30
+ 'Chinese_Taiwan' => 31
+ 'Sakha' => 32
+ 'Polish' => 33
+ 'Czech' => 34
+ 'Romanian' => 35
+ 'Mangolian' => 36
+ 'Italian' => 37
+ 'Chinese_Hongkong' => 38
+ 'Estonian' => 39
+ 'Basque' => 40
+ 'Esperanto' => 41
+ 'Frisian' => 42
+ 'Turkish' => 43
+ 'Chuvash' => 44
+ ================
+ 'starting_index' => 0
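The label file above is plain text: one `'label' => index` pair per line, a separator line of `=` characters, then extra settings such as `starting_index`. A small sketch of reading that layout with the standard library follows. It assumes only the format visible in this diff; the function name `parse_label_file` is hypothetical and this is not the loader SpeechBrain itself uses.

```python
import ast
from typing import Dict, Tuple


def parse_label_file(text: str) -> Tuple[Dict[str, int], Dict[str, object]]:
    """Return (label -> index, extra settings) from the text layout above."""
    labels: Dict[str, int] = {}
    extras: Dict[str, object] = {}
    target = labels
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if set(line) == {"="}:
            # Everything after the separator is metadata, not a label.
            target = extras
            continue
        key, _, value = line.partition(" => ")
        # Both sides are Python literals: a quoted string and an integer.
        target[ast.literal_eval(key)] = ast.literal_eval(value)
    return labels, extras


# First three entries and the trailer, copied from the file above.
SAMPLE = """\
'Arabic' => 0
'Portuguese' => 1
'Romansh_Sursilvan' => 2
================
'starting_index' => 0
"""

labels, extras = parse_label_file(SAMPLE)
index_to_label = {i: name for name, i in labels.items()}
print(index_to_label[1])
```

Inverting the mapping, as in the last lines, is what turns a predicted class index back into a readable label.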