AkshaySg committed on
Commit
efcf093
1 Parent(s): 237d457

Initial Commit

Files changed (7)
  1. .gitattributes +9 -19
  2. README.md +225 -0
  3. classifier.ckpt +3 -0
  4. embedding_model.ckpt +3 -0
  5. hyperparams.yaml +52 -0
  6. label_encoder.txt +109 -0
  7. normalizer.ckpt +3 -0
.gitattributes CHANGED
@@ -1,27 +1,17 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
  *.bin.* filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
  *.h5 filter=lfs diff=lfs merge=lfs -text
  *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
  *.model filter=lfs diff=lfs merge=lfs -text
  *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zstandard filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
  *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
  *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
  *.joblib filter=lfs diff=lfs merge=lfs -text
  *.model filter=lfs diff=lfs merge=lfs -text
  *.msgpack filter=lfs diff=lfs merge=lfs -text
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,225 @@
+ ---
+ language: multilingual
+ thumbnail:
+ tags:
+ - audio-classification
+ - speechbrain
+ - embeddings
+ - Language
+ - Identification
+ - pytorch
+ - ECAPA-TDNN
+ - TDNN
+ - VoxLingua107
+ license: "apache-2.0"
+ datasets:
+ - VoxLingua107
+ metrics:
+ - Accuracy
+ widget:
+ - label: English Sample
+   src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
+ ---
+
+ # VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
+
+ ## Model description
+
+ This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
+ The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition.
+
+ The model can classify a speech utterance according to the language spoken.
+ It covers 107 different languages (
+ Abkhazian,
+ Afrikaans,
+ Amharic,
+ Arabic,
+ Assamese,
+ Azerbaijani,
+ Bashkir,
+ Belarusian,
+ Bulgarian,
+ Bengali,
+ Tibetan,
+ Breton,
+ Bosnian,
+ Catalan,
+ Cebuano,
+ Czech,
+ Welsh,
+ Danish,
+ German,
+ Greek,
+ English,
+ Esperanto,
+ Spanish,
+ Estonian,
+ Basque,
+ Persian,
+ Finnish,
+ Faroese,
+ French,
+ Galician,
+ Guarani,
+ Gujarati,
+ Manx,
+ Hausa,
+ Hawaiian,
+ Hindi,
+ Croatian,
+ Haitian,
+ Hungarian,
+ Armenian,
+ Interlingua,
+ Indonesian,
+ Icelandic,
+ Italian,
+ Hebrew,
+ Japanese,
+ Javanese,
+ Georgian,
+ Kazakh,
+ Central Khmer,
+ Kannada,
+ Korean,
+ Latin,
+ Luxembourgish,
+ Lingala,
+ Lao,
+ Lithuanian,
+ Latvian,
+ Malagasy,
+ Maori,
+ Macedonian,
+ Malayalam,
+ Mongolian,
+ Marathi,
+ Malay,
+ Maltese,
+ Burmese,
+ Nepali,
+ Dutch,
+ Norwegian Nynorsk,
+ Norwegian,
+ Occitan,
+ Panjabi,
+ Polish,
+ Pushto,
+ Portuguese,
+ Romanian,
+ Russian,
+ Sanskrit,
+ Scots,
+ Sindhi,
+ Sinhala,
+ Slovak,
+ Slovenian,
+ Shona,
+ Somali,
+ Albanian,
+ Serbian,
+ Sundanese,
+ Swedish,
+ Swahili,
+ Tamil,
+ Telugu,
+ Tajik,
+ Thai,
+ Turkmen,
+ Tagalog,
+ Turkish,
+ Tatar,
+ Ukrainian,
+ Urdu,
+ Uzbek,
+ Vietnamese,
+ Waray,
+ Yiddish,
+ Yoruba,
+ Mandarin Chinese).
+
+ ## Intended uses & limitations
+
+ The model has two uses:
+
+ - use it 'as is' for spoken language recognition
+ - use it as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data (a sketch of this workflow follows the usage example below)
+
+ The model is trained on automatically collected YouTube data. For more
+ information about the dataset, see [here](http://bark.phon.ioc.ee/voxlingua107/).
+
+
+ #### How to use
+
+ ```python
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+ language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp")
+ # Download a Thai language sample from Omniglot and convert it to a suitable form
+ signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_th.mp3")
+ prediction = language_id.classify_batch(signal)
+ print(prediction)
+ (tensor([[0.3210, 0.3751, 0.3680, 0.3939, 0.4026, 0.3644, 0.3689, 0.3597, 0.3508,
+ 0.3666, 0.3895, 0.3978, 0.3848, 0.3957, 0.3949, 0.3586, 0.4360, 0.3997,
+ 0.4106, 0.3886, 0.4177, 0.3870, 0.3764, 0.3763, 0.3672, 0.4000, 0.4256,
+ 0.4091, 0.3563, 0.3695, 0.3320, 0.3838, 0.3850, 0.3867, 0.3878, 0.3944,
+ 0.3924, 0.4063, 0.3803, 0.3830, 0.2996, 0.4187, 0.3976, 0.3651, 0.3950,
+ 0.3744, 0.4295, 0.3807, 0.3613, 0.4710, 0.3530, 0.4156, 0.3651, 0.3777,
+ 0.3813, 0.6063, 0.3708, 0.3886, 0.3766, 0.4023, 0.3785, 0.3612, 0.4193,
+ 0.3720, 0.4406, 0.3243, 0.3866, 0.3866, 0.4104, 0.4294, 0.4175, 0.3364,
+ 0.3595, 0.3443, 0.3565, 0.3776, 0.3985, 0.3778, 0.2382, 0.4115, 0.4017,
+ 0.4070, 0.3266, 0.3648, 0.3888, 0.3907, 0.3755, 0.3631, 0.4460, 0.3464,
+ 0.3898, 0.3661, 0.3883, 0.3772, 0.9289, 0.3687, 0.4298, 0.4211, 0.3838,
+ 0.3521, 0.3515, 0.3465, 0.4772, 0.4043, 0.3844, 0.3973, 0.4343]]), tensor([0.9289]), tensor([94]), ['th'])
+ # The scores in the prediction[0] tensor can be interpreted as cosine scores between
+ # the languages and the given utterance (i.e., the larger the better)
+ # The identified language ISO code is given in prediction[3]
+ print(prediction[3])
+ ['th']
+
+ # Alternatively, use the utterance embedding extractor:
+ emb = language_id.encode_batch(signal)
+ print(emb.shape)
+ torch.Size([1, 1, 256])
+ ```
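+
+ The second intended use listed above, building a dedicated language ID model on top of the embeddings, could look roughly like the sketch below. It is not part of this repository: the audio file names and labels are hypothetical placeholders, and the small linear head trained with plain PyTorch is only one reasonable choice of downstream classifier.
+
+ ```python
+ # Hedged sketch: extract frozen utterance embeddings and fit a small classifier on them.
+ # File names and labels are hypothetical; any downstream model could be used instead.
+ import torch
+ from speechbrain.pretrained import EncoderClassifier
+
+ language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp")
+
+ # Hypothetical labeled clips for a two-language task (0 = Estonian, 1 = English)
+ files = ["et_clip1.wav", "et_clip2.wav", "en_clip1.wav", "en_clip2.wav"]
+ labels = torch.tensor([0, 0, 1, 1])
+
+ # encode_batch returns [batch, 1, 256] embeddings; stack them into an [N, 256] matrix
+ embeddings = torch.cat(
+     [language_id.encode_batch(language_id.load_audio(f)).squeeze(1) for f in files]
+ ).detach()
+
+ # Train a small linear head on top of the frozen embeddings
+ head = torch.nn.Linear(256, 2)
+ optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
+ loss_fn = torch.nn.CrossEntropyLoss()
+ for _ in range(200):
+     optimizer.zero_grad()
+     loss = loss_fn(head(embeddings), labels)
+     loss.backward()
+     optimizer.step()
+ ```
+
+ Keeping the ECAPA-TDNN embedding extractor frozen and training only the small head keeps the data and compute requirements of such a dedicated model modest.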
+
+ #### Limitations and bias
+
+ Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
+
+ - Its accuracy on smaller languages is probably quite limited
+ - It probably works worse on female speech than on male speech (because the YouTube data contains much more male speech)
+ - Based on subjective experiments, it doesn't work well on speech with a foreign accent
+ - It probably doesn't work well on children's speech or on speakers with speech disorders
+
+
+ ## Training data
+
+ The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
+
+ VoxLingua107 is a speech dataset for training spoken language identification models.
+ The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
+
+ VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
+ The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.
+
+ ## Training procedure
+
+ We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
+ The training recipe will be published soon.
+
+ ## Evaluation results
+
+ Error rate: 7% on the development dataset
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{valk2021slt,
+   title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
+   author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
+   booktitle={Proc. IEEE SLT Workshop},
+   year={2021},
+ }
+ ```
classifier.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a70783704ef67dcccd675185f5fb96652b4d0f01b66f67e16281a2c0b1d62bc5
+ size 110456
embedding_model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e947c296c59f36de13db8b4e5c120dd4d75c2d90e0b6aab3aa86d23c38fc2a8d
+ size 84480206
hyperparams.yaml ADDED
@@ -0,0 +1,52 @@
+ pretrained_path: TalTechNLP/voxlingua107-epaca-tdnn
+
+
+ # Feature parameters
+ n_mels: 60
+ left_frames: 0
+ right_frames: 0
+ deltas: false
+
+ # Number of languages
+ out_n_neurons: 107
+
+ # Functions
+ compute_features: !new:speechbrain.lobes.features.Fbank
+     n_mels: 60
+     left_frames: 0
+     right_frames: 0
+     deltas: false
+
+ embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
+     input_size: 60
+     channels: [1024, 1024, 1024, 1024, 3072]
+     kernel_sizes: [5, 3, 3, 3, 1]
+     dilations: [1, 2, 3, 4, 1]
+     attention_channels: 128
+     lin_neurons: 256
+
+ classifier: !new:speechbrain.lobes.models.ECAPA_TDNN.Classifier
+     input_size: 256
+     out_neurons: !ref <out_n_neurons>
+
+ mean_var_norm: !new:speechbrain.processing.features.InputNormalization
+     norm_type: sentence
+     std_norm: false
+
+ modules:
+     compute_features: !ref <compute_features>
+     mean_var_norm: !ref <mean_var_norm>
+     embedding_model: !ref <embedding_model>
+     classifier: !ref <classifier>
+
+ label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         embedding_model: !ref <embedding_model>
+         classifier: !ref <classifier>
+         label_encoder: !ref <label_encoder>
+     paths:
+         embedding_model: !ref <pretrained_path>/embedding_model.ckpt
+         classifier: !ref <pretrained_path>/classifier.ckpt
+         label_encoder: !ref <pretrained_path>/label_encoder.txt
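The hyperparams.yaml above wires together the Fbank feature extractor, the ECAPA-TDNN embedding model, the classifier head, and a Pretrainer that points each module at its checkpoint in this repository. As a hedged sketch of how such a file is typically consumed (`EncoderClassifier.from_hparams` performs roughly these steps for you, so this is illustrative rather than code shipped with the repo):

```python
# Hedged sketch (not part of this repository): load the hyperparams file with
# HyperPyYAML and pull the pretrained weights through the declared Pretrainer.
from hyperpyyaml import load_hyperpyyaml

with open("hyperparams.yaml") as f:
    hparams = load_hyperpyyaml(f)  # instantiates Fbank, ECAPA_TDNN, Classifier, Pretrainer, ...

pretrainer = hparams["pretrainer"]
pretrainer.collect_files()   # fetch embedding_model.ckpt, classifier.ckpt, label_encoder.txt
pretrainer.load_collected()  # load the collected parameters into the instantiated modules

embedding_model = hparams["embedding_model"]  # ECAPA-TDNN embedding extractor
classifier = hparams["classifier"]            # 107-way language classification head
```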
label_encoder.txt ADDED
@@ -0,0 +1,109 @@
+ 'ab' => 0
+ 'af' => 1
+ 'am' => 2
+ 'ar' => 3
+ 'as' => 4
+ 'az' => 5
+ 'ba' => 6
+ 'be' => 7
+ 'bg' => 8
+ 'bn' => 9
+ 'bo' => 10
+ 'br' => 11
+ 'bs' => 12
+ 'ca' => 13
+ 'ceb' => 14
+ 'cs' => 15
+ 'cy' => 16
+ 'da' => 17
+ 'de' => 18
+ 'el' => 19
+ 'en' => 20
+ 'eo' => 21
+ 'es' => 22
+ 'et' => 23
+ 'eu' => 24
+ 'fa' => 25
+ 'fi' => 26
+ 'fo' => 27
+ 'fr' => 28
+ 'gl' => 29
+ 'gn' => 30
+ 'gu' => 31
+ 'gv' => 32
+ 'ha' => 33
+ 'haw' => 34
+ 'hi' => 35
+ 'hr' => 36
+ 'ht' => 37
+ 'hu' => 38
+ 'hy' => 39
+ 'ia' => 40
+ 'id' => 41
+ 'is' => 42
+ 'it' => 43
+ 'iw' => 44
+ 'ja' => 45
+ 'jw' => 46
+ 'ka' => 47
+ 'kk' => 48
+ 'km' => 49
+ 'kn' => 50
+ 'ko' => 51
+ 'la' => 52
+ 'lb' => 53
+ 'ln' => 54
+ 'lo' => 55
+ 'lt' => 56
+ 'lv' => 57
+ 'mg' => 58
+ 'mi' => 59
+ 'mk' => 60
+ 'ml' => 61
+ 'mn' => 62
+ 'mr' => 63
+ 'ms' => 64
+ 'mt' => 65
+ 'my' => 66
+ 'ne' => 67
+ 'nl' => 68
+ 'nn' => 69
+ 'no' => 70
+ 'oc' => 71
+ 'pa' => 72
+ 'pl' => 73
+ 'ps' => 74
+ 'pt' => 75
+ 'ro' => 76
+ 'ru' => 77
+ 'sa' => 78
+ 'sco' => 79
+ 'sd' => 80
+ 'si' => 81
+ 'sk' => 82
+ 'sl' => 83
+ 'sn' => 84
+ 'so' => 85
+ 'sq' => 86
+ 'sr' => 87
+ 'su' => 88
+ 'sv' => 89
+ 'sw' => 90
+ 'ta' => 91
+ 'te' => 92
+ 'tg' => 93
+ 'th' => 94
+ 'tk' => 95
+ 'tl' => 96
+ 'tr' => 97
+ 'tt' => 98
+ 'uk' => 99
+ 'ur' => 100
+ 'uz' => 101
+ 'vi' => 102
+ 'war' => 103
+ 'yi' => 104
+ 'yo' => 105
+ 'zh' => 106
+ ================
+ 'starting_index' => 0
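label_encoder.txt maps each of the 107 ISO language codes to the output index used by the classifier; the index 94 in the README example resolves to 'th' through this table. A minimal, hedged sketch of reading the mapping back in plain Python (SpeechBrain's CategoricalEncoder normally handles this):

```python
# Hedged sketch: parse label_encoder.txt ("'xx' => index" lines) into an
# index -> ISO-code dictionary, stopping at the "====" separator before the metadata.
def load_label_map(path="label_encoder.txt"):
    index_to_code = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("="):  # separator before metadata such as 'starting_index'
                break
            code, index = line.split("=>")
            index_to_code[int(index)] = code.strip().strip("'")
    return index_to_code

index_to_code = load_label_map()
print(index_to_code[94])  # prints: th (matching the Thai example in the README)
```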
normalizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:99327453c38bd629b7479ea440b8efa59332d636555fa6738f1d3e360d6cad28
+ size 1153