Commit 63e7c8d
Author: Alex Gichamba
Parent(s): 99edf11

Add model files
README.md ADDED
---
tags:
- espnet
- audio
- speaker-recognition
language: multilingual
datasets:
- voxceleb
- librispeech
- commonvoice
license: cc-by-4.0
---

## ESPnet2 SPK model

### `espnet/voxcelebs12devs_librispeech_cv16fa_rawnet3`

This model was trained by Alexgichamba using the sdsv21 recipe in [espnet](https://github.com/espnet/espnet/).

### Demo: How to use in ESPnet2

Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
if you haven't done that already.

```bash
cd espnet
git checkout 7ffe306553b905c97948a1d4926132000ee2e1be
pip install -e .
cd egs2/voxceleb/spk1
./run.sh --skip_data_prep false --skip_train true --download_model espnet/voxcelebs12devs_librispeech_cv16fa_rawnet3
```
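For scoring a single trial outside the recipe, the sketch below is a hypothetical example rather than a documented snippet from this card: it assumes the `Speech2Embedding` interface in `espnet2.bin.spk_inference` is available in this ESPnet revision and accepts a Hugging Face model tag, and the wav file names are placeholders.

```python
# Hypothetical sketch: extract speaker embeddings and score one verification
# trial with cosine similarity. Speech2Embedding and its from_pretrained
# signature are assumptions about this ESPnet revision; "enroll.wav" and
# "test.wav" are placeholder 16 kHz recordings.
import soundfile as sf
import torch
from espnet2.bin.spk_inference import Speech2Embedding

speech2embed = Speech2Embedding.from_pretrained(
    model_tag="espnet/voxcelebs12devs_librispeech_cv16fa_rawnet3"
)

enroll, _ = sf.read("enroll.wav")  # enrollment utterance
test, _ = sf.read("test.wav")      # test utterance

# Embeddings are assumed to come back as (1, 192) tensors.
e1 = torch.as_tensor(speech2embed(enroll)).reshape(1, -1)
e2 = torch.as_tensor(speech2embed(test)).reshape(1, -1)

# Higher cosine similarity -> more likely the same speaker; the accept/reject
# threshold would normally be tuned on a development trial list.
score = torch.nn.functional.cosine_similarity(e1, e2).item()
print(f"cosine score: {score:.4f}")
```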
<!-- Generated by scripts/utils/show_spk_result.py -->
# RESULTS
## Environments
date: 2024-02-08 08:51:46.908694

- python version: 3.8.6 (default, Dec 17 2020, 16:57:01) [GCC 10.2.0]
- espnet version: 202310
- pytorch version: 2.0.1+cu118

## Test set: Vox1-O

| Model (conf name) | EER(%) | minDCF |
|---|---|---|
| [conf/train_rawnet3.yaml](conf/train_rawnet3.yaml) | 1.229 | 0.08033 |

## Test set: Sample DeepMine

| Model (conf name) | EER(%) | minDCF |
|---|---|---|
| [conf/train_rawnet3.yaml](conf/train_rawnet3.yaml) | 4.640 | 0.25994 |

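EER and minDCF in these tables are threshold-based detection metrics computed over the trial scores. The sketch below shows one common way to compute them; the cost parameters (`p_target`, `c_miss`, `c_fa`) are illustrative assumptions, not necessarily the exact values used by the recipe's scoring script.

```python
# Illustrative EER / minDCF computation from verification trial scores.
import numpy as np

def eer_and_mindcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """scores: similarity scores; labels: 1 for target (same-speaker) trials, 0 otherwise."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))          # sweep every score as a threshold
    n_tgt = max(labels.sum(), 1)
    n_non = max((1 - labels).sum(), 1)
    # Miss rate: targets rejected (score below threshold); false-alarm rate: non-targets accepted.
    fnr = np.array([(labels[scores < t]).sum() / n_tgt for t in thresholds])
    fpr = np.array([((1 - labels)[scores >= t]).sum() / n_non for t in thresholds])
    # EER: operating point where miss and false-alarm rates cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fnr[idx] + fpr[idx]) / 2
    # minDCF: minimum normalized detection cost over all thresholds.
    dcf = c_miss * fnr * p_target + c_fa * fpr * (1 - p_target)
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
    return eer, min_dcf

# Toy example (not real trial data):
eer, mindcf = eer_and_mindcf([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(f"EER={eer:.3%}, minDCF={mindcf:.4f}")
```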
## SPK config

<details><summary>expand</summary>

```
config: conf/train_rawnet3.yaml
print_config: false
log_level: INFO
drop_last_iter: true
dry_run: false
iterator_type: category
valid_iterator_type: sequence
output_dir: exp/spk_train_rawnet3_raw_sp
ngpu: 1
seed: 0
num_workers: 2
num_att_plot: 0
dist_backend: nccl
dist_init_method: env://
dist_world_size: 4
dist_rank: 0
local_rank: 0
dist_master_addr: localhost
dist_master_port: 37073
dist_launcher: null
multiprocessing_distributed: true
unused_parameters: false
sharded_ddp: false
cudnn_enabled: true
cudnn_benchmark: true
cudnn_deterministic: false
collect_stats: false
write_collected_feats: false
max_epoch: 40
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
- - valid
  - eer
  - min
keep_nbest_models: 3
nbest_averaging_interval: 0
grad_clip: 9999
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: 100
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
use_lora: false
save_lora_only: true
lora_conf: {}
pretrain_path: null
init_param: []
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: null
batch_size: 128
valid_batch_size: 40
batch_bins: 1000000
valid_batch_bins: null
train_shape_file:
- exp/spk_stats_16k_sp/train/speech_shape
valid_shape_file:
- exp/spk_stats_16k_sp/valid/speech_shape
batch_type: folded
valid_batch_type: null
fold_length:
- 120000
sort_in_batch: descending
shuffle_within_batch: false
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
chunk_excluded_key_prefixes: []
chunk_default_fs: null
train_data_path_and_name_and_type:
- - dump/raw/combined_train_set_sp/wav.scp
  - speech
  - sound
- - dump/raw/combined_train_set_sp/utt2spk
  - spk_labels
  - text
valid_data_path_and_name_and_type:
- - dump/raw/voxceleb1_test/trial.scp
  - speech
  - sound
- - dump/raw/voxceleb1_test/trial2.scp
  - speech2
  - sound
- - dump/raw/voxceleb1_test/trial_label
  - spk_labels
  - text
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
allow_multi_rates: false
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adam
optim_conf:
  lr: 0.001
  weight_decay: 5.0e-05
  amsgrad: false
scheduler: cosineannealingwarmuprestarts
scheduler_conf:
  first_cycle_steps: 158760
  cycle_mult: 1.0
  max_lr: 0.001
  min_lr: 5.0e-06
  warmup_steps: 1000
  gamma: 0.75
init: null
use_preprocessor: true
input_size: null
target_duration: 3.0
spk2utt: dump/raw/combined_train_set_sp/spk2utt
spk_num: 37485
sample_rate: 16000
num_eval: 10
rir_scp: ''
model_conf:
  extract_feats_in_collect_stats: false
frontend: asteroid_frontend
frontend_conf:
  sinc_stride: 16
  sinc_kernel_size: 251
  sinc_filters: 256
  preemph_coef: 0.97
  log_term: 1.0e-06
specaug: null
specaug_conf: {}
normalize: null
normalize_conf: {}
encoder: rawnet3
encoder_conf:
  model_scale: 8
  ndim: 1024
  output_size: 1536
pooling: chn_attn_stat
pooling_conf: {}
projector: rawnet3
projector_conf:
  output_size: 192
preprocessor: spk
preprocessor_conf:
  target_duration: 3.0
  sample_rate: 16000
  num_eval: 5
  noise_apply_prob: 0.5
  noise_info:
  - - 1.0
    - dump/raw/musan_speech.scp
    - - 4
      - 7
    - - 13
      - 20
  - - 1.0
    - dump/raw/musan_noise.scp
    - - 1
      - 1
    - - 0
      - 15
  - - 1.0
    - dump/raw/musan_music.scp
    - - 1
      - 1
    - - 5
      - 15
  rir_apply_prob: 0.5
  rir_scp: dump/raw/rirs.scp
loss: aamsoftmax_sc_topk
loss_conf:
  margin: 0.3
  scale: 30
  K: 3
  mp: 0.06
  k_top: 5
required:
- output_dir
version: '202310'
distributed: true
```

</details>
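The training objective above, `aamsoftmax_sc_topk` (margin 0.3, scale 30), is an additive-angular-margin softmax with sub-centers (`K`) and a top-k inter-class penalty (`mp`, `k_top`). As a rough illustration of the core margin mechanism only, and not of the sub-center/top-k variant actually used by the recipe, here is a minimal PyTorch sketch:

```python
# Minimal AAM-softmax (ArcFace-style) sketch, for illustration only.
# The recipe's actual loss, aamsoftmax_sc_topk, adds sub-centers and a
# top-k inter-class penalty on top of this basic margin idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, num_classes=37485, margin=0.3, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logit, then rescale.
        target_logit = torch.cos(theta + self.m)
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.s * (one_hot * target_logit + (1 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)

# Toy usage; dimensions follow the config (192-d embeddings, 37485 speakers).
loss_fn = AAMSoftmax()
loss = loss_fn(torch.randn(4, 192), torch.randint(0, 37485, (4,)))
print(loss.item())
```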

### Citing ESPnet

```bibtex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```

or arXiv:

```bibtex
@misc{watanabe2018espnet,
  title={ESPnet: End-to-End Speech Processing Toolkit},
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  year={2018},
  eprint={1804.00015},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
exp/spk_train_rawnet3_raw_sp/11epoch.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cefa319d1adf48f525037bd9d101ca9f49ace535c90f7b656f5f4c1ba3242bf9
size 150867647
exp/spk_train_rawnet3_raw_sp/RESULTS.md ADDED
<!-- Generated by scripts/utils/show_spk_result.py -->
# RESULTS
## Environments
date: 2024-02-08 08:51:46.908694

- python version: 3.8.6 (default, Dec 17 2020, 16:57:01) [GCC 10.2.0]
- espnet version: 202310
- pytorch version: 2.0.1+cu118

## Test set: Vox1-O

| Model (conf name) | EER(%) | minDCF |
|---|---|---|
| [conf/train_rawnet3.yaml](conf/train_rawnet3.yaml) | 1.229 | 0.08033 |

## Test set: Sample DeepMine

| Model (conf name) | EER(%) | minDCF |
|---|---|---|
| [conf/train_rawnet3.yaml](conf/train_rawnet3.yaml) | 4.640 | 0.25994 |
exp/spk_train_rawnet3_raw_sp/config.yaml ADDED
config: conf/train_rawnet3.yaml
print_config: false
log_level: INFO
drop_last_iter: true
dry_run: false
iterator_type: category
valid_iterator_type: sequence
output_dir: exp/spk_train_rawnet3_raw_sp
ngpu: 1
seed: 0
num_workers: 2
num_att_plot: 0
dist_backend: nccl
dist_init_method: env://
dist_world_size: 4
dist_rank: 0
local_rank: 0
dist_master_addr: localhost
dist_master_port: 37073
dist_launcher: null
multiprocessing_distributed: true
unused_parameters: false
sharded_ddp: false
cudnn_enabled: true
cudnn_benchmark: true
cudnn_deterministic: false
collect_stats: false
write_collected_feats: false
max_epoch: 40
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
- - valid
  - eer
  - min
keep_nbest_models: 3
nbest_averaging_interval: 0
grad_clip: 9999
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: 100
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
use_lora: false
save_lora_only: true
lora_conf: {}
pretrain_path: null
init_param: []
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: null
batch_size: 128
valid_batch_size: 40
batch_bins: 1000000
valid_batch_bins: null
train_shape_file:
- exp/spk_stats_16k_sp/train/speech_shape
valid_shape_file:
- exp/spk_stats_16k_sp/valid/speech_shape
batch_type: folded
valid_batch_type: null
fold_length:
- 120000
sort_in_batch: descending
shuffle_within_batch: false
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
chunk_excluded_key_prefixes: []
chunk_default_fs: null
train_data_path_and_name_and_type:
- - dump/raw/combined_train_set_sp/wav.scp
  - speech
  - sound
- - dump/raw/combined_train_set_sp/utt2spk
  - spk_labels
  - text
valid_data_path_and_name_and_type:
- - dump/raw/voxceleb1_test/trial.scp
  - speech
  - sound
- - dump/raw/voxceleb1_test/trial2.scp
  - speech2
  - sound
- - dump/raw/voxceleb1_test/trial_label
  - spk_labels
  - text
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
allow_multi_rates: false
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adam
optim_conf:
  lr: 0.001
  weight_decay: 5.0e-05
  amsgrad: false
scheduler: cosineannealingwarmuprestarts
scheduler_conf:
  first_cycle_steps: 158760
  cycle_mult: 1.0
  max_lr: 0.001
  min_lr: 5.0e-06
  warmup_steps: 1000
  gamma: 0.75
init: null
use_preprocessor: true
input_size: null
target_duration: 3.0
spk2utt: dump/raw/combined_train_set_sp/spk2utt
spk_num: 37485
sample_rate: 16000
num_eval: 10
rir_scp: ''
model_conf:
  extract_feats_in_collect_stats: false
frontend: asteroid_frontend
frontend_conf:
  sinc_stride: 16
  sinc_kernel_size: 251
  sinc_filters: 256
  preemph_coef: 0.97
  log_term: 1.0e-06
specaug: null
specaug_conf: {}
normalize: null
normalize_conf: {}
encoder: rawnet3
encoder_conf:
  model_scale: 8
  ndim: 1024
  output_size: 1536
pooling: chn_attn_stat
pooling_conf: {}
projector: rawnet3
projector_conf:
  output_size: 192
preprocessor: spk
preprocessor_conf:
  target_duration: 3.0
  sample_rate: 16000
  num_eval: 5
  noise_apply_prob: 0.5
  noise_info:
  - - 1.0
    - dump/raw/musan_speech.scp
    - - 4
      - 7
    - - 13
      - 20
  - - 1.0
    - dump/raw/musan_noise.scp
    - - 1
      - 1
    - - 0
      - 15
  - - 1.0
    - dump/raw/musan_music.scp
    - - 1
      - 1
    - - 5
      - 15
  rir_apply_prob: 0.5
  rir_scp: dump/raw/rirs.scp
loss: aamsoftmax_sc_topk
loss_conf:
  margin: 0.3
  scale: 30
  K: 3
  mp: 0.06
  k_top: 5
required:
- output_dir
version: '202310'
distributed: true
exp/spk_train_rawnet3_raw_sp/images/backward_time.png ADDED
exp/spk_train_rawnet3_raw_sp/images/clip.png ADDED
exp/spk_train_rawnet3_raw_sp/images/eer.png ADDED
exp/spk_train_rawnet3_raw_sp/images/forward_time.png ADDED
exp/spk_train_rawnet3_raw_sp/images/gpu_max_cached_mem_GB.png ADDED
exp/spk_train_rawnet3_raw_sp/images/grad_norm.png ADDED
exp/spk_train_rawnet3_raw_sp/images/iter_time.png ADDED
exp/spk_train_rawnet3_raw_sp/images/loss.png ADDED
exp/spk_train_rawnet3_raw_sp/images/loss_scale.png ADDED
exp/spk_train_rawnet3_raw_sp/images/mindcf.png ADDED
exp/spk_train_rawnet3_raw_sp/images/n_trials.png ADDED
exp/spk_train_rawnet3_raw_sp/images/nontrg_mean.png ADDED
exp/spk_train_rawnet3_raw_sp/images/nontrg_std.png ADDED
exp/spk_train_rawnet3_raw_sp/images/optim0_lr0.png ADDED
exp/spk_train_rawnet3_raw_sp/images/optim_step_time.png ADDED
exp/spk_train_rawnet3_raw_sp/images/train_time.png ADDED
exp/spk_train_rawnet3_raw_sp/images/trg_mean.png ADDED
exp/spk_train_rawnet3_raw_sp/images/trg_std.png ADDED
meta.yaml ADDED
espnet: '202310'
files:
  model_file: exp/spk_train_rawnet3_raw_sp/11epoch.pth
python: "3.8.6 (default, Dec 17 2020, 16:57:01) \n [GCC 10.2.0]"
timestamp: 1707382306.908694
torch: 2.0.1
yaml_files:
  train_config: exp/spk_train_rawnet3_raw_sp/config.yaml