ESPnet
multilingual
audio
speaker-recognition
jungjee committed on
Commit
cc833f5
1 Parent(s): 79c7526

Update model

README.md ADDED
@@ -0,0 +1,289 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - speaker-recognition
+ language: multilingual
+ datasets:
+ - voxblink
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 SPK model
+
+ ### `espnet/voxblinkclean_rawnet3`
+
+ This model was trained by Jungjee using the voxblink recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 6a0be4d44b892a683c2d617039c7f23e824a9296
+ pip install -e .
+ cd egs2/voxblink/spk1
+ ./run.sh --skip_data_prep false --skip_train true --download_model espnet/voxblinkclean_rawnet3
+ ```
+
+ <!-- Generated by scripts/utils/show_spk_result.py -->
+ # RESULTS
+ ## Environments
+ date: 2024-01-03 18:56:30.429852
+
+ - python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0]
+ - espnet version: 202310
+ - pytorch version: 1.13.1
+
+ | | Mean | Std |
+ |---|---|---|
+ | Target | 5.2187 | 4.6926 |
+ | Non-target | 2.5139 | 2.5139 |
+
+ | Model name | EER(%) | minDCF |
+ |---|---|---|
+ | conf/tuning/train_rawnet3_vbClean | 2.516 | 0.18585 |
+
+ ## SPK config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: conf/tuning/train_rawnet3_vbClean.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: true
+ dry_run: false
+ iterator_type: category
+ valid_iterator_type: sequence
+ output_dir: exp/spk_train_rawnet3_vbClean_raw_sp
+ ngpu: 1
+ seed: 0
+ num_workers: 6
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 4
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 52559
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: false
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: true
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 40
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - eer
+     - min
+ keep_nbest_models: 3
+ nbest_averaging_interval: 0
+ grad_clip: 9999
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: true
+ log_interval: 100
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_lora: false
+ save_lora_only: true
+ lora_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: null
+ batch_size: 512
+ valid_batch_size: 40
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/spk_stats_16k_sp/train/speech_shape
+ valid_shape_file:
+ - exp/spk_stats_16k_sp/valid/speech_shape
+ batch_type: folded
+ valid_batch_type: null
+ fold_length:
+ - 120000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/voxblink_clean_sp/wav.scp
+     - speech
+     - sound
+ -   - dump/raw/voxblink_clean_sp/utt2spk
+     - spk_labels
+     - text
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/voxceleb1_test/trial.scp
+     - speech
+     - sound
+ -   - dump/raw/voxceleb1_test/trial2.scp
+     - speech2
+     - sound
+ -   - dump/raw/voxceleb1_test/trial_label
+     - spk_labels
+     - text
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.001
+     weight_decay: 5.0e-05
+     amsgrad: false
+ scheduler: cosineannealingwarmuprestarts
+ scheduler_conf:
+     first_cycle_steps: 59480
+     cycle_mult: 1.0
+     max_lr: 0.001
+     min_lr: 5.0e-06
+     warmup_steps: 1000
+     gamma: 0.75
+ init: null
+ use_preprocessor: true
+ input_size: null
+ target_duration: 3.0
+ spk2utt: dump/raw/voxblink_clean_sp/spk2utt
+ spk_num: 55143
+ sample_rate: 16000
+ num_eval: 10
+ rir_scp: ''
+ model_conf:
+     extract_feats_in_collect_stats: false
+ frontend: asteroid_frontend
+ frontend_conf:
+     sinc_stride: 16
+     sinc_kernel_size: 251
+     sinc_filters: 256
+     preemph_coef: 0.97
+     log_term: 1.0e-06
+ specaug: null
+ specaug_conf: {}
+ normalize: null
+ normalize_conf: {}
+ encoder: rawnet3
+ encoder_conf:
+     model_scale: 8
+     ndim: 1024
+     output_size: 1536
+ pooling: chn_attn_stat
+ pooling_conf: {}
+ projector: rawnet3
+ projector_conf:
+     output_size: 192
+ preprocessor: spk
+ preprocessor_conf:
+     target_duration: 3.0
+     sample_rate: 16000
+     num_eval: 5
+     noise_apply_prob: 0.5
+     noise_info:
+     -   - 1.0
+         - dump/raw/musan_speech.scp
+         -   - 4
+             - 7
+         -   - 13
+             - 20
+     -   - 1.0
+         - dump/raw/musan_noise.scp
+         -   - 1
+             - 1
+         -   - 0
+             - 15
+     -   - 1.0
+         - dump/raw/musan_music.scp
+         -   - 1
+             - 1
+         -   - 5
+             - 15
+     rir_apply_prob: 0.5
+     rir_scp: dump/raw/rirs.scp
+ loss: aamsoftmax_sc_topk
+ loss_conf:
+     margin: 0.3
+     scale: 30
+     K: 3
+     mp: 0.06
+     k_top: 5
+ required:
+ - output_dir
+ version: '202310'
+ distributed: true
+ ```
+
+ </details>
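The `loss_conf` above sets `margin: 0.3` and `scale: 30` for the `aamsoftmax_sc_topk` loss. As a rough illustration of what those two values do, the following is a minimal sketch of the target-class logit in plain additive angular margin softmax. It assumes the margin term of the recipe's loss has this standard form; the sub-center and top-k options (`K`, `mp`, `k_top`) are not reproduced, and `aam_target_logit` is a hypothetical helper, not an ESPnet function.

```python
import math


def aam_target_logit(cos_theta: float, margin: float = 0.3, scale: float = 30.0) -> float:
    """Scaled logit for the true speaker after adding an angular margin.

    cos_theta is the cosine similarity between the embedding and the
    true speaker's class weight. Defaults mirror loss_conf in the config.
    """
    # Clamp to the valid acos domain to guard against rounding error.
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    # The margin widens the angle, so the true class gets a lower logit
    # and the network must separate speakers more strongly.
    return scale * math.cos(theta + margin)


# Illustrative value only: a cosine similarity of 0.8 to the true class.
plain = 30.0 * 0.8                   # logit without any margin
with_margin = aam_target_logit(0.8)  # logit with the 0.3 rad margin
print(f"plain={plain:.2f} with_margin={with_margin:.2f}")
```

The penalised logit is always lower than the plain one for angles where `theta + margin` stays below pi, which is what pushes same-speaker embeddings closer together during training.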
+
+
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   title={{ESPnet}: End-to-End Speech Processing Toolkit},
+   year={2018},
+   booktitle={Proceedings of Interspeech},
+   pages={2207--2211},
+   doi={10.21437/Interspeech.2018-1456},
+   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+
+
+
+
+
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+   title={ESPnet: End-to-End Speech Processing Toolkit},
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   year={2018},
+   eprint={1804.00015},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
exp/spk_train_rawnet3_vbClean_raw_sp/30epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74c2812bb6bfce28d1313d5822668b912ba070d3647cb180c6bca3f5b85c7db0
+ size 191551679
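The three added lines are a Git LFS pointer rather than the checkpoint itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of reading such a pointer follows, assuming the pointer text is already in a string; `parse_lfs_pointer` is a hypothetical helper written for this illustration.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is '<key> <value>'; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash-algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:74c2812bb6bfce28d1313d5822668b912ba070d3647cb180c6bca3f5b85c7db0
size 191551679
"""

info = parse_lfs_pointer(POINTER)
size_mib = info["size_bytes"] / (1024 ** 2)
print(f"{info['hash_algo']} digest, {size_mib:.1f} MiB")
```

The size works out to roughly 183 MiB, which is the download to expect when the actual `30epoch.pth` is fetched.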
exp/spk_train_rawnet3_vbClean_raw_sp/RESULTS.md ADDED
@@ -0,0 +1,17 @@
+ <!-- Generated by scripts/utils/show_spk_result.py -->
+ # RESULTS
+ ## Environments
+ date: 2024-01-03 18:56:30.429852
+
+ - python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0]
+ - espnet version: 202310
+ - pytorch version: 1.13.1
+
+ | | Mean | Std |
+ |---|---|---|
+ | Target | 5.2187 | 4.6926 |
+ | Non-target | 2.5139 | 2.5139 |
+
+ | Model name | EER(%) | minDCF |
+ |---|---|---|
+ | conf/tuning/train_rawnet3_vbClean | 2.516 | 0.18585 |
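The table above reports an equal error rate (EER), the operating point where the false-acceptance rate on non-target trials equals the false-rejection rate on target trials. A minimal sketch of that computation follows. The scores are toy values invented for illustration, not outputs of this model, and `equal_error_rate` is a hypothetical helper rather than the scoring code used by the recipe.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for thr in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores (assumed for illustration): higher means "same speaker".
toy_target = [0.9, 0.8, 0.7, 0.4]
toy_nontarget = [0.6, 0.3, 0.2, 0.1]

eer, thr = equal_error_rate(toy_target, toy_nontarget)
print(f"EER = {eer:.2%} at threshold {thr}")
```

With these toy scores one target and one non-target trial are misclassified at the chosen threshold, giving an EER of 25%.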
exp/spk_train_rawnet3_vbClean_raw_sp/config.yaml ADDED
@@ -0,0 +1,198 @@
+ config: conf/tuning/train_rawnet3_vbClean.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: true
+ dry_run: false
+ iterator_type: category
+ valid_iterator_type: sequence
+ output_dir: exp/spk_train_rawnet3_vbClean_raw_sp
+ ngpu: 1
+ seed: 0
+ num_workers: 6
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 4
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 52559
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: false
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: true
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 40
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - eer
+     - min
+ keep_nbest_models: 3
+ nbest_averaging_interval: 0
+ grad_clip: 9999
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: true
+ log_interval: 100
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_lora: false
+ save_lora_only: true
+ lora_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: null
+ batch_size: 512
+ valid_batch_size: 40
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/spk_stats_16k_sp/train/speech_shape
+ valid_shape_file:
+ - exp/spk_stats_16k_sp/valid/speech_shape
+ batch_type: folded
+ valid_batch_type: null
+ fold_length:
+ - 120000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/voxblink_clean_sp/wav.scp
+     - speech
+     - sound
+ -   - dump/raw/voxblink_clean_sp/utt2spk
+     - spk_labels
+     - text
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/voxceleb1_test/trial.scp
+     - speech
+     - sound
+ -   - dump/raw/voxceleb1_test/trial2.scp
+     - speech2
+     - sound
+ -   - dump/raw/voxceleb1_test/trial_label
+     - spk_labels
+     - text
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.001
+     weight_decay: 5.0e-05
+     amsgrad: false
+ scheduler: cosineannealingwarmuprestarts
+ scheduler_conf:
+     first_cycle_steps: 59480
+     cycle_mult: 1.0
+     max_lr: 0.001
+     min_lr: 5.0e-06
+     warmup_steps: 1000
+     gamma: 0.75
+ init: null
+ use_preprocessor: true
+ input_size: null
+ target_duration: 3.0
+ spk2utt: dump/raw/voxblink_clean_sp/spk2utt
+ spk_num: 55143
+ sample_rate: 16000
+ num_eval: 10
+ rir_scp: ''
+ model_conf:
+     extract_feats_in_collect_stats: false
+ frontend: asteroid_frontend
+ frontend_conf:
+     sinc_stride: 16
+     sinc_kernel_size: 251
+     sinc_filters: 256
+     preemph_coef: 0.97
+     log_term: 1.0e-06
+ specaug: null
+ specaug_conf: {}
+ normalize: null
+ normalize_conf: {}
+ encoder: rawnet3
+ encoder_conf:
+     model_scale: 8
+     ndim: 1024
+     output_size: 1536
+ pooling: chn_attn_stat
+ pooling_conf: {}
+ projector: rawnet3
+ projector_conf:
+     output_size: 192
+ preprocessor: spk
+ preprocessor_conf:
+     target_duration: 3.0
+     sample_rate: 16000
+     num_eval: 5
+     noise_apply_prob: 0.5
+     noise_info:
+     -   - 1.0
+         - dump/raw/musan_speech.scp
+         -   - 4
+             - 7
+         -   - 13
+             - 20
+     -   - 1.0
+         - dump/raw/musan_noise.scp
+         -   - 1
+             - 1
+         -   - 0
+             - 15
+     -   - 1.0
+         - dump/raw/musan_music.scp
+         -   - 1
+             - 1
+         -   - 5
+             - 15
+     rir_apply_prob: 0.5
+     rir_scp: dump/raw/rirs.scp
+ loss: aamsoftmax_sc_topk
+ loss_conf:
+     margin: 0.3
+     scale: 30
+     K: 3
+     mp: 0.06
+     k_top: 5
+ required:
+ - output_dir
+ version: '202310'
+ distributed: true
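The `scheduler_conf` in this file describes a cosine learning-rate schedule with linear warmup and restarts. The sketch below shows how the learning rate would evolve under the usual formulation of such a schedule, using the values from the config. It is an assumption that the `cosineannealingwarmuprestarts` scheduler follows this exact formula; `lr_at_step` is a hypothetical helper, and because `cycle_mult` is 1.0 every cycle is taken to have the same length.

```python
import math


def lr_at_step(
    step: int,
    first_cycle_steps: int = 59480,
    max_lr: float = 1e-3,
    min_lr: float = 5e-6,
    warmup_steps: int = 1000,
    gamma: float = 0.75,
) -> float:
    """Learning rate at a global step (cycle_mult fixed to 1.0)."""
    cycle = step // first_cycle_steps  # index of the current restart cycle
    pos = step % first_cycle_steps     # position inside that cycle
    peak = max_lr * gamma ** cycle     # each restart lowers the peak by gamma
    if pos < warmup_steps:
        # Linear warmup from min_lr up to the cycle's peak.
        return min_lr + (peak - min_lr) * pos / warmup_steps
    # Cosine decay from the peak back down to min_lr.
    progress = (pos - warmup_steps) / (first_cycle_steps - warmup_steps)
    return min_lr + (peak - min_lr) * (1 + math.cos(math.pi * progress)) / 2


for s in (0, 1000, 30240, 59480, 60480):
    print(s, f"{lr_at_step(s):.6g}")
```

Under these assumptions the rate starts at `min_lr`, reaches `max_lr` after 1000 steps, decays along a cosine curve, and the second cycle peaks at 0.75 times `max_lr`.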
exp/spk_train_rawnet3_vbClean_raw_sp/images/backward_time.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/clip.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/eer.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/forward_time.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/gpu_max_cached_mem_GB.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/grad_norm.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/iter_time.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/loss.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/loss_scale.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/mindcf.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/n_trials.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/nontrg_mean.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/nontrg_std.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/optim0_lr0.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/optim_step_time.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/train_time.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/trg_mean.png ADDED
exp/spk_train_rawnet3_vbClean_raw_sp/images/trg_std.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202310'
+ files:
+   model_file: exp/spk_train_rawnet3_vbClean_raw_sp/30epoch.pth
+ python: "3.9.16 (main, May 15 2023, 23:46:34) \n[GCC 11.2.0]"
+ timestamp: 1704326994.034713
+ torch: 1.13.1
+ yaml_files:
+   train_config: exp/spk_train_rawnet3_vbClean_raw_sp/config.yaml
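The `timestamp` field in `meta.yaml` looks like seconds since the Unix epoch. Assuming that interpretation, a short sketch of turning it into a readable UTC time:

```python
from datetime import datetime, timezone

# Value copied from meta.yaml; treated as Unix epoch seconds (an assumption).
META_TIMESTAMP = 1704326994.034713

packed_at = datetime.fromtimestamp(META_TIMESTAMP, tz=timezone.utc)
print(packed_at.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2024-01-04 00:09:54 UTC
```

The `date:` line in RESULTS.md carries no time zone, so the two values cannot be compared directly without knowing the zone of the machine that generated the results.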