Radu-Sebastian Amarie committed on
Commit
cac4808
1 Parent(s): 20d0890

Converted WeSpeakerResnet293 to pyannote

Files changed (3)
  1. README.md +111 -3
  2. config.yaml +88 -0
  3. pytorch_model.bin +3 -0
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: mit
- ---
+ ---
+ tags:
+ - pyannote
+ - pyannote-audio
+ - pyannote-audio-model
+ - wespeaker
+ - audio
+ - voice
+ - speech
+ - speaker
+ - speaker-recognition
+ - speaker-verification
+ - speaker-identification
+ - speaker-embedding
+ datasets:
+ - voxceleb
+ license: cc-by-4.0
+ inference: false
+ ---
+
+ Using this open-source model in production?
+ Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
+
+ # 🎹 Wrapper around WeSpeaker ResNet293
+
+ This model requires `pyannote.audio` version 3.1 or higher.
+
+ This is a wrapper around the [WeSpeaker](https://github.com/wenet-e2e/wespeaker) pretrained ResNet293 speaker embedding model, for use in `pyannote.audio`.
+
+ ## Basic usage
+
+ ```python
+ # instantiate pretrained model
+ # (replace the identifier below with the repository hosting this ResNet293 checkpoint)
+ from pyannote.audio import Model
+ model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
+ ```
+
+ ```python
+ from pyannote.audio import Inference
+ inference = Inference(model, window="whole")
+ embedding1 = inference("speaker1.wav")
+ embedding2 = inference("speaker2.wav")
+ # `embeddingX` is a (1 x D) numpy array extracted from the file as a whole.
+
+ from scipy.spatial.distance import cdist
+ distance = cdist(embedding1, embedding2, metric="cosine")[0, 0]
+ # `distance` is a `float` describing how dissimilar speakers 1 and 2 are.
+ ```
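+
+ For speaker verification, this cosine distance is typically compared against a decision threshold tuned on held-out trials. A minimal sketch building on the block above, where the threshold value is a placeholder and not one calibrated for this model:
+
+ ```python
+ # decide whether both recordings come from the same speaker
+ THRESHOLD = 0.5  # placeholder; tune on a development set (e.g. to minimize EER)
+ same_speaker = distance < THRESHOLD
+ print(f"same speaker: {same_speaker} (cosine distance = {distance:.3f})")
+ ```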
+
+ ## Advanced usage
+
+ ### Running on GPU
+
+ ```python
+ import torch
+ inference.to(torch.device("cuda"))
+ embedding = inference("audio.wav")
+ ```
+
+ ### Extract embedding from an excerpt
+
+ ```python
+ from pyannote.audio import Inference
+ from pyannote.core import Segment
+ inference = Inference(model, window="whole")
+ excerpt = Segment(13.37, 19.81)
+ embedding = inference.crop("audio.wav", excerpt)
+ # `embedding` is a (1 x D) numpy array extracted from the file excerpt.
+ ```
+
+ ### Extract embeddings using a sliding window
+
+ ```python
+ from pyannote.audio import Inference
+ inference = Inference(model, window="sliding",
+                       duration=3.0, step=1.0)
+ embeddings = inference("audio.wav")
+ # `embeddings` is a (N x D) pyannote.core.SlidingWindowFeature
+ # `embeddings[i]` is the embedding of the ith position of the
+ # sliding window, i.e. from [i * step, i * step + duration].
+ ```
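+
+ To recover the time span behind each row, a `SlidingWindowFeature` carries a `sliding_window` attribute whose `i`-th item is the corresponding `pyannote.core.Segment`. A brief sketch continuing from the block above:
+
+ ```python
+ import numpy as np
+
+ # pair each embedding with the chunk of audio it was extracted from
+ for i, embedding in enumerate(embeddings.data):
+     chunk = embeddings.sliding_window[i]  # Segment(i * step, i * step + duration)
+     print(f"[{chunk.start:.1f}s, {chunk.end:.1f}s] norm = {np.linalg.norm(embedding):.2f}")
+ ```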
+
+ ## License
+
+ According to [this page](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md):
+
+ > The pretrained model in WeNet follows the license of it's corresponding dataset. For example, the pretrained model on VoxCeleb follows Creative Commons Attribution 4.0 International License., since it is used as license of the VoxCeleb dataset, see https://mm.kaist.ac.kr/datasets/voxceleb/.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{Wang2023,
+   title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
+   author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
+   booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={1--5},
+   year={2023},
+   organization={IEEE}
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year={2023},
+   booktitle={Proc. INTERSPEECH 2023},
+   pages={1983--1987},
+   doi={10.21437/Interspeech.2023-105}
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,88 @@
+ model:
+   _target_: pyannote.audio.models.embedding.WeSpeakerResNet293
+   sample_rate: 16000
+   num_channels: 1
+   num_mel_bins: 80
+   frame_length: 25
+   frame_shift: 10
+   dither: 1.0
+   window_type: hamming
+   use_energy: false
+ model_args:
+   embed_dim: 256
+   feat_dim: 80
+   pooling_func: TSTP
+   two_emb_layer: false
+ data_type: shard
+ dataloader_args:
+   batch_size: 32
+   drop_last: true
+   num_workers: 16
+   pin_memory: false
+   prefetch_factor: 8
+ dataset_args:
+   aug_prob: 0.6
+   fbank_args:
+     dither: 1.0
+     frame_length: 25
+     frame_shift: 10
+     num_mel_bins: 80
+   num_frms: 200
+   shuffle: true
+   shuffle_args:
+     shuffle_size: 2500
+   spec_aug: false
+   spec_aug_args:
+     max_f: 8
+     max_t: 10
+     num_f_mask: 1
+     num_t_mask: 1
+     prob: 0.6
+   speed_perturb: true
+ exp_dir: exp/ResNet293-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-SGD-epoch150
+ gpus:
+ - 0
+ - 1
+ log_batch_interval: 100
+ loss: CrossEntropyLoss
+ loss_args: {}
+ margin_scheduler: MarginScheduler
+ margin_update:
+   epoch_iter: 17062
+   final_margin: 0.2
+   fix_start_epoch: 40
+   increase_start_epoch: 20
+   increase_type: exp
+   initial_margin: 0.0
+   update_margin: true
+
+ model_init: null
+ noise_data: data/musan/lmdb
+ num_avg: 2
+ num_epochs: 150
+ optimizer: SGD
+ optimizer_args:
+   lr: 0.1
+   momentum: 0.9
+   nesterov: true
+   weight_decay: 0.0001
+ projection_args:
+   easy_margin: false
+   embed_dim: 256
+   num_class: 17982
+   project_type: arc_margin
+   scale: 32.0
+ reverb_data: data/rirs/lmdb
+ save_epoch_interval: 5
+ scheduler: ExponentialDecrease
+ scheduler_args:
+   epoch_iter: 17062
+   final_lr: 5.0e-05
+   initial_lr: 0.1
+   num_epochs: 150
+   scale_ratio: 1.0
+   warm_from_zero: true
+   warm_up_epoch: 6
+ seed: 42
+ train_data: data/vox2_dev/shard.list
+ train_label: data/vox2_dev/utt2spk
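
The `model` section of this config follows the `_target_` convention: a dotted path names the class to instantiate, and the sibling keys become its keyword arguments. A minimal sketch of how such an entry can be resolved, assuming `config.yaml` is local and the dotted path is importable in the installed `pyannote.audio` (the library's own loading code may differ):

```python
import importlib

import yaml

# read the `model` section of config.yaml
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["model"]

# split the dotted path into module and class name, then import the class
module_path, class_name = cfg.pop("_target_").rsplit(".", 1)
cls = getattr(importlib.import_module(module_path), class_name)

# remaining keys (sample_rate, num_mel_bins, ...) become keyword arguments
model = cls(**cfg)
```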
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:11f7629bb2d160c4e395d305e0a09050eca7b98fb46efeca89021d14681ee7c5
+ size 115615132
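
`pytorch_model.bin` is stored as a Git LFS pointer: the repository tracks only the object's sha256 and size (~116 MB), while the weights themselves live in LFS storage. A quick way to check that a locally downloaded checkpoint matches the pointer:

```python
import hashlib

# hash the downloaded checkpoint in chunks and compare against the LFS pointer
sha256 = hashlib.sha256()
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

expected = "11f7629bb2d160c4e395d305e0a09050eca7b98fb46efeca89021d14681ee7c5"
print("checksum OK" if sha256.hexdigest() == expected else "checksum mismatch")
```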