Duplicate from pyannote/wespeaker-voxceleb-resnet34-LM
Co-authored-by: Hervé Bredin <hbredin@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +111 -0
- config.yaml +10 -0
- pytorch_model.bin +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,111 @@
---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- wespeaker
- audio
- voice
- speech
- speaker
- speaker-recognition
- speaker-verification
- speaker-identification
- speaker-embedding
datasets:
- voxceleb
license: cc-by-4.0
inference: false
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# 🎹 Wrapper around wespeaker-voxceleb-resnet34-LM

This model requires `pyannote.audio` version 3.1 or higher.
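
To check that the installed version meets this requirement, here is a minimal sketch (it assumes the package is installed under its PyPI distribution name, `pyannote.audio`):

```python
# Minimal version check -- assumes the PyPI distribution name "pyannote.audio".
from importlib.metadata import version

print(version("pyannote.audio"))  # expected: 3.1 or higher
```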

This is a wrapper around the [WeSpeaker](https://github.com/wenet-e2e/wespeaker) `wespeaker-voxceleb-resnet34-LM` pretrained speaker embedding model, for use in `pyannote.audio`.

## Basic usage

```python
# instantiate pretrained model
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
```

```python
from pyannote.audio import Inference
inference = Inference(model, window="whole")
embedding1 = inference("speaker1.wav")
embedding2 = inference("speaker2.wav")
# `embeddingX` is a (1 x D) numpy array extracted from the file as a whole.

from scipy.spatial.distance import cdist
distance = cdist(embedding1, embedding2, metric="cosine")[0, 0]
# `distance` is a `float` describing how dissimilar speakers 1 and 2 are.
```
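
As a rough illustration of how this distance is typically turned into a same/different-speaker decision, here is a minimal sketch; the threshold below is hypothetical and should be tuned on held-out data for any real application:

```python
# Hypothetical decision threshold -- tune on your own validation data.
THRESHOLD = 0.5

if distance < THRESHOLD:
    print("probably the same speaker")
else:
    print("probably different speakers")
```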

## Advanced usage

### Running on GPU

```python
import torch
inference.to(torch.device("cuda"))
embedding = inference("audio.wav")
```

### Extract embedding from an excerpt

```python
from pyannote.audio import Inference
from pyannote.core import Segment
inference = Inference(model, window="whole")
excerpt = Segment(13.37, 19.81)
embedding = inference.crop("audio.wav", excerpt)
# `embedding` is a (1 x D) numpy array extracted from the file excerpt.
```

### Extract embeddings using a sliding window

```python
from pyannote.audio import Inference
inference = Inference(model, window="sliding",
                      duration=3.0, step=1.0)
embeddings = inference("audio.wav")
# `embeddings` is a (N x D) pyannote.core.SlidingWindowFeature
# `embeddings[i]` is the embedding of the ith position of the
# sliding window, i.e. from [i * step, i * step + duration].
```
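
For reference, here is a minimal sketch of reading individual embeddings and their time spans back out of the returned object (it assumes the `.data` array and `.sliding_window` attributes of `pyannote.core.SlidingWindowFeature`):

```python
import numpy as np

# Raw (N x D) matrix of embeddings.
matrix = np.asarray(embeddings.data)

# Each row is paired with the time window it was extracted from.
for i, vector in enumerate(matrix):
    chunk = embeddings.sliding_window[i]  # a pyannote.core.Segment
    print(f"[{chunk.start:.1f}s, {chunk.end:.1f}s] embedding of dimension {vector.shape[0]}")
```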

## License

According to [this page](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md):

> The pretrained model in WeNet follows the license of it's corresponding dataset. For example, the pretrained model on VoxCeleb follows Creative Commons Attribution 4.0 International License., since it is used as license of the VoxCeleb dataset, see https://mm.kaist.ac.kr/datasets/voxceleb/.

## Citation

```bibtex
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={1983--1987},
  doi={10.21437/Interspeech.2023-105}
}
```
config.yaml
ADDED
@@ -0,0 +1,10 @@
model:
  _target_: pyannote.audio.models.embedding.WeSpeakerResNet34
  sample_rate: 16000
  num_channels: 1
  num_mel_bins: 80
  frame_length: 25
  frame_shift: 10
  dither: 0.0
  window_type: hamming
  use_energy: false
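
The `_target_` key follows the Hydra instantiation convention, so the architecture described above can be built directly from this file. The following is a minimal sketch only: `Model.from_pretrained` (shown in the README) remains the supported loading path, and the weights in `pytorch_model.bin` would still need to be loaded separately.

```python
# Sketch: build the (randomly initialized) architecture described by config.yaml
# via Hydra's instantiation convention; pretrained weights are NOT loaded here.
import yaml
from hydra.utils import instantiate

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

model = instantiate(cfg["model"])  # pyannote.audio.models.embedding.WeSpeakerResNet34
print(model)
```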
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:366edf44f4c80889a3eb7a9d7bdf02c4aede3127f7dd15e274dcdb826b143c56
size 26645418