Radu-Sebastian Amarie committed
Commit 6fac4b0
1 Parent(s): b237a51

ADD: Speaker Diarization Pipeline with Speechbrain embedding

Files changed (3)
  1. README.md +145 -0
  2. config.yaml +19 -0
  3. handler.py +58 -0
README.md CHANGED
@@ -1,3 +1,148 @@
  ---
+ tags:
+ - pyannote
+ - pyannote-audio
+ - pyannote-audio-pipeline
+ - audio
+ - voice
+ - speech
+ - speaker
+ - speaker-diarization
+ - speaker-change-detection
+ - voice-activity-detection
+ - overlapped-speech-detection
+ - automatic-speech-recognition
  license: mit
+ extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open source, we will occasionally email you about premium pipelines and paid services around pyannote."
+ extra_gated_fields:
+   Company/university: text
+   Website: text
  ---
+
+ Using this open-source model in production?
+ Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
+
+ # 🎹 Speaker diarization 3.1
+
+ This pipeline is the same as [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0), except that it removes the [problematic](https://github.com/pyannote/pyannote-audio/issues/1537) use of `onnxruntime`.
+ Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and may speed up inference.
+ It requires pyannote.audio version 3.1 or higher.
+
+ It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:
+
+ - stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
+ - audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
+
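For reference, a minimal torchaudio sketch of the equivalent preprocessing (channel averaging plus resampling to 16 kHz) that the pipeline applies automatically; the file name is a placeholder and this manual step is normally unnecessary:

```python
import torchaudio

# load a stereo / non-16kHz file (placeholder name)
waveform, sample_rate = torchaudio.load("stereo_44k.wav")  # shape: (channels, samples)

# downmix to mono by averaging the channels
waveform = waveform.mean(dim=0, keepdim=True)

# resample to the 16 kHz expected by the pipeline
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000
```
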
+ ## Requirements
+
+ 1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.1` with `pip install pyannote.audio`
+ 2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+ 3. Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
+ 4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
+
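The access token from step 4 can be passed to `Pipeline.from_pretrained` as shown in the Usage section below, or registered once per machine. A minimal sketch of the latter, assuming the `huggingface_hub` package (pulled in as a dependency of `pyannote.audio`):

```python
# one-time authentication, as an alternative to passing use_auth_token explicitly
from huggingface_hub import login

login(token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
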
+ ## Usage
+
+ ```python
+ # instantiate the pipeline
+ from pyannote.audio import Pipeline
+ pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-3.1",
+     use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+ # run the pipeline on an audio file
+ diarization = pipeline("audio.wav")
+
+ # dump the diarization output to disk using RTTM format
+ with open("audio.rttm", "w") as rttm:
+     diarization.write_rttm(rttm)
+ ```
+
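The returned `diarization` is a pyannote.core `Annotation`; for instance, its speaker turns can be listed directly (this is the same `itertracks` call used by `handler.py` below):

```python
# list speaker turns: who speaks from when to when
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```
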
+ ### Processing on GPU
+
+ `pyannote.audio` pipelines run on CPU by default.
+ You can send them to GPU with the following lines:
+
+ ```python
+ import torch
+ pipeline.to(torch.device("cuda"))
+ ```
+
+ ### Processing from memory
+
+ Pre-loading audio files in memory may result in faster processing:
+
+ ```python
+ import torchaudio
+ waveform, sample_rate = torchaudio.load("audio.wav")
+ diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
+ ```
+
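Along the same lines, a hedged sketch of processing only an excerpt of a file, using pyannote's `Audio` helper and a `Segment` (the time range is illustrative):

```python
from pyannote.audio import Audio
from pyannote.core import Segment

# load (downmixed and resampled) audio for the first two minutes only
io = Audio(sample_rate=16000, mono="downmix")
waveform, sample_rate = io.crop("audio.wav", Segment(0.0, 120.0))
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```
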
+ ### Monitoring progress
+
+ Hooks are available to monitor the progress of the pipeline:
+
+ ```python
+ from pyannote.audio.pipelines.utils.hook import ProgressHook
+ with ProgressHook() as hook:
+     diarization = pipeline("audio.wav", hook=hook)
+ ```
+
+ ### Controlling the number of speakers
+
+ In case the number of speakers is known in advance, one can use the `num_speakers` option:
+
+ ```python
+ diarization = pipeline("audio.wav", num_speakers=2)
+ ```
+
+ One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:
+
+ ```python
+ diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
+ ```
+
+ ## Benchmark
+
+ This pipeline has been benchmarked on a large collection of datasets.
+
+ Processing is fully automatic:
+
+ - no manual voice activity detection (as is sometimes the case in the literature)
+ - no manual number of speakers (though it is possible to provide it to the pipeline)
+ - no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset
+
+ ... with the least forgiving diarization error rate (DER) setup (named _"Full"_ in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):
+
+ - no forgiveness collar
+ - evaluation of overlapped speech
+
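A minimal sketch of scoring a hypothesis under this _"Full"_ setup, assuming `pyannote.metrics` is available (it is installed alongside `pyannote.audio`); the toy reference annotation is purely illustrative:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# toy ground-truth annotation with two (overlapping) speakers
reference = Annotation()
reference[Segment(0.0, 10.0)] = "alice"
reference[Segment(8.0, 20.0)] = "bob"

# "Full" setup: no forgiveness collar, overlapped speech is scored
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, diarization)
print(f"DER = {100 * der:.1f}%")
```
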
+ | Benchmark | [DER%](. "Diarization error rate") | [FA%](. "False alarm rate") | [Miss%](. "Missed detection rate") | [Conf%](. "Speaker confusion rate") | Expected output | File-level evaluation |
+ | --------- | ---------------------------------- | --------------------------- | ---------------------------------- | ----------------------------------- | --------------- | --------------------- |
+ | [AISHELL-4](http://www.openslr.org/111/) | 12.2 | 3.8 | 4.4 | 4.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
+ | [AliMeeting (_channel 1_)](https://www.openslr.org/119/) | 24.4 | 4.4 | 10.0 | 10.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (_headset mix,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only_words_)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.8 | 3.6 | 9.5 | 5.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (_array1, channel 1,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only_words)_](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.4 | 3.8 | 11.2 | 7.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
+ | [AVA-AVD](https://arxiv.org/abs/2111.14448) | 50.0 | 10.8 | 15.7 | 23.4 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
+ | [DIHARD 3 (_Full_)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
+ | [MSDWild](https://x-lance.github.io/MSDWILD/) | 25.3 | 5.8 | 8.0 | 11.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
+ | [REPERE (_phase 2_)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
+ | [VoxConverse (_v0.3_)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
+
+ ## Citations
+
+ ```bibtex
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,19 @@
+ version: 3.1.1
+
+ pipeline:
+   name: pyannote.audio.pipelines.SpeakerDiarization
+   params:
+     clustering: AgglomerativeClustering
+     embedding: speechbrain/spkrec-ecapa-voxceleb
+     embedding_batch_size: 32
+     embedding_exclude_overlap: true
+     segmentation: pyannote/segmentation-3.0
+     segmentation_batch_size: 32
+
+ params:
+   clustering:
+     method: centroid
+     min_cluster_size: 12
+     threshold: 0.7045654963945799
+   segmentation:
+     min_duration_off: 0.0
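For context, a hedged sketch of how the tunable hyper-parameters in the `params` block above could be overridden after loading, via the pipeline's `instantiate` method (the threshold value shown is illustrative, not a recommendation):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# override the clustering / segmentation hyper-parameters from config.yaml
pipeline.instantiate({
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.70,  # illustrative value only
    },
    "segmentation": {"min_duration_off": 0.0},
})
```
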
handler.py ADDED
@@ -0,0 +1,58 @@
+ # MIT License
+ #
+ # Copyright (c) 2023 CNRS
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the "Software"), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+
+
+ from pyannote.audio import Pipeline, Audio
+ import torch
+
+
+ class EndpointHandler:
+     def __init__(self, path=""):
+         # initialize pretrained pipeline
+         self._pipeline = Pipeline.from_pretrained("eek/speaker-diarization")
+
+         # send pipeline to GPU if available
+         if torch.cuda.is_available():
+             self._pipeline.to(torch.device("cuda"))
+
+         # initialize audio reader
+         self._io = Audio()
+
+     def __call__(self, data):
+         # audio input: path, URL, or file-like object understood by the Audio reader
+         inputs = data.pop("inputs", data)
+         waveform, sample_rate = self._io(inputs)
+
+         # optional pipeline parameters (e.g. num_speakers, min/max_speakers)
+         parameters = data.pop("parameters", dict())
+         diarization = self._pipeline(
+             {"waveform": waveform, "sample_rate": sample_rate}, **parameters
+         )
+
+         # flatten the Annotation into a JSON-serializable list of speaker turns
+         processed_diarization = [
+             {
+                 "speaker": speaker,
+                 "start": f"{turn.start:.3f}",
+                 "end": f"{turn.end:.3f}",
+             }
+             for turn, _, speaker in diarization.itertracks(yield_label=True)
+         ]
+
+         return {"diarization": processed_diarization}
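
A minimal local smoke test for the handler above (the module name `handler` and the audio path are placeholders; on a deployed Inference Endpoint the payload is delivered by the serving stack instead):

```python
from handler import EndpointHandler

handler = EndpointHandler()
result = handler({"inputs": "audio.wav", "parameters": {"num_speakers": 2}})
for turn in result["diarization"][:5]:
    print(turn["speaker"], turn["start"], turn["end"])
```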