bhuvanesh25 committed on
Commit
8171e37
1 Parent(s): bda0077

Upload 3 files

Files changed (3)
  1. README.md +140 -0
  2. config.yaml +18 -0
  3. technical_report_2.1.pdf +0 -0
README.md ADDED
---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
- automatic-speech-recognition
datasets:
- ami
- dihard
- voxconverse
- aishell
- repere
- voxceleb
license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers apply for grants to improve it further. If you are an academic researcher, please cite the relevant papers in your own publications using the model. If you work for a company, please consider contributing back to pyannote.audio development (e.g. through unrestricted gifts). We also provide scientific consulting services around speaker diarization and machine listening."
extra_gated_fields:
  Company/university: text
  Website: text
  I plan to use this model for (task, type of audio data, etc): text
---

# 🎹 Speaker diarization

Relies on pyannote.audio 2.1.1: see [installation instructions](https://github.com/pyannote/pyannote-audio#installation).

## TL;DR

```python
# 1. visit hf.co/pyannote/speaker-diarization and accept user conditions
# 2. visit hf.co/pyannote/segmentation and accept user conditions
# 3. visit hf.co/settings/tokens to create an access token
# 4. instantiate pretrained speaker diarization pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
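
Each line of the RTTM file written above is a space-separated `SPEAKER` record carrying, among other fields, the file ID, turn onset, duration, and speaker label. A minimal sketch of reading such a record back (the record below is hypothetical, for illustration only):

```python
# Parse one line of the RTTM output written by diarization.write_rttm().
# RTTM "SPEAKER" records are space-separated:
#   SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
def parse_rttm_line(line):
    fields = line.split()
    return {
        "file": fields[1],
        "onset": float(fields[3]),
        "duration": float(fields[4]),
        "speaker": fields[7],
    }

# hypothetical record for illustration
record = parse_rttm_line("SPEAKER audio 1 0.168 4.060 <NA> <NA> SPEAKER_00 <NA> <NA>")
```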

## Advanced usage

If the number of speakers is known in advance, one can use the `num_speakers` option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```

One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
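
Whichever options are used, the returned `diarization` is a `pyannote.core.Annotation` whose speaker turns can be iterated with `itertracks(yield_label=True)`. A minimal sketch of aggregating per-speaker speaking time (using hypothetical hard-coded turns so the snippet is self-contained):

```python
from collections import defaultdict

# In practice, turns would come from the pipeline output:
#   for segment, _, speaker in diarization.itertracks(yield_label=True):
#       turns.append((segment.start, segment.end, speaker))
# Hypothetical (start, end, speaker) turns, for illustration:
turns = [(0.2, 1.5, "SPEAKER_00"), (1.5, 3.0, "SPEAKER_01"), (3.2, 4.0, "SPEAKER_00")]

speaking_time = defaultdict(float)
for start, end, speaker in turns:
    speaking_time[speaker] += end - start
```

Here `SPEAKER_00` accumulates roughly 2.1 seconds of speech and `SPEAKER_01` 1.5 seconds.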

## Benchmark

### Real-time factor

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
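
In concrete terms, processing time scales roughly linearly with audio duration at that real-time factor; a back-of-the-envelope helper (hypothetical, not part of the library):

```python
REAL_TIME_FACTOR = 0.025  # ~2.5%, as measured above on a V100 GPU + Cascade Lake CPU

def estimated_processing_minutes(audio_minutes, rtf=REAL_TIME_FACTOR):
    """Estimated wall-clock time (in minutes) to diarize `audio_minutes` of audio."""
    return audio_minutes * rtf

# a one-hour conversation: 60 * 0.025 = 1.5 minutes
one_hour_estimate = estimated_processing_minutes(60)
```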

### Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

* no manual voice activity detection (as is sometimes the case in the literature)
* no manual number of speakers (though it is possible to provide it to the pipeline)
* no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named *"Full"* in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

* no forgiveness collar
* evaluation of overlapped speech

| Benchmark | [DER%](. "Diarization error rate") | [FA%](. "False alarm rate") | [Miss%](. "Missed detection rate") | [Conf%](. "Speaker confusion rate") | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| [AISHELL-4](http://www.openslr.org/111/) | 14.09 | 5.17 | 3.27 | 5.65 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AISHELL.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AISHELL.test.eval) |
| [Albayzin (*RTVE 2022*)](http://catedrartve.unizar.es/albayzindatabases.html) | 25.60 | 5.58 | 6.84 | 13.18 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Albayzin2022.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Albayzin2022.test.eval) |
| [AliMeeting (*channel 1*)](https://www.openslr.org/119/) | 27.42 | 4.84 | 14.00 | 8.58 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AliMeeting.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AliMeeting.test.eval) |
| [AMI (*headset mix,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.91 | 4.48 | 9.51 | 4.91 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI.test.eval) |
| [AMI (*array1, channel 1,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 27.12 | 4.11 | 17.78 | 5.23 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI-SDM.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI-SDM.test.eval) |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) [(*part2*)](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1) | 32.37 | 6.30 | 13.72 | 12.35 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/CALLHOME.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/CALLHOME.test.eval) |
| [DIHARD 3 (*Full*)](https://arxiv.org/abs/2012.01477) | 26.94 | 10.50 | 8.41 | 8.03 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/DIHARD.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/DIHARD.test.eval) |
| [Ego4D *v1 (validation)*](https://arxiv.org/abs/2110.07058) | 63.99 | 3.91 | 44.42 | 15.67 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Ego4D.development.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Ego4D.development.eval) |
| [REPERE (*phase 2*)](https://islrn.org/resources/360-758-359-485-0/) | 8.17 | 2.23 | 2.49 | 3.45 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/REPERE.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/REPERE.test.eval) |
| [This American Life](https://arxiv.org/abs/2005.08072) | 20.82 | 2.03 | 11.89 | 6.90 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/ThisAmericanLife.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/ThisAmericanLife.test.eval) |
| [VoxConverse (*v0.3*)](https://github.com/joonson/voxconverse) | 11.24 | 4.42 | 2.88 | 3.94 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/main/reproducible_research/2.1.1/VoxConverse.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/main/reproducible_research/2.1.1/VoxConverse.test.eval) |

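In the table, DER% is the sum of the three error components (false alarm, missed detection, and speaker confusion), each expressed as a percentage of total speech duration. For instance, checking the AISHELL-4 row:

```python
# DER decomposes additively: DER = FalseAlarm + Miss + SpeakerConfusion
def der(fa, miss, conf):
    return fa + miss + conf

# AISHELL-4 row above: 5.17 + 3.27 + 5.65 = 14.09
aishell4_der = round(der(fa=5.17, miss=3.27, conf=5.65), 2)
```
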
## Technical report

This [report](technical_report_2.1.pdf) describes the main principles behind version `2.1` of the pyannote.audio speaker diarization pipeline.
It also provides recipes explaining how to adapt the pipeline to your own set of annotated data. In particular, those recipes are applied to the above benchmark and consistently lead to significant performance improvements over the above out-of-the-box performance.

## Support

For commercial enquiries and scientific consulting, please contact [me](mailto:herve@niderb.fr).
For [technical questions](https://github.com/pyannote/pyannote-audio/discussions) and [bug reports](https://github.com/pyannote/pyannote-audio/issues), please check the [pyannote.audio](https://github.com/pyannote/pyannote-audio) GitHub repository.

## Citations

```bibtex
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}
```

```bibtex
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
```
config.yaml ADDED
pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    embedding: speechbrain/spkrec-ecapa-voxceleb
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: pyannote/segmentation@2022.07
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 15
    threshold: 0.7153814381597874
  segmentation:
    min_duration_off: 0.5817029604921046
    threshold: 0.4442333667381752
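
Note that the file contains two `params` mappings: the first (nested under `pipeline:`) wires up the pipeline's building blocks, while the second holds the tuned decoding hyper-parameters. A minimal sketch mirroring the tuned block as Python data (values copied from above; the structure is only for illustration):

```python
# Tuned hyper-parameters from the second `params` block of config.yaml.
tuned_params = {
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.7153814381597874,
    },
    "segmentation": {
        "min_duration_off": 0.5817029604921046,
        "threshold": 0.4442333667381752,
    },
}

# Both detection/clustering thresholds are scores in the open interval (0, 1).
assert all(0 < p["threshold"] < 1 for p in tuned_params.values())
```
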
technical_report_2.1.pdf ADDED
Binary file (372 kB).