🎹 Speaker segmentation


Model from End-to-end speaker segmentation for overlap-aware resegmentation,
by Hervé Bredin and Antoine Laurent.

Online demo is available as a Hugging Face Space.


For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check pyannote.audio Github repository.


Relies on pyannote.audio 2.0 currently in development: see installation instructions.

Voice activity detection

from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
  # onset/offset activation thresholds
  "onset": 0.5, "offset": 0.5,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions

Overlapped speech detection

from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation="pyannote/segmentation")
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions


from pyannote.audio.pipelines import Resegmentation
pipeline = Resegmentation(segmentation="pyannote/segmentation", 
resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
# where `baseline` should be provided as a pyannote.core.Annotation instance

Raw scores

from pyannote.audio import Inference
inference = Inference("pyannote/segmentation")
segmentation = inference("audio.wav")
# `segmentation` is a pyannote.core.SlidingWindowFeature
# instance containing raw segmentation scores like the 
# one pictured above (output)

Reproducible research

In order to reproduce the results of the paper "End-to-end speaker segmentation for overlap-aware resegmentation ", use the following hyper-parameters:

Voice activity detection onset offset min_duration_on min_duration_off
AMI Mix-Headset 0.684 0.577 0.181 0.037
DIHARD3 0.767 0.377 0.136 0.067
VoxConverse 0.767 0.713 0.182 0.501
Overlapped speech detection onset offset min_duration_on min_duration_off
AMI Mix-Headset 0.448 0.362 0.116 0.187
DIHARD3 0.430 0.320 0.091 0.144
VoxConverse 0.587 0.426 0.337 0.112
Resegmentation of VBx onset offset min_duration_on min_duration_off
AMI Mix-Headset 0.542 0.527 0.044 0.705
DIHARD3 0.592 0.489 0.163 0.182
VoxConverse 0.537 0.724 0.410 0.563

Expected outputs (and VBx baseline) are also provided in the /reproducible_research sub-directories.


  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
Downloads last month
Hosted inference API

Inference API has been turned off for this model.

Dataset used to train pyannote/segmentation