---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source pipeline in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# đŸŽč PixIT / joint speaker diarization and speech separation

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance and speech separation as a [`SlidingWindowFeature`](http://pyannote.github.io/pyannote-core/reference.html#pyannote.core.SlidingWindowFeature).
Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

![Pipeline](pipeline.png)

It has been trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.2` using the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM).

This [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.2` with `pip install pyannote.audio[separation]==3.3.2`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Accept [`pyannote/speech-separation-ami-1.0`](https://hf.co/pyannote/speech-separation-ami-1.0) user conditions
4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

## Usage

```python
# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

# dump sources to disk as SPEAKER_XX.wav files
import scipy.io.wavfile
for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
```
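The returned `diarization` is a regular [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) that can be iterated over, while `sources.data` holds one separated waveform column per speaker, in the same order as `diarization.labels()` (as the usage example above assumes). The snippet below is a minimal sketch for inspecting both outputs; it reuses the `diarization` and `sources` variables from the usage example:

```python
# minimal sketch: inspect the outputs returned by the pipeline above

# print each speaker turn in the diarization Annotation
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")

# separated sources: one column per speaker label
print(sources.data.shape)    # (num_samples, num_speakers)
print(diarization.labels())  # e.g. ['SPEAKER_00', 'SPEAKER_01', ...]
```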
### Processing on GPU

`pyannote.audio` pipelines run on CPU by default.
You can send them to GPU with the following lines:

```python
import torch
pipeline.to(torch.device("cuda"))
```

### Processing from memory

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

# load the waveform once, then pass it to the pipeline in memory
waveform, sample_rate = torchaudio.load("audio.wav")
diarization, sources = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

### Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

# ProgressHook displays a progress bar for each step of the pipeline
with ProgressHook() as hook:
    diarization, sources = pipeline("audio.wav", hook=hook)
```

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and ClĂ©ment PagĂ©s and Ricard Marxer and Tanel AlumĂ€e and HervĂ© Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year=2024,
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={HervĂ© Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```