---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source pipeline in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# đŸŽč PixIT / joint speaker diarization and speech separation

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance and speech separation as a [`SlidingWindowFeature`](http://pyannote.github.io/pyannote-core/reference.html#pyannote.core.SlidingWindowFeature).
Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

![Pipeline](pipeline.png)

It has been trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.2` using the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM).

This [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.2` with `pip install pyannote.audio[separation]==3.3.2`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Accept [`pyannote/speech-separation-ami-1.0`](https://hf.co/pyannote/speech-separation-ami-1.0) user conditions
4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

## Usage

```python
# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

# dump sources to disk as SPEAKER_XX.wav files
import scipy.io.wavfile
for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
```
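The returned `diarization` is a regular [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) that can be iterated over, while `sources.data` holds one separated waveform column per speaker, in the same order as `diarization.labels()` (as the usage example above assumes). The snippet below is a minimal sketch for inspecting both outputs; it reuses the `diarization` and `sources` variables from the usage example:

```python
# minimal sketch: inspect the outputs returned by the pipeline above

# print each speaker turn in the diarization Annotation
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")

# separated sources: one column per speaker label
print(sources.data.shape)    # (num_samples, num_speakers)
print(diarization.labels())  # e.g. ['SPEAKER_00', 'SPEAKER_01', ...]
```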
### Processing on GPU

`pyannote.audio` pipelines run on CPU by default.
You can send them to GPU with the following lines:

```python
import torch
pipeline.to(torch.device("cuda"))
```

### Processing from memory

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

# load the waveform once, then pass it to the pipeline in memory
waveform, sample_rate = torchaudio.load("audio.wav")
diarization, sources = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

### Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

# ProgressHook displays a progress bar for each step of the pipeline
with ProgressHook() as hook:
    diarization, sources = pipeline("audio.wav", hook=hook)
```

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and ClĂ©ment PagĂ©s and Ricard Marxer and Tanel AlumĂ€e and HervĂ© Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year=2024,
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={HervĂ© Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```