KIFF committed on
Commit f516bd1 · 1 Parent(s): 7c12535

Update README.md

  1. README.md +60 -9
README.md CHANGED
@@ -43,27 +43,78 @@ with open("audio.rttm", "w") as rttm:

  ## Advanced usage

- In case the number of speakers is known in advance, one can use the `num_speakers` option:

  ```python
- diarization = pipeline("audio.wav", num_speakers=2)
  ```

- One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:

  ```python
- diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
  ```

- If you feel adventurous, you can try and play with the various pipeline hyper-parameters.
- For instance, one can use a more aggressive voice activity detection by increasing the value of `segmentation_onset` threshold:

  ```python
- hparams = pipeline.parameters(instantiated=True)
  hparams["segmentation_onset"] += 0.1
- pipeline.instantiate(hparams)
  ```
-
  ## Benchmark

  ### Real-time factor
 

  ## Advanced usage

+ If the number of speakers is known in advance, pass `num_speakers` in the `parameters` dictionary:

  ```python
+ handler = EndpointHandler()
+ diarization = handler({"inputs": base64_audio, "parameters": {"num_speakers": 2}})
  ```
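
The examples in this section pass a `base64_audio` value without showing how it is built. The handler added in this commit decodes the payload as 16-bit integer samples at 16 kHz and wraps them as a single channel, so one way to produce a matching string is sketched below. The helper name `wav_to_base64_pcm` is hypothetical, and the format checks are an assumption drawn from how the handler reads its input, not a documented requirement.

```python
import base64
import wave

SAMPLE_RATE = 16000  # mirrors the SAMPLE_RATE constant used by the handler


def wav_to_base64_pcm(path: str) -> str:
    """Return the raw 16-bit PCM frames of a mono 16 kHz WAV file as base64 text."""
    with wave.open(path, "rb") as wav:
        # The handler interprets the bytes as int16 samples at 16 kHz, one channel,
        # so reject files that do not match rather than send mislabeled audio.
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getframerate() != SAMPLE_RATE:
            raise ValueError(f"expected {SAMPLE_RATE} Hz audio")
        # Read only the sample data; the WAV header is not part of the payload.
        frames = wav.readframes(wav.getnframes())
    return base64.b64encode(frames).decode("ascii")
```

With this helper, `base64_audio = wav_to_base64_pcm("audio.wav")` would supply the value used in the calls above. Audio in another format would need resampling or conversion first.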

+ You can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` parameters:

  ```python
+ handler = EndpointHandler()
+ diarization = handler({"inputs": base64_audio, "parameters": {"min_speakers": 2, "max_speakers": 5}})
  ```
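
Because the speaker options travel inside a nested `parameters` dictionary, it can help to assemble the request in one place. The sketch below is a hypothetical client-side helper (`build_payload` is not part of the handler or of pyannote) that leaves out unset options and applies two sanity checks chosen for this example: counts must be positive, and the lower bound must not exceed the upper bound.

```python
from typing import Any, Dict, Optional


def build_payload(
    base64_audio: str,
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the request dictionary, omitting options that were not set."""
    options = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    # Keep only the options the caller actually provided.
    parameters = {name: value for name, value in options.items() if value is not None}

    # Sanity checks chosen for this sketch.
    for name, value in parameters.items():
        if value < 1:
            raise ValueError(f"{name} must be at least 1")
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")

    payload: Dict[str, Any] = {"inputs": base64_audio}
    if parameters:
        # Match the handler, which treats a missing "parameters" key as "no options".
        payload["parameters"] = parameters
    return payload
```

For example, `handler(build_payload(base64_audio, min_speakers=2, max_speakers=5))` would send the same request as the call above.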

+ If you're feeling adventurous, you can experiment with the pipeline hyperparameters.
+ For instance, you can make voice activity detection more aggressive by increasing the `segmentation_onset` threshold:

  ```python
+ hparams = handler.pipeline.parameters(instantiated=True)
  hparams["segmentation_onset"] += 0.1
+ handler.pipeline.instantiate(hparams)
+ ```
+ The examples above assume the following custom handler, which forwards the optional `parameters` dictionary (such as the number of speakers) to the pipeline:
+ ```python
+ from typing import Any, Dict
+ from pyannote.audio import Pipeline
+ import torch
+ import base64
+ import numpy as np
+
+ SAMPLE_RATE = 16000
+
+ class EndpointHandler():
+     def __init__(self, path=""):
+         # load the model
+         self.pipeline = Pipeline.from_pretrained("KIFF/pyannote-speaker-diarization-endpoint")
+
+     def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
+         """
+         Args:
+             data (:obj:`dict`):
+                 "inputs" holds the base64-encoded audio, read below as 16-bit integer samples at SAMPLE_RATE;
+                 "parameters" (optional) holds keyword arguments for the pipeline
+         Return:
+             A :obj:`dict` whose "diarization" entry lists the label, start, and stop of each speaker turn
+         """
+         # process input
+         inputs = data.pop("inputs", data)
+         parameters = data.pop("parameters", None)  # e.g. {"min_speakers": 2, "max_speakers": 5}
+
+         # decode the base64 audio data
+         audio_data = base64.b64decode(inputs)
+         audio_nparray = np.frombuffer(audio_data, dtype=np.int16)
+
+         # prepare pyannote input
+         audio_tensor = torch.from_numpy(audio_nparray).float().unsqueeze(0)
+         pyannote_input = {"waveform": audio_tensor, "sample_rate": SAMPLE_RATE}
+
+         # apply pretrained pipeline, forwarding any parameters as keyword arguments
+         if parameters is not None:
+             diarization = self.pipeline(pyannote_input, **parameters)
+         else:
+             diarization = self.pipeline(pyannote_input)
+
+         # postprocess the prediction
+         processed_diarization = [
+             {"label": str(label), "start": str(segment.start), "stop": str(segment.end)}
+             for segment, _, label in diarization.itertracks(yield_label=True)
+         ]
+
+         return {"diarization": processed_diarization}
  ```
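
The handler returns every speaker turn with its `start` and `stop` times serialized as strings, so a caller has to convert them back to numbers before doing arithmetic. As a minimal sketch of consuming that response, the hypothetical helper below (`speaking_time` is not part of the handler) totals the seconds attributed to each label. The sample labels and times in the usage note are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict


def speaking_time(response: Dict[str, Any]) -> Dict[str, float]:
    """Sum the duration in seconds of all turns for each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for turn in response["diarization"]:
        # "start" and "stop" arrive as strings, so parse them before subtracting.
        totals[turn["label"]] += float(turn["stop"]) - float(turn["start"])
    return dict(totals)
```

Given a response with turns `("SPEAKER_00", "0.5", "2.0")`, `("SPEAKER_01", "2.0", "3.0")`, and `("SPEAKER_00", "3.0", "4.5")`, the helper would return 3.0 seconds for `SPEAKER_00` and 1.0 second for `SPEAKER_01`.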
 
  ## Benchmark

  ### Real-time factor