How can I use this model for inference?

#1
by xuqiantong - opened

Hi Anton,

Thanks for sharing this model.

I have a question about using this model for inference. Suppose I have single-channel audio with two speakers talking without overlap. What kind of output should I expect from this model? And how can I tell which part of the audio is spoken by which speaker?

Looking forward to your reply.

Thanks,
Qiantong

The output is a 2D tensor with shape [sequence_length, 2]. Is my understanding correct that I can get the frame-level prediction by applying torch.sigmoid(output) > 0.5? However, the output looks a bit messy on my test sample.
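For reference, here is a minimal sketch of that thresholding step, assuming the raw output is a logits tensor of shape (num_frames, num_speakers); all names and sizes below are illustrative, not from the model card. A short moving average along the frame axis can help suppress the single-frame flips that make the raw output look messy:

```python
import torch
import torch.nn.functional as F

# Dummy stand-in for the model's raw output: (num_frames, num_speakers).
# In practice this would be the tensor returned by your forward pass.
num_frames, num_speakers = 1000, 2
logits = torch.randn(num_frames, num_speakers)

# Per-frame, per-speaker activity probabilities.
probs = torch.sigmoid(logits)

# Binary decision per frame (True = this speaker is active in this frame).
active = probs > 0.5

# Optional: a short moving average along the frame axis to suppress
# isolated single-frame flips. The kernel size is an arbitrary assumption.
smoothed = F.avg_pool1d(probs.T.unsqueeze(0), kernel_size=5, stride=1, padding=2)
active_smooth = smoothed.squeeze(0).T > 0.5
```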

Hi Anton,

Thanks a lot for sharing this model. I am posting my question here since it is related to what Qiantong has asked.
I can see that the output has shape (num_frames, num_speakers). Could you please guide us on how to map each frame to its corresponding timestamp? Or, better put, how to chunk the audio into pieces based on the speakers?

Best
Chakka
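One possible way to map frames to timestamps, as a sketch: if you know the duration of the input audio and the number of output frames, each frame covers duration / num_frames seconds. The concrete numbers below are assumptions for illustration, not values from the model card:

```python
# Assumed example values; replace with your own audio length and output size.
audio_duration = 30.0   # length of the input audio in seconds
num_frames = 1000       # number of frames in the model output

# Each output frame covers an equal slice of the audio.
frame_duration = audio_duration / num_frames

def frame_to_time(i):
    """Start and end time (in seconds) of output frame i."""
    return i * frame_duration, (i + 1) * frame_duration

start, end = frame_to_time(42)
print(f"frame 42 covers [{start:.2f}s, {end:.2f}s)")
```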

Hi

Did anyone find a solution for mapping the model outputs to timestamps?
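In case it helps anyone landing here later: once you have the thresholded per-frame activity matrix and a per-frame duration (see the sketches above), you can merge runs of consecutive active frames into timestamped speaker segments. A sketch, with illustrative values:

```python
import torch

def frames_to_segments(active, frame_duration):
    """Merge runs of consecutive active frames into (start_s, end_s, speaker)."""
    segments = []
    num_frames, num_speakers = active.shape
    for spk in range(num_speakers):
        start = None
        for i in range(num_frames):
            if active[i, spk] and start is None:
                start = i                      # a run of speech begins
            elif not active[i, spk] and start is not None:
                segments.append((start * frame_duration, i * frame_duration, spk))
                start = None                   # the run ends
        if start is not None:                  # run extends to the last frame
            segments.append((start * frame_duration, num_frames * frame_duration, spk))
    return sorted(segments)

# Dummy input: thresholded output of shape (num_frames, num_speakers);
# frame_duration is an assumed value, derive yours as shown above.
active = torch.sigmoid(torch.randn(1000, 2)) > 0.5
for start_s, end_s, spk in frames_to_segments(active, frame_duration=0.02):
    print(f"{start_s:6.2f}s - {end_s:6.2f}s : speaker {spk}")
```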
