How can I use this model for inference?

#1
by xuqiantong - opened

Hi Anton,

Thanks for sharing this model.

I have a question about using this model for inference. Suppose I have single-channel audio with two speakers talking without overlap. What kind of output should I expect from this model? And how can I tell which part of the audio is spoken by which speaker?

Looking forward to your reply.

Thanks,
Qiantong

The output is a 2D tensor with shape [sequence_length, 2]. Is my understanding correct that I can get the frame-level prediction by applying torch.sigmoid(output) > 0.5? However, the output looks a bit messy on my test sample.
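For reference, here is a minimal sketch of that thresholding step, assuming the raw output is a logits tensor of shape (num_frames, num_speakers); all names and sizes below are illustrative, not from the model card. A short moving average along the frame axis can help suppress the single-frame flips that make the raw output look messy:

```python
import torch
import torch.nn.functional as F

# Dummy stand-in for the model's raw output: (num_frames, num_speakers).
# In practice this would be the tensor returned by your forward pass.
num_frames, num_speakers = 1000, 2
logits = torch.randn(num_frames, num_speakers)

# Per-frame, per-speaker activity probabilities.
probs = torch.sigmoid(logits)

# Binary decision per frame (True = this speaker is active in this frame).
active = probs > 0.5

# Optional: a short moving average along the frame axis to suppress
# isolated single-frame flips. The kernel size is an arbitrary assumption.
smoothed = F.avg_pool1d(probs.T.unsqueeze(0), kernel_size=5, stride=1, padding=2)
active_smooth = smoothed.squeeze(0).T > 0.5
```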

Hi Anton,

Thanks a lot for sharing this model. I am posting my question here since it is related to what Qiantong has asked.
I can see that the output has shape (num_frames, num_speakers). Could you please guide us on how to map each frame to its corresponding timestamp? Or, better put, how to chunk the audio into pieces based on the speakers?

Best
Chakka
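One possible way to map frames to timestamps, as a sketch: if you know the duration of the input audio and the number of output frames, each frame covers duration / num_frames seconds. The concrete numbers below are assumptions for illustration, not values from the model card:

```python
# Assumed example values; replace with your own audio length and output size.
audio_duration = 30.0   # length of the input audio in seconds
num_frames = 1000       # number of frames in the model output

# Each output frame covers an equal slice of the audio.
frame_duration = audio_duration / num_frames

def frame_to_time(i):
    """Start and end time (in seconds) of output frame i."""
    return i * frame_duration, (i + 1) * frame_duration

start, end = frame_to_time(42)
print(f"frame 42 covers [{start:.2f}s, {end:.2f}s)")
```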

Hi

Did anyone find a solution for mapping the model outputs to timestamps?
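In case it helps anyone landing here later: once you have the thresholded per-frame activity matrix and a per-frame duration (see the sketches above), you can merge runs of consecutive active frames into timestamped speaker segments. A sketch, with illustrative values:

```python
import torch

def frames_to_segments(active, frame_duration):
    """Merge runs of consecutive active frames into (start_s, end_s, speaker)."""
    segments = []
    num_frames, num_speakers = active.shape
    for spk in range(num_speakers):
        start = None
        for i in range(num_frames):
            if active[i, spk] and start is None:
                start = i                      # a run of speech begins
            elif not active[i, spk] and start is not None:
                segments.append((start * frame_duration, i * frame_duration, spk))
                start = None                   # the run ends
        if start is not None:                  # run extends to the last frame
            segments.append((start * frame_duration, num_frames * frame_duration, spk))
    return sorted(segments)

# Dummy input: thresholded output of shape (num_frames, num_speakers);
# frame_duration is an assumed value, derive yours as shown above.
active = torch.sigmoid(torch.randn(1000, 2)) > 0.5
for start_s, end_s, spk in frames_to_segments(active, frame_duration=0.02):
    print(f"{start_s:6.2f}s - {end_s:6.2f}s : speaker {spk}")
```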
