Running locally

#3
by alexgo84 - opened

This solution works really well! I've tried to combine Whisper with Pyannote Audio speaker recognition but had poor results, probably because of some oversights on my part.

I would like to run your code locally; is that possible? I don't see where the Pyannote auth token is provided (sorry for the noob question).

So I found how to add the authentication token and it seems to work. For some reason, though, I get very different results than the ones I get when using the 'App'. I am using a smaller Whisper model (medium), but the main problem lies in the diarization part. The separation into speaker 1 / speaker 2 is completely wrong. Do you think that using the smaller Whisper model could account for that?

@alexgo84 I just ran this in a Colab notebook set to GPU: Premium, using the large-v2 Whisper model. It took about 12 minutes for a 60-minute audio file and the transcription came out great.
[attached screenshot: image.png]
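
In case it saves someone time, the Colab setup is roughly this (the pip line is my guess at the dependencies, not the Space's exact requirements, and the audio file name is a placeholder):

```python
# !pip install -q git+https://github.com/openai/whisper.git pyannote.audio

import whisper

model = whisper.load_model("large-v2")    # needs a GPU runtime with enough VRAM
result = model.transcribe("meeting.mp3")  # hypothetical audio file
print(result["text"])
```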

@alaffere So I can confirm that using a different Whisper model size changes the final diarization output.
I'll mention that the videos I'm transcribing are mainly in Russian and Ukrainian; maybe that's why the diarization is far from perfect.

Hugging Face is too slow. It would be great if you could share your Colab notebook. Thank you so much!

Any chance you can tell me how you got this working locally? How did you add the token and how exactly do you run it (python app.py .... how do I pass in the audio file?)?

To find where to put the Hugging Face token, search the code for use_auth_token="-".
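
In pyannote's standard API the token is passed when the pipeline is loaded, so after replacing the "-" it should look roughly like this (the model name shown is the common public one; check which one the code actually loads):

```python
from pyannote.audio import Pipeline

# The token must belong to a Hugging Face account that has accepted
# the model's user conditions on the Hub.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_xxx",  # your token here
)
diarization = pipeline("audio.wav")
```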

I actually had some problems with CUDA compatibility and had to use the CPU. After installing the dependencies I wrote a small wrapper around the code for personal use.
You can find the code and instructions in a small GitHub repo, wrapped with a small backend. The prerequisites should be similar:

https://github.com/alexgo84/video-transcribe
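
The CPU fallback basically comes down to choosing the device at startup instead of assuming CUDA, roughly like this (a sketch, not the exact code from the repo; the input file is a placeholder):

```python
import torch
import whisper

# Fall back to CPU when CUDA isn't available or is incompatible.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
result = model.transcribe("input.wav")  # hypothetical input file
```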

I've also had great results with whisperX (with CUDA), which integrates pyannote and also does word-level alignment to produce high-quality subtitles. In this repo the mini project is wrapped with a simple Flask server (no auth or anything fancy):

https://github.com/alexgo84/whisperx-server
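
For reference, a bare-bones whisperX transcribe-and-align flow looks roughly like this (file name and model size are placeholders; check the whisperX README for the current API, since it changes between versions):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")  # placeholder file

# 1. Transcribe with a Whisper model
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Word-level alignment for accurate subtitle timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])
```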

Hello Alex!
Can you please share your Colab notebook?
