About using the Whisper model as an API

#25
by sanjitaa - opened

I am trying to load the Whisper model (medium) on a server using a Django API and integrate it with the frontend. How can I do this efficiently to get a quick response (even when there are many concurrent users)?

In the Django ecosystem, you can use Celery to execute an STT task asynchronously and then deliver the result to the frontend through a WebSocket connection, for example.
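As a rough, untested sketch of that pattern (assuming openai-whisper and a Celery app already configured in the Django project; all names here are placeholders):

```python
# tasks.py -- minimal sketch; assumes `pip install openai-whisper celery`
import whisper
from celery import shared_task

_model = None

def _get_model():
    # Load the weights once per worker process and reuse them for every
    # task, so per-request latency is not dominated by model loading.
    global _model
    if _model is None:
        _model = whisper.load_model("medium")
    return _model

@shared_task
def transcribe(audio_path: str) -> str:
    # Runs inside the Celery worker, off the Django request/response cycle.
    return _get_model().transcribe(audio_path)["text"]
```

The Django view then only enqueues `transcribe.delay(audio_path)` and returns immediately; the result can be pushed to the browser over a WebSocket (e.g. with Django Channels) or fetched by polling on the task id.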

Can I also use two different models on the server side with Celery, one of them for the translation step? I just want to use two different models through the API and still get a fast response.

You can run separate workflows with separate models on separate GPUs using environment variables. That way you skip the model loading time and get a faster STT turnaround.
You can also look at the faster-whisper, WhisperX, or FrogBase (whisper-ui) projects on GitHub.
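For illustration, pinning each Celery worker to its own GPU could look like this (queue names, module paths, and the broker URL are placeholders, not part of any framework):

```python
# celery_app.py -- rough sketch of routing one model per GPU.
#
# Start one worker per GPU so each process loads its model once and only
# sees its own device:
#   CUDA_VISIBLE_DEVICES=0 celery -A myproject worker -Q stt
#   CUDA_VISIBLE_DEVICES=1 celery -A myproject worker -Q translation
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

# Route each task to the queue served by the worker holding that model.
app.conf.task_routes = {
    "myproject.tasks.transcribe": {"queue": "stt"},
    "myproject.tasks.translate": {"queue": "translation"},
}
```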

I want to use a Whisper model for transcribing and another model for translation. For the complete pipeline, the data should pass from the Whisper model to the other model to generate the translated output. Both models are available on Hugging Face, and I want good performance as well. How do I implement this through Hugging Face?

You can start from the "Deploy" menu button in the upper-right corner of each model page on HF.

Sanjitaa - if you are looking to transcribe and then subsequently translate the text, the second model does not have to be Whisper. In fact, if you're passing text from the first model to a second model, it likely makes sense to use T5 or another text-to-text sequence-to-sequence model.

I think in your playbook you would otherwise need to convert the text back to speech in order to get Whisper to process it a second time for the translation. Text-to-speech models don't seem to perform very well at this task, giving you outputs that don't make sense compared to what you fed the model initially.

Yes, I am using two different models: for transcribing I am using Whisper, and for translation I am using another model (like mBART). How can I do it through Hugging Face?

I wanted to use two different models (Whisper for transcription and mBART for text translation). The audio is passed through the Whisper model, the transcribed text from Whisper is passed through the mBART model, and the text is translated. I want to use both of these models through the Hugging Face platform. How can I achieve good performance with it? I also want to display the output in the frontend.

Can anyone help me with it?

This looks very similar to this guide, except the second model is text-to-text instead of text-to-speech.
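For reference, here is one way to wire that chain up with `transformers` pipelines; the mBART checkpoint and language codes below are assumptions, so swap in whatever you actually deploy:

```python
# Sketch of the speech -> text -> translated text chain.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
translator = pipeline(
    "translation",
    model="facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",   # assumed source language code
    tgt_lang="fr_XX",   # assumed target language code
)

def transcribe_and_translate(audio_path: str) -> str:
    text = asr(audio_path)["text"]                   # speech -> text
    return translator(text)[0]["translation_text"]   # text -> translation
```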

@sanchit-gandhi Can I use the API provided by the Hugging Face Space (after I deploy my model) in my project, so that it can be consumed by the frontend as well?
(Screenshot attached: Screenshot 2023-10-10 at 10.55.08.png)

Yes, you should be able to use the Gradio client this way! You can pass the input string as the path to your audio file. The client will send the audio to the Space, transcribe it, and return the text output to you. Let me know if you encounter any difficulties!
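For example (untested; the Space id and `api_name` are placeholders -- the "Use via API" page of your Space shows the real endpoint):

```python
# pip install gradio_client
from gradio_client import Client

client = Client("your-username/your-whisper-space")  # hypothetical Space id
result = client.predict(
    "path/to/audio.wav",   # path to a local audio file
    api_name="/predict",
)
print(result)  # transcription returned by the Space
```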

@sanchit-gandhi It works! Thank you.

@sanchit-gandhi Does it work for an application also? I am using this API to build an application.
