Can Whisper be used for real-time speech to text?

#76
by ImagineThis - opened

Hi. New to Hugging Face. Thanks to everyone who is working on this app. It's great, and I hope you continue working on it. It could help so many people.

I am trying to finish a PhD. When I started writing, I sustained a serious shoulder injury and got RSI too. This means I have to use speech to text to write 150k+ words in 6 months. Current speech-to-text apps for Mac are terrible because they force users to speak really slowly and clearly, or you get many errors.

My tests of your 30-second app based on Whisper amazed me. Incredible. Is it possible to create a real-time speech-to-text app using Whisper, like Dragon Dictate? Or is that not possible? If real-time isn't possible, would it be possible to create an app that lets people upload audio of a recorded voice for dictation, without any limit on length?

Thanks again for your work.

Hey @ImagineThis ! Welcome to the Hugging Face community! There's an updated Space here that can transcribe audio samples of arbitrary length: https://huggingface.co/spaces/sanchit-gandhi/whisper-large-v2

It's not real-time, but you don't have limits on your audio input length :) We're looking into ways of making this faster!
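
If you'd rather run something similar locally, here's a minimal sketch of chunked long-form transcription with the transformers pipeline (assuming transformers and ffmpeg are installed; the file name is a placeholder):

```python
# Minimal sketch of chunked long-form transcription with the Hugging Face
# transformers pipeline (assumes: pip install transformers, plus ffmpeg on
# the PATH for decoding; "lecture.mp3" is a placeholder file name).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,  # split long audio into 30 s windows internally
)

# Transcribe a file of arbitrary length; returns {"text": "..."}.
result = asr("lecture.mp3")
print(result["text"])
```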

Hope that helps!

FYI, I use Whisper cloud APIs for "kind of" real-time transcription. My short experience with them mirrors what the Whisper docs mention:

  1. Latency depends on the model. I HAVE to use large for multilingual support, and it is 3x-8x slower than small or tiny.
  2. Latency depends on the total number of emitted tokens, so you have to chunk your audio input (see the sketch below).
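
To illustrate point 2, here's a minimal sketch of that kind of chunking, run locally with the openai-whisper package rather than a cloud API (soundfile for decoding; the 30-second window and file name are just placeholders):

```python
# Illustrative sketch of chunking audio before transcription (assumes the
# openai-whisper and soundfile packages; 30 s chunk size is arbitrary).
import soundfile as sf
import whisper

CHUNK_SECONDS = 30

model = whisper.load_model("small")
# Whisper expects 16 kHz mono float32 audio.
audio, sr = sf.read("input.wav", dtype="float32")

texts = []
for start in range(0, len(audio), sr * CHUNK_SECONDS):
    chunk = audio[start:start + sr * CHUNK_SECONDS]
    # Fewer tokens per call keeps per-chunk latency low and predictable.
    texts.append(model.transcribe(chunk, fp16=False)["text"])

print(" ".join(t.strip() for t in texts))
```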

Is it English you prefer using?

@sanjaymk908 Yes, English. I've found a few projects that have attempted real-time, but they all run in the Terminal and require a good Nvidia GPU, which I don't have (I'm all Mac).

@sanchit-gandhi Thanks for the links. Hopefully we will get a proper real-time UI soon. It will put Dragon out of business.

Thanks both.

For English & Mac, this is a fantastic solution: https://github.com/ggerganov/whisper.cpp

I have validated its central claims with the small & medium Whisper models:

  1. The stripped-down models it creates and uses can run on a mobile device (not useful for your use case).
  2. Its latencies are staggeringly low and optimized for Macs.
  3. All this is delivered in two C++ files (easily customized).

I haven't tested the large models yet. To confirm you're indeed running the optimized build on your MacBook, check that stdout shows something like this (NEON, ARM_FMA, and BLAS enabled):
    system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
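
If you want to drive whisper.cpp from a script, here's a rough sketch that shells out to its CLI (this assumes you've already built the project and downloaded a ggml model; the binary, model, and audio paths are just examples):

```python
# Rough sketch: driving the whisper.cpp command-line tool from Python
# (assumes a built ./main binary and a downloaded ggml model; the paths
# are examples and depend on where you built/downloaded things).
import subprocess

result = subprocess.run(
    [
        "./main",
        "-m", "models/ggml-small.en.bin",  # ggml model for whisper.cpp
        "-f", "recording.wav",             # 16 kHz WAV input
        "--output-txt",                    # also write recording.wav.txt
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```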

You can maybe also try "streaming" mode to start transcribing as soon as the first token is ready: https://twitter.com/joao_gante/status/1641435222679130112

@sanchit-gandhi Thanks for this. I presume this is using Whisper, but it's still not really useful for day-to-day use on macOS.

Ideally, I want to press a button and, with a Whisper app loaded in the background, I can use speech to text in any application.

There are other efforts, and it's getting closer every day. I would even pay someone to make what I just described, and I bet millions of people would love it.

Hey @ImagineThis - you can try using this API which is more than 50x faster than OpenAI's model: https://huggingface.co/spaces/sanchit-gandhi/whisper-jax

You might hit a queue though if there are people in front of you wanting to use the demo.
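
If you'd rather skip the queue and run it yourself, the repo behind that Space exposes a pipeline class; here's a sketch based on its README (note the class name really is spelled FlaxWhisperPipline, and the audio file is a placeholder):

```python
# Sketch based on the whisper-jax README (assumes whisper-jax is installed
# from https://github.com/sanchit-gandhi/whisper-jax along with a working
# JAX install; "audio.mp3" is a placeholder file name).
from whisper_jax import FlaxWhisperPipline

pipeline = FlaxWhisperPipline("openai/whisper-large-v2")

# JIT-compiles on the first call, so the first file is slow, the rest fast.
outputs = pipeline("audio.mp3")
print(outputs["text"])
```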

Have you tried this for MacOS: https://support.apple.com/en-gb/guide/mac-help/mh40584/mac

It states:

On a Mac with Apple silicon, dictation requests are processed on your device in many languages; no internet connection is required.

Why not split the models by language, like the VOSK API does? It would increase loading speed significantly.

I have created a quick sample live transcription script if you're interested: https://github.com/gaborvecsei/whisper-live-transcription

It's a super basic idea, but based on a few subjective tests it looked all right, even with the tiny model:

1st second: [t, 0, 0, 0] --> "Hi"
2nd second: [t-1, t, 0, 0] --> "Hi I am"
3rd second: [t-2, t-1, t, 0] --> "Hi I am the one"
4th second: [t-3, t-2, t-1, t] --> "Hi I am the one and only Gabor"
5th second: [t, 0, 0, 0] --> "How" (the buffer resets here and the output starts on a new line)
6th second: [t-1, t, 0, 0] --> "How are"
etc.
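
In Python, the loop above boils down to roughly the following (a sketch of the idea, not the actual script from the repo; it assumes the openai-whisper and sounddevice packages, and the window/chunk sizes are illustrative):

```python
# Sketch of the sliding-window live transcription idea above (assumes the
# openai-whisper and sounddevice packages; stop with Ctrl-C).
import numpy as np
import sounddevice as sd
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 1     # record in 1-second steps
WINDOW_SECONDS = 4    # re-transcribe up to the last 4 seconds each step

model = whisper.load_model("tiny")
buffer = np.zeros(0, dtype=np.float32)

while True:
    # Record one chunk and append it to the rolling buffer.
    chunk = sd.rec(SAMPLE_RATE * CHUNK_SECONDS, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    buffer = np.concatenate([buffer, chunk.ravel()])

    # Once the window is full, restart the buffer and begin a new line.
    if len(buffer) > SAMPLE_RATE * WINDOW_SECONDS:
        buffer = chunk.ravel()
        print()

    result = model.transcribe(buffer, language="en", fp16=False)
    print(result["text"].strip(), end="\r", flush=True)
```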


Here's an updated guide for running the model using your microphone in real-time: https://huggingface.co/learn/audio-course/chapter7/voice-assistant#speech-transcription
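
The core of that guide condenses to a few lines (paraphrased from the course, assuming transformers and ffmpeg are installed; the checkpoint is the one the course uses):

```python
# Condensed from the linked audio-course chapter (assumes transformers and
# ffmpeg are installed; checkpoint and window sizes follow the course).
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

transcriber = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base.en"
)

# Stream microphone audio in 5-second windows, updated every second.
mic = ffmpeg_microphone_live(
    sampling_rate=transcriber.feature_extractor.sampling_rate,
    chunk_length_s=5.0,
    stream_chunk_s=1.0,
)

for item in transcriber(mic, generate_kwargs={"max_new_tokens": 128}):
    # Partial results are overwritten in place until the chunk is final.
    print(item["text"], end="\r")
    if not item["partial"][0]:
        break
```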

I had the same problem, so I've created a working proof of concept for real-time transcription with a WebSocket server and a demo JS client, if you want to check it out: https://github.com/alesaccoia/VoiceStreamAI
