Can Whisper be used for real-time speech to text?

#76
by ImagineThis - opened

Hi. New to Hugging Face. Thanks to everyone who is working on this app. It's great, and I hope you continue working on it. It could help so many people.

I am trying to finish a PhD. When I started writing, I sustained a serious shoulder injury and got RSI too. This means I have to use speech to text to write 150k+ words in 6 months. Current speech-to-text apps for Mac are terrible because they force users to speak really slowly and clearly, or you get many errors.

My tests of your 30-second app based on Whisper amazed me. Incredible. Is it possible to create a real-time speech-to-text app using Whisper, like Dragon Dictate? Or is that not possible? If real-time isn't possible, would it be possible to create an app that lets people upload audio of a recorded voice for dictation, without any limit on length?

Thanks again for your work.

Hey @ImagineThis ! Welcome to the Hugging Face community! There's an updated Space here that can transcribe audio samples of arbitrary length: https://huggingface.co/spaces/sanchit-gandhi/whisper-large-v2

It's not real-time, but you don't have limits on your audio input length :) We're looking into ways of making this faster!
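
If you'd rather run something similar locally, here's a minimal sketch of chunked long-form transcription with the transformers pipeline (assuming transformers and ffmpeg are installed; the file name is a placeholder):

```python
# Minimal sketch of chunked long-form transcription with the Hugging Face
# transformers pipeline (assumes: pip install transformers, plus ffmpeg on
# the PATH for decoding; "lecture.mp3" is a placeholder file name).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,  # split long audio into 30 s windows internally
)

# Transcribe a file of arbitrary length; returns {"text": "..."}.
result = asr("lecture.mp3")
print(result["text"])
```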

Hope that helps!

FYI, I use Whisper cloud APIs for "kind of" real-time transcription. My short experience with them mirrors what the Whisper docs mention:

  1. Latency depends on the model. I HAVE to use large for multilingual support, and it is 3x-8x slower than small or tiny.
  2. Latency depends on the total number of emitted tokens, so you have to chunk your audio input (see the sketch below).
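
To illustrate point 2, here's a minimal sketch of that kind of chunking, run locally with the openai-whisper package rather than a cloud API (soundfile for decoding; the 30-second window and file name are just placeholders):

```python
# Illustrative sketch of chunking audio before transcription (assumes the
# openai-whisper and soundfile packages; 30 s chunk size is arbitrary).
import soundfile as sf
import whisper

CHUNK_SECONDS = 30

model = whisper.load_model("small")
# Whisper expects 16 kHz mono float32 audio.
audio, sr = sf.read("input.wav", dtype="float32")

texts = []
for start in range(0, len(audio), sr * CHUNK_SECONDS):
    chunk = audio[start:start + sr * CHUNK_SECONDS]
    # Fewer tokens per call keeps per-chunk latency low and predictable.
    texts.append(model.transcribe(chunk, fp16=False)["text"])

print(" ".join(t.strip() for t in texts))
```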

Is it English you prefer using?

@sanjaymk908 Yes, English. I've found a few projects that have attempted real-time, but they all run in the Terminal and require a good Nvidia GPU, which I don't have (I'm all Mac).

@sanchit-gandhi Thanks for the links. Hopefully we will get a proper real-time UI soon. It will put Dragon out of business.

Thanks both.

For English & Mac, this is a fantastic solution: https://github.com/ggerganov/whisper.cpp

I have validated its central claims with the small & medium Whisper models:

  1. The stripped-down models it creates and uses can run on a mobile device (not useful for your use case).
  2. Its latencies are staggeringly low and optimized for Macs.
  3. All this is delivered in two C++ files (easily customized).

I haven't tested the large models yet. To confirm you're indeed running the optimized build on your MacBook, check that stdout shows something like this (NEON, ARM_FMA, and BLAS enabled):
    system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
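
If you want to drive whisper.cpp from a script, here's a rough sketch that shells out to its CLI (this assumes you've already built the project and downloaded a ggml model; the binary, model, and audio paths are just examples):

```python
# Rough sketch: driving the whisper.cpp command-line tool from Python
# (assumes a built ./main binary and a downloaded ggml model; the paths
# are examples and depend on where you built/downloaded things).
import subprocess

result = subprocess.run(
    [
        "./main",
        "-m", "models/ggml-small.en.bin",  # ggml model for whisper.cpp
        "-f", "recording.wav",             # 16 kHz WAV input
        "--output-txt",                    # also write recording.wav.txt
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```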

You can maybe also try "streaming" mode to start transcribing as soon as the first token is ready: https://twitter.com/joao_gante/status/1641435222679130112

@sanchit-gandhi Thanks for this. I presume this is using Whisper, but it's still not really useful for day-to-day use on macOS.

Ideally, I want to press a button and, with a Whisper app loaded in the background, I can use speech to text in any application.

There are other efforts, and it's getting closer every day. I would even pay someone to make what I just described, and I bet millions of people would love it.

Hey @ImagineThis - you can try using this API which is more than 50x faster than OpenAI's model: https://huggingface.co/spaces/sanchit-gandhi/whisper-jax

You might hit a queue though if there are people in front of you wanting to use the demo.
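
If you'd rather skip the queue and run it yourself, the repo behind that Space exposes a pipeline class; here's a sketch based on its README (note the class name really is spelled FlaxWhisperPipline, and the audio file is a placeholder):

```python
# Sketch based on the whisper-jax README (assumes whisper-jax is installed
# from https://github.com/sanchit-gandhi/whisper-jax along with a working
# JAX install; "audio.mp3" is a placeholder file name).
from whisper_jax import FlaxWhisperPipline

pipeline = FlaxWhisperPipline("openai/whisper-large-v2")

# JIT-compiles on the first call, so the first file is slow, the rest fast.
outputs = pipeline("audio.mp3")
print(outputs["text"])
```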

Have you tried this for MacOS: https://support.apple.com/en-gb/guide/mac-help/mh40584/mac

It states:

On a Mac with Apple silicon, dictation requests are processed on your device in many languages; no internet connection is required.

Why not split the models by language, like the VOSK API does? It would increase loading speed significantly.

I have created a quick sample live transcription script if you're interested: https://github.com/gaborvecsei/whisper-live-transcription

It's a super basic idea, but based on a few subjective tests it looked all right, even with the tiny model:

1st second: [t, 0, 0, 0] --> "Hi"
2nd second: [t-1, t, 0, 0] --> "Hi I am"
3rd second: [t-2, t-1, t, 0] --> "Hi I am the one"
4th second: [t-3, t-2, t-1, t] --> "Hi I am the one and only Gabor"
5th second: [t, 0, 0, 0] --> "How" (the buffer resets here and the output starts on a new line)
6th second: [t-1, t, 0, 0] --> "How are"
etc.
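
In Python, the loop above boils down to roughly the following (a sketch of the idea, not the actual script from the repo; it assumes the openai-whisper and sounddevice packages, and the window/chunk sizes are illustrative):

```python
# Sketch of the sliding-window live transcription idea above (assumes the
# openai-whisper and sounddevice packages; stop with Ctrl-C).
import numpy as np
import sounddevice as sd
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 1     # record in 1-second steps
WINDOW_SECONDS = 4    # re-transcribe up to the last 4 seconds each step

model = whisper.load_model("tiny")
buffer = np.zeros(0, dtype=np.float32)

while True:
    # Record one chunk and append it to the rolling buffer.
    chunk = sd.rec(SAMPLE_RATE * CHUNK_SECONDS, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    buffer = np.concatenate([buffer, chunk.ravel()])

    # Once the window is full, restart the buffer and begin a new line.
    if len(buffer) > SAMPLE_RATE * WINDOW_SECONDS:
        buffer = chunk.ravel()
        print()

    result = model.transcribe(buffer, language="en", fp16=False)
    print(result["text"].strip(), end="\r", flush=True)
```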


Here's an updated guide for running the model using your microphone in real-time: https://huggingface.co/learn/audio-course/chapter7/voice-assistant#speech-transcription
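
The core of that guide condenses to a few lines (paraphrased from the course, assuming transformers and ffmpeg are installed; the checkpoint is the one the course uses):

```python
# Condensed from the linked audio-course chapter (assumes transformers and
# ffmpeg are installed; checkpoint and window sizes follow the course).
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

transcriber = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base.en"
)

# Stream microphone audio in 5-second windows, updated every second.
mic = ffmpeg_microphone_live(
    sampling_rate=transcriber.feature_extractor.sampling_rate,
    chunk_length_s=5.0,
    stream_chunk_s=1.0,
)

for item in transcriber(mic, generate_kwargs={"max_new_tokens": 128}):
    # Partial results are overwritten in place until the chunk is final.
    print(item["text"], end="\r")
    if not item["partial"][0]:
        break
```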

I had the same problem, so I've created a working proof of concept for real-time transcription with a WebSocket server and a demo JS client, if you want to check it out: https://github.com/alesaccoia/VoiceStreamAI
